Principal Component Analysis (PCA): A Practical Example

Let’s first define what a is PCA.

Principal Component Analysis or PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

In other words, Principal Component Analysis (PCA) is a technique to detect the main components of a data set in order to reduce into fewer dimensions retaining the relevant information.

To put an example, Let  X \in\mathbb{R}^{mxn} a data set with zero mean, that is, the matrix formed by n observations of m  variables. Where the elements of X  are denoted as usual by x_ij  meaning that it contains the value of the observable i  of the j-th  observation experiment.

A principal component is a linear combination of the variables so that maximizes the variance.

Let’s now see a PCA example step by step

1. Create a random toy data set

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

m1 = [4.,-1.]
s1 = [[1,0.9],[0.9,1]]
c1 = np.random.multivariate_normal(m1,s1,100)

Let’s plot the data set and compute the PCA. The red dots of the figure show below the considered data, the blue arrow shows the eigenvector of maximum eigenvalue.

vaps,veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)


PCA Closed Solution

Now that we have visualize it, let’s code the closed solution for the PCA

First step is to standardize the data. We are going to use Scikit-learn library.

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(c1)

Eigendecomposition – Computing Eigenvectors and Eigenvalues

The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec) - mean_vec)) / (X_std.shape[0]-1)
print('Covariance Matrix \n%s' %cov_mat)
Covariance Matrix 
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]

Let’s now print our Covariance Matrix

#Let's print our Covariance Matrix
print('NumPy Covariance Matrix: \n%s' %np.cov(X_std.T))
NumPy Covariance Matrix: 
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]

Now we perform an eigendecomposition on the covariance matrix

cov_mat = np.cov(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

[ 1.89522132  0.1249807 ]

Let’s sort the eigenvalues to see if everything is ok

# let's sort the eig values to see if everything is ok
for ev in eig_vecs:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')
Everything ok!

Now we need to make a list of the eigenvalue, eigenvectors tuples and sort them from high to low.

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
Eigenvalues in descending order:

Building a Projection Matrix

# Choose the "top 2" eigenvectors with the highest eigenvalues 
# we are going to use this values to matrix W.
matrix_w = np.hstack((eig_pairs[0][1].reshape(2,1), 

print('Matrix W:\n', matrix_w)
('Matrix W:\n', array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]]))

We will use this data to plot our output later so we can compare with a custom gradient descent approach.

There are several numerical techniques that allow to find a point x^* that corresponds too \nabla x, \lambda L (x^*, \lambda ^*) = 0 , the saddle point. One way to tackle the problem is to “construct a new function, related to the Lagrangian, that (ideally) has a minimum at (x^*, \lambda ^*)

This new function can be considered as ’distorting’ the Lagrangian at infeasible points so as to create a minimum at (x^*, \lambda ^*) . Unconstrained minimization techniques can then be applied to the new function. This approach can make it easier to guarantee convergence to a local solution, but there is the danger that the local convergence properties of the method can be damaged.

The ’distortion’ of the Lagrangian function can lead to a ’distortion’ in the Newton equations for the method. Hence the behavior of the method near the solution may be poor unless care is taken.” Another way to tackle the condition \nabla x, \lambda L (x, \lambda) = 0 is to maintain feasibility at every iteration. That is, to ensure that the updates xk follow the implicit curve h(x) = 0 . For the toy problem we are considering here it is relatively easy. Assume we start from a point x 0 that satisfies h(x 0 ) = 0 , that is it satisfies the constraint.

The algorithm can be summarized as follows:

  1. Compute the gradient \nabla L (x^k)  (observe that we compute the gradient of the Lagrangian with respect to x ).
  2. Compute an estimate of \lambda by computing the value of \lambda that minimizes \nabla L (x^k)^2 .
  3. Assume that the update is x^{k+1} = x^k - \alpha ^k \nabla L (x^k) . For each candidate update x k+1 , project it over the constraint h(x) = 0 . Find the α k value that decreases the L (x^{k+1}) with respect to \nabla L (x^k) .
  4. Goto step 1 and repeat until convergence.

Let’s now implement the KKT conditions to see if we are able to obtain the same result as the one obtained with the closed solution. We will use the projected gradient descent to obtain the solution.

Let’s A be our covariance matrix

# A is the covariance matrix of the considered data
A = np.cov(c1.T)

Now we set up our initial values

# Tolerance

# Initial alpha value (line search)

# Initial values of w. DO NOT CHOOSE w=(0,0)
w = np.array([1., 0.])

Now we compute the eigenvalues and eigenvectors

# let's see now the eigvals and eigvects

eig_vals, eig_vecs = np.linalg.eig(A)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Now, we compute the projection for the function w=w.T*w

#now let's compute the projection for the function. w = w.T*w
den = np.sqrt(,w))
w = w / den

Next step is to compute lambda

# now we calculate lambda
lam = ( (w.T,(A + w.T) ),w) / 2 *,w)

Let’s review our initial values

print "Initial values"
print "Lagrangian value =", lag
print " w =", w
print " x =", m1
print " y =", s1
Initial values
Lagrangian value = -0.858313040377
 w = [ 1.  0.]
 x = [4.0, -1.0]
 y = [[1, 0.9], [0.9, 1]]

Let’s now compute our function using gradient descent

# let's now compute the entire values for our function

while ((alpha > tol) and (cont < 100000)):
    cont = cont+1
    # Gradient of the Lagrangian
    grw = (w.T,(A + w.T) ) - 2 * lam * w.T
    # Used to know if we finished line search
    finished = 0
    while ((finished == 0) and (alpha > tol)):
        # Update
        aux_w = w - alpha * grw
        # Our Projection 
        den = np.sqrt(,aux_w))
        aux_w = aux_w / den

        # Compute new value of the Lagrangian.
        aux_lam = (,(A+w.T)),aux_w) / 2 * (aux_w.T,aux_w)
        aux_lag = (aux_w.T,,aux_w)) - lam * (,aux_w) - 1)
        # Check if this is a descent
        if aux_lag < lag:
            w = aux_w
            lam = aux_lam
            lag = aux_lag
            alpha = 1.0
            finished = 1
            alpha = alpha / 2.0

Let’s now review our final values

# Let's now review our final values!
print " Our Final Values"
print "  Number of iterations", cont
print "  Obtained values are w =", w
print "  Correct values are  w =", veps[idx]
print "  Eigenvectors are =", eig_vecs
 Our Final Values
  Number of iterations 22
  Obtained values are w = [ 0.71916397  0.69484041]
  Correct values are  w = [ 0.71916398 -0.6948404 ]
  Eigenvectors are = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]]

Let’s compare our new values vs the ones obtained by the closed solution

# Full comparition
print "  Gradient Descent values   w =", w
print "  PCA analysis approach     w =", matrix_w
print "  Closed Solution           w =", veps[idx]
print "  Closed Solution           w =", veps,vaps
  Gradient Descent values   w = [ 0.71916397  0.69484041]
  PCA analysis approach     w = [[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
  Closed Solution           w = [ 0.71916398 -0.6948404 ]
  Closed Solution           w = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]] [ 1.56340502  0.10299214]

Very close! Let’s print it to visualize the new values versus the ones obtaine with sci-kit learn

import seaborn as sns 


The post Principal Component Analysis (PCA): A Practical Example appeared first on 3Blades.

Source: 3blades – Principal Component Analysis (PCA): A Practical Example

Recent Free Online Books for Data Science

This is just a short list of a few books that I have have recently discovered online.

  • Model-Based Machine Learning – Chapters of this book become available as they are being written. It introduces machine learning via case studies instead of just focusing on the algorithms.
  • Foundations of Data Science – This is a much more academic-focused book which could be used at the undergraduate or graduate level. It covers many of the topics one would expect: machine learning, streaming, clustering and more.
  • Deep Learning Book – This book was previously available only in HTML form and not complete. Now, it is free and downloadable.

Source: 101 DS – Recent Free Online Books for Data Science

Becoming a Data Scientist Podcast Special Episode

The hosts of Becoming a Data Scientist podcast, Partially Derivative podcast, Adversarial Learning podcast, and some other awesome data people that do elections forecasting for their day jobs joined together for this talk about the US election and the subsequent major questions surrounding the predictions, since basically all of them heavily leaned toward a different overall outcome than we got. If you’re interested at all in data science surrounding political campaigns, this episode is a must-listen!

Episode Audio (mp3) – also available on iTunes, Stitcher, etc.
(note, there is no video for this episode)

On the panel:

Source: – Becoming a Data Scientist Podcast Special Episode

Data Science and the Perfect Team

Today, I am proud to welcome a guest post by Claire Gilbert, Data Analyst at Gongos. For more on Gongos, see the description at the end of the post.

It’s fair to say that for those who run in business intelligence circles, many admire the work of Fast Forward Labs CEO and Founder Hilary Mason. Perhaps what resonates most with her fans is the moniker she places on data scientists as being ‘awesome nerds’—those who embody the perfect skillsets of math and stats, coding, and communication. She asserts that these individuals have the technical expertise to not only conduct the really, really complex work—but also have the ability to explain the impact of that work to a non-technical audience.

As insights and analytics organizations strive to assemble their own group of ‘awesome nerds,’ there are two ways to consider Hilary’s depiction. Most organizations struggle by taking the first route—searching for those very expensive, highly rare unicorns—individuals that independently sit at this critical intersection of genius. Besides the fact that it would be even more expensive to clone these data scientists, there is simply not enough bandwidth in their day to fulfill on their awesomeness 24/7.

To quote Aristotle, one of the earliest scientists of our time, “the whole is greater than the sum of its parts,” which brings us to the notion of the team. Rather than seeking out those highly sought-after individuals with skills in all three camps, consider creating a collective of individuals with skills from each camp. After all, no one person can solve for the depth and breadth of an organization’s growing data science needs. It takes a specialist such as a mathematician to dive deep; as well as a multidisciplinary mind who can comprehend the breadth, to truly achieve the perfect team.

Awesome Nerds

Team Dynamics of the Data Kind

The ultimate charge for any data science team is to be a problem-solving machine—one that constantly churns in an ever-changing climate. Faced with an increasing abundance of data, which in turn gives rise to once-unanswerable business questions, has led clients to expect new levels of complexity in insights. This chain reaction brings with it a unique set of challenges not previously met by a prescribed methodology. As the sets of inputs become more diverse, so too should the skillsets to answer them. While all three characteristics of the ‘awesome nerd’ are indispensable, it’s the collective of ‘nerds’ that will become the driving force in today’s data world.
True to the construct, no two pieces should operate independent of the third. Furthermore, finding and honing balance within a data science team will result in the highest degree of accuracy and relevancy possible.
Let’s look at the makeup of a perfectly balanced team:

  • Mathematician/Statistician:
    This trained academic builds advanced models based on inputs, while understanding the theory and requirements for the results to be leveraged correctly.
  • Coder/Programmer:
    This hands-on ‘architect’ is in charge of cleaning, managing and reshaping data, as well as building simulators or other highly technical tools that result in user-friendly data.
  • Communicator/Content Expert:
    This business ‘translator’ applies an organizational lens to bring previous knowledge to the table in order to connect technical skill sets to client needs.

It’s the interdependence of these skillsets that completes the team and its ability to deliver fully on the promise of data:
A Mathematician/Statistician’s work relies heavily on the Coder/Programmer’s skills. The notion of garbage-in/garbage-out very much applies here. If the Coder hasn’t sourced and managed the data judiciously, the Mathematician cannot build usable models. Both then rely on the knowledge of the Communicator/Content Expert. Even if the data is perfect, and the results statistically correct, the output cannot be activated against unless it is directly relevant to the business challenge. Furthermore, teams out of balance will be faced with hurdles for which they are not adequately prepared, and output that is not adequately delivered.

To Buy or to Build?

In today’s world of high velocity and high volume of data, companies are faced with a choice. Traditional programmers like those who have coded surveys and collected data are currently integrated in the work streams of most insights organizations. However, many of them are not classically trained in math and/or statistics. Likewise, existing quantitative-minded, client-facing talents can be leveraged in the rebuilding of a team. Training either of these existing individuals who have a bent in math and/or stats is possible, yet is a time-intensive process that calls for patience. If organizations value and believe in their existing talent and choose to go this route, it will then point to the gaps that need to be filled—or bought—to build the ‘perfect’ team.
Organizations have long known the value of data, but no matter how large and detailed it gets, without the human dimension, it will fail to live up to its $30 billion valuation by 2019. The interpretation, distillation and curation of all kinds of data by a team in equilibrium will propel this growth and underscore the importance of data science.
Many people think Hilary’s notion of “awesome nerds” applies only to individuals. But in practice, we must realize this kind of market potential, the team must embody the constitution of awesomeness.
As organizations assemble and recruit teams, perhaps their mission statement quite simply should be…
If you can find the nerds, keep them, but in the absence of an office full of unicorns, create one.

About Gongos

Gongos, Inc. is a decision intelligence company that partners with Global 1000 corporations to help build the capability and competency in making great consumer-minded decisions. Gongos brings a consultative approach in developing growth strategies propelled by its clients’ insights, analytics, strategy and innovation groups.

Enlisting the multidisciplinary talents of researchers, data scientists and curators, the company fuels a culture of learning both internally and within its clients’ organizations. Gongos also works with clients to develop strategic frameworks to navigate the change required for executional excellence. It serves organizations in the consumer products, financial services, healthcare, lifestyle, retail, and automotive spaces.

Source: 101 DS – Data Science and the Perfect Team

Data Science Venn Diagram V 2.0

Drew Conway’s Data Science Venn Diagram, created in 2010, has proven to still be current. We did a reinterpretation of it with only slight updates to the terminology he first used to determine the combination of skills and expertise a Data Scientist requires.

Conway’s “Data Science Venn Diagram” characterizes Data scientists as people with a combination of skills and know-how in three categories:

1) Hacking

2) Math and statistics

3) Domain knowledge

We’ve updated this to be more specific:

1)   Computer science

2)   Math and statistics (no way around this one!)

3)   Subject matter expertise

The difficulty in defining these skills is that the split between substance and methodology is vague, and so it is not clear how to differentiate among hackers or computer science experts, statisticians, subject matter experts, their intersections and where data science fits.

What is clear from Conway’s or this updated diagram is that a Data scientist needs to learn a lot as he aspires to become a well-rounded, competent professional.

It is important to note that each of these skills are super valuable on their own, but when combined with only one other are not data science, or at worst even hazardous.

Conway has very specific thoughts on data science that are very different on what has already been discussed on the topic. On a recent interview, he said “To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods.”

On that matter, in Conway’s original Venn diagram, he came up with a Danger Zone (which we are calling Traditional Software in order to not sound as ominous).

On that area he placed people who know enough to be dangerous, and he saw it as the most problematic area of the diagram. In this area people are perfectly capable of extracting and structuring data, but they lack any understanding of what those coefficients mean. Either through ignorance or spite, this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there.

We can conclude that the Data Science Venn Diagram can help pinpoint these threatening combinations of skills and therefore, detect professionals who are not properly prepared for the challenge of doing excellent Data Science. It is indeed a great tool for understanding who could end up being great in their job, or just a threat in the road to the accomplishment of a specific task or to a company.

The post Data Science Venn Diagram V 2.0 appeared first on 3Blades.

Source: 3blades – Data Science Venn Diagram V 2.0

How to use Jupyter Notebook with Apache Spark

Jupyter Notebook (formerly known as IPython Notebook) is an interactive notebook environment which supports various programming languages which allows you to interact with your data, combine code with markdown text and perform simple visualizations.

Here are just a couple of reasons why using Jupyter Notebook with Spark is the best choice for users that wish to present their work to other team members or to the public in general:

  • Jupyter notebooks support tab autocompletion on class names, functions, methods and variables.
  • It offers more explicit and colour-highlighted error messages than the command line IPython console.
  • Jupyter notebook integrates with many common GUI modules like PyQt, PyGTK, tkinter and with a wide variety of data science packages.
  • Through the use of kernels, multiple languages are supported.

Using Jupyter notebook with Apache Spark is sometimes difficult to configure, particularly when dealing with different development environments. Apache Toree is our solution of choice to configure Jupyter notebooks to run with Apache Spark, it really helps simplify the installation and configuration steps. Default Toree installation works with Scala, although Toree does offer support for multiple kernels including PySpark. We will go over both configurations.

Note: 3Blades offers a pre-built Jupyter Notebook image already configured with PySpark. This tutorial is based in part on the work of Uri Laserson.

Apache Toree with Jupyter Notebook

This should be performed on the machine where the Jupyter Notebook will be executed. If it’s not run on a Hadoop node, then the Jupyter Notebook instance should have SSH access to the Hadoop node.

This guide is based on:

  • IPython 5.1.0
  • Jupyter 4.1.1
  • Apache Spark 2.0.0

The difference between ‘IPython’ and ‘Jupyter’ can be confusing. Basically, the Jupyter team has renamed ‘IPython Notebook’ as ‘Jupyter Notebook’, however the interactive shell is still known as ‘IPython’. Jupyter Notebook ships with IPython out of the box and as such IPython provides a native kernel spec for Jupyter Notebooks.

In this case, we are adding a new kernel spec, known as PySpark.

IPython, Toree and Jupyter Notebook

1) We recommended running Jupyter Notebooks within a virtual environment. This avoids breaking things on your host system. Assuming python / python3 are installed on your local environment, run:

$ pip install virtualenv

2) Then create a virtualenv folder and activate a session, like so:

$ virtualenv -p python3 env3
$ source env3/bin/activate

3) Install Apache Spark and set environment variable for SPARK_HOME. (The official Spark site has options to install bootstrapped versions of Spark for testing.) We recommend downloading the latest version, which as of this writing is Spark version 2.0.0 with Hadoop 2.7.

$ export SPARK_HOME="/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7/bin"

4) (Optional)Add additional ENV variables to your bash profile, or set them manually with exports:

$ echo "PATH=$PATH:/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7/bin" >> .bash_profile
$ echo "SPARK_HOME=/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7" >> .bash_profile

  • Note: substitute .bash_profile for .bashrc or .profile if using Linux.

4) Install Jupyter Notebook, which will also confirm and install needed IPython dependencies:

$ pip install jupyter

5) Install Apache Toree:

$ pip install toree

6) Configure Apache Toree installation with Jupyter:

$ jupyter toree install --spark_home=$SPARK_HOME

7) Confirm installation:

$ jupyter kernelspec list
Available kernels:
python3 /Users/myuser/Library/Jupyter/kernels/python3
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala

Launch Jupyter Notebook:

$ jupyter notebook


Direct your browser to http://localhost:8888/, which should show the main access point to the notebooks. You should see a Apache Toree – Scala kernel option:



Example IPython Notebook running with PySpark

IPython, Toree and Jupyter Notebook + PySpark 

Apache Toree supports multiple IPython kernels, including Python via PySpark. The beauty of Apache Toree is that it greatly simplifies adding new kernels with the –interpreters argument.

$ jupyter toree install --interpreters=PySpark

SPARK_HOME ENV var will take precedence over –spark_home argument. Otherwise you can use –spark_home to set your Spark location path:

$ jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark

Verify install. Notice how the second to last line confirms PySpark installation:

$ jupyter kernelspec list
Available kernels:
python3 /Users/myuser/Library/Jupyter/kernels/python3
apache_toree_pyspark /usr/local/share/jupyter/kernels/apache_toree_pyspark
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala


Now, when running Jupyter notebook, you will be able to use Python within a Spark context:


The post How to use Jupyter Notebook with Apache Spark appeared first on 3Blades.

Source: 3blades – How to use Jupyter Notebook with Apache Spark

Data Science Ethical Framework

The UK government has taken the first step in providing a solid grounding for the future of data science ethics. Recently, they published a “beta” version of the Data Science Ethical Framework.

The framework is based around 6 clear principles:

  1. Start with clear user need and public benefit
  2. Use data and tools which have the minimum intrusion necessary
  3. Create robust data science models
  4. Be alert to public perceptions
  5. Be as open and accountable as possible
  6. Keep data secure

See the above link for further details. The framework is somewhat specific to the UK, but it would be nice to see other countries/organizations adopt a similar framework. Even DJ Patil, U.S. Chief Data Scientist, has stated the importance of ethics in all data science curriculum.

Source: 101 DS – Data Science Ethical Framework

What Will Be The Key Deep Learning Breakthrough in 2016?

Google’s work in artificial intelligence is impressive. It includes networks of hardware and software that are very similar to the system of neurons in the human brain. By analyzing huge amounts of data, the neural nets can learn all sorts of tasks, and, in some cases like with AlphaGo, they can learn a task so well that they beat humans. They can also do it better and in a bigger scale.

AI seems to be the future of Google Search and of the technology world in general. This specific method, called deep learning, is reinventing many of the Internet’s most popular and interesting services.

Google, during its conception and growth, has relied predominantly on algorithms that followed exact rules set by programmers (think ‘if this then that’ rules). Even with that apparent reassurance of human control, there’s still some concerns about the world of machine learning because even the experts don’t fully understand how neural networks work. However, in recent years great strides have been made in understanding the human brain and thus how neural networks could be wired. If you feed enough photos of a dog into a neural net, it is able to learn to identify a dog. In some cases, a neural net can handle queries better than algorithms hand-coded by humans. Artificial intelligence is the future of Google Search and that means it’s probably a big influencer of everything else.

Based on Google Search advances and AI like AlphaGo, experts expect to see:

  • More radical deep learning architectures
  • Better integration of symbolic and subsymbolic systems
  • Expert dialogue systems

And with AI finally dominating the game of Go:

  • Deep learning for more intricate robotic planning and motor control
  • High-quality video summarization
  • More creative and higher-resolution dreaming

Experts consider these methods can accelerate scientific research per se. The idea of having scientists working alongside artificially intelligent systems that can hone in on areas of research is not a farfetched idea anymore. It might happen soon and 2016 looks like a good year for it.

The post What Will Be The Key Deep Learning Breakthrough in 2016? appeared first on 3Blades.

Source: 3blades – What Will Be The Key Deep Learning Breakthrough in 2016?

The Weisfeiler-Lehman algorithm and estimation on graphs


Imagine you have two graphs (G) and (G’) and you’d like to check how similar they are. If all vertices have unique attributes this is quite easy:

FOR ALL vertices (v in G cup G’) DO

  • Check that (v in G) and that (v in G’)
  • Check that the neighbors of v are the same in (G) and (G’)

This algorithm can be carried out in linear time in the size of the graph, alas many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e. the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In this case the above algorithm cannot be used since we have no idea which vertices we should match up.

The Weisfeiler-Lehman algorithm is a mechanism for assigning fairly unique attributes efficiently. Note that it isn’t guaranteed to work, as discussed in this paper by Douglas – this would solve the graph isomorphism problem after all. The idea is to assign fingerprints to vertices and their neighborhoods repeatedly. We assume that vertices have an attribute to begin with. If they don’t then simply assign all of them the attribute 1. Each iteration proceeds as follows:

FOR ALL vertices (v in G) DO

  • Compute a hash of ((a_v, a_{v_1}, ldots, a_{v_n})) where (a_{v_i}) are the attributes of the neighbors of vertex (v).
  • Use the hash as vertex attribute for v in the next iteration.

The algorithm terminates when this iteration has converged in terms of unique assignments of hashes to vertices.

Note that it is not guaranteed to work for all graphs. In particular, it fails for graphs with a high degree of symmetry, e.g. chains, complete graphs, tori and stars. However, whenever it converges to a unique vertex attribute assignment it provides a certificate for graph isomorphism. Moreover, the sets of vertex attributes can be used to show that two graphs are not isomorphic (it suffices to verify that the sets differ at any stage).

Shervashidze et al. 2012 use this idea to define a similarity measure between graphs. Basically the idea is that graphs are most similar if many of their vertex identifiers match since this implies that the associated subgraphs match. Formally they compute a kernel using

$$k(G,G’) = sum_{i=1}^d lambda_d sum_{v in V} sum_{v’ in V’} delta(a(v,i), a(v’,i))$$

Here (a(v,i)) denote the vertex attribute of (v) after WL iteration (i). Morevoer, (lambda_i) are nonnegative coefficients that weigh how much the similarity at level (i) matters. Rather than a brute-force computation of the above test for equality we can sort vertex attribute sets. Note that vertices that have different attributes at any given iteration will never have the same attribute thereafter. This means that we can compare the two sets at all depths at at most (O(d cdot (|V| + |V’|))) cost.

A similar trick is possible if we want to regress between vertices on the same graph since we can use the set of attributes that a vertex obtains during the iterations as features. Finally, we can make our life even easier if we don’t compute kernels at all and use a linear classifier on the vertex attributes directly.

Source: Adventures in Data Land


Beware the bandwidth gap – speeding up optimization

Disks are slow and RAM is fast. Everyone knows that. But many optimization algorithms don’t take advantage of this. More to the point, disks currently stream at about 100-200 MB/s, solid state drives stream at over 500 MB/s with 1000x lower latency than disks, and main memory reigns supreme at about 10-100 GB/s bandwidth (depending on how many memory banks you have). This means that it is 100 times more expensive to retrieve instances from disk rather than recycling them once they’re already in memory. CPU caches are faster yet with 100-1000 GB/s of bandwidth. Everyone knows this. If not, read Jeff Dean’s slides. Page 13 is pure gold.

Ok, so what does this mean for machine learning? If you can keep things in memory, you can do things way faster. This is the main idea behind Spark. It’s a wonderful alternative to Hadoop. In other words, if your data fits into memory, you’re safe and you can process data way faster. A lot of datasets that are considered big in academia fit this bill. But what about real big data? Essentially you have two options – have the systems designer do the hard work or change your algorithm. This post is about the latter. And yes, there’s a good case to be made about who should do the work: the machine learners or the folks designing the computational infrastructure (I think it’s both).

So here’s the problem: Many online algorithms load data from disk, stream it through memory as efficiently as possible and discard it after seeing it once, only to pick it up later for another pass through the data. That is, these algorithms are disk bound rather than CPU bound. Several solvers try to address this by making the disk representation more efficient, e.g. Liblinear or VowpalWabbit, both of which user their own internal representation for efficiency. While this still makes for quite efficient code that can process up to 3TB of data per hour in any given pass, main memory is still much faster. This has led to the misconception that many machine learning algorithms are disk bound. But, they aren’t …

What if we could re-use data that’s in memory? For instance, use a ringbuffer where the disk writes into it (much more slowly) and the CPU reads from it (100 times more rapidly). The problem is what to do with an observation that we’ve already processed. A naive strategy would be to pretend that it is a new instance, i.e. we could simply update on it more than once. But this is very messy since we need to keep track of how many times we’ve seen the instance before, and it creates nonstationarity in the training set.

A much cleaner strategy is to switch to dual variables, similar to the updates in the Dualon of Shalev-Shwartz and Singer. This is what Shin Matsushima did in our dual cached loops paper. Have a look at StreamSVM here. Essentially, it keeps data in memory in a ringbuffer and updates the dual variables. This way, we’re guaranteed to make progress at each step, even if we’re revisiting the same observation more than once. To see what happens have a look at the graph below:

It’s just as fast as LibLinear provided that it’s all in memory. Algorithmically, what happens in the SVM case is that one updates the Lagrange multipliers (alpha_i), while simultaneously keeping an estimate of the parameter vector (w) available.

That said, this strategy is more general: reuse data several times for optimization while it is in memory. If possible, perform successive updates by changing variables of an optimization that is well-defined regardless of the order in which (and how frequently) data is seen.

Source: Adventures in Data Land