Principal Component Analysis (PCA): A Practical Example

Let’s first define what PCA is.

Principal Component Analysis or PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

In other words, Principal Component Analysis (PCA) is a technique for detecting the main components of a data set in order to reduce it to fewer dimensions while retaining the relevant information.

As an example, let X \in \mathbb{R}^{m \times n} be a data set with zero mean, that is, the matrix formed by n observations of m variables. The elements of X are denoted as usual by x_{ij}, meaning that x_{ij} contains the value of variable i in the j-th observation.

A principal component is a linear combination of the variables that maximizes the variance.
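In symbols: with the observations stored in the columns of the zero-mean matrix X, the first principal component is the unit vector w_1 that maximizes the variance of the projected data,

```latex
w_1 \;=\; \underset{\|w\|=1}{\arg\max}\ \operatorname{Var}(w^{\top} X) \;=\; \underset{\|w\|=1}{\arg\max}\ w^{\top}\Sigma\,w ,
```

where \Sigma is the covariance matrix of the data; the maximizer is the eigenvector of \Sigma with the largest eigenvalue, which is exactly what we compute below.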

Let’s now see a PCA example step by step.

1. Create a random toy data set

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

m1 = [4.,-1.]
s1 = [[1,0.9],[0.9,1]]
c1 = np.random.multivariate_normal(m1,s1,100)

Let’s plot the data set and compute the PCA. The red dots in the figure below show the data; the blue arrow shows the eigenvector with the largest eigenvalue.

vaps,veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)
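The plotting code itself did not survive in the post; here is a minimal sketch of the described figure (red dots for the data, a blue arrow for the leading eigenvector). It re-creates the toy data so it runs standalone; the random seed is an assumption, since the original did not fix one.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Re-create the toy data from above (the seed is an assumption)
np.random.seed(0)
m1 = [4., -1.]
s1 = [[1, 0.9], [0.9, 1]]
c1 = np.random.multivariate_normal(m1, s1, 100)

vaps, veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)

# Red dots: the data; blue arrow: the eigenvector of maximum eigenvalue
plt.scatter(c1[:, 0], c1[:, 1], color='red', s=10)
mean = c1.mean(axis=0)
arrow = veps[:, idx] * np.sqrt(vaps[idx])  # scale by one standard deviation
plt.quiver(mean[0], mean[1], arrow[0], arrow[1],
           angles='xy', scale_units='xy', scale=1, color='blue')
plt.axis('equal')
plt.savefig('pca_toy.png')
```

Note that `np.linalg.eig` returns the eigenvectors as the *columns* of `veps`, so the leading eigenvector is `veps[:, idx]`.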


PCA Closed Solution

Now that we have visualized it, let’s code the closed-form solution for PCA.

The first step is to standardize the data. We are going to use the scikit-learn library.

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(c1)

Eigendecomposition – Computing Eigenvectors and Eigenvalues

The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0]-1)
print('Covariance Matrix \n%s' %cov_mat)
Covariance Matrix 
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]

Let’s cross-check it against NumPy’s built-in covariance function.

# Cross-check with NumPy's built-in covariance function
print('NumPy Covariance Matrix: \n%s' %np.cov(X_std.T))
NumPy Covariance Matrix: 
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]

Now we perform an eigendecomposition on the covariance matrix

cov_mat = np.cov(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

[ 1.89522132  0.1249807 ]
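A quick aside the walkthrough skips: the eigenvalues can be converted into an explained-variance ratio to quantify how much of the total variance each component captures. A small sketch using the eigenvalues printed above:

```python
import numpy as np

# Eigenvalues as printed above
eig_vals = np.array([1.89522132, 0.1249807])

# Fraction of the total variance captured by each principal component
var_exp = eig_vals / eig_vals.sum()
cum_var_exp = np.cumsum(var_exp)

print('Explained variance ratio: %s' % var_exp)  # the first component carries ~94% of the variance
print('Cumulative: %s' % cum_var_exp)
```

This is why projecting onto the first component alone already preserves most of the structure of this strongly correlated data set.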

Let’s check that the eigenvectors all have unit norm.

# every eigenvector should have unit norm
for ev in eig_vecs.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')
Everything ok!

Now we make a list of (eigenvalue, eigenvector) tuples and sort them from high to low.

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
Eigenvalues in descending order:

Building a Projection Matrix

# Choose the "top 2" eigenvectors with the highest eigenvalues;
# we use these values to build the projection matrix W.
matrix_w = np.hstack((eig_pairs[0][1].reshape(2,1),
                      eig_pairs[1][1].reshape(2,1)))

print('Matrix W:\n%s' % matrix_w)
Matrix W:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

We will use this data to plot our output later so we can compare with a custom gradient descent approach.
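One step the post never shows is the projection itself: multiplying the standardized data by W maps each observation onto the principal-component axes. A minimal sketch with stand-in data (the real `X_std` comes from the StandardScaler step above):

```python
import numpy as np

np.random.seed(1)
# Stand-in for the standardized 100x2 data from above
X_std = np.random.randn(100, 2)
matrix_w = np.array([[0.70710678, -0.70710678],
                     [0.70710678,  0.70710678]])

# One row per observation, one column per principal component
Y = X_std.dot(matrix_w)
print(Y.shape)
```

To reduce to a single dimension, keep only the first column of `matrix_w` before projecting.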

There are several numerical techniques that allow us to find a point x^* satisfying \nabla_{x,\lambda} L(x^*, \lambda^*) = 0, the saddle point. One way to tackle the problem is to “construct a new function, related to the Lagrangian, that (ideally) has a minimum at (x^*, \lambda^*)

This new function can be considered as ’distorting’ the Lagrangian at infeasible points so as to create a minimum at (x^*, \lambda ^*) . Unconstrained minimization techniques can then be applied to the new function. This approach can make it easier to guarantee convergence to a local solution, but there is the danger that the local convergence properties of the method can be damaged.

The ’distortion’ of the Lagrangian function can lead to a ’distortion’ in the Newton equations for the method. Hence the behavior of the method near the solution may be poor unless care is taken.” Another way to tackle the condition \nabla_{x,\lambda} L(x, \lambda) = 0 is to maintain feasibility at every iteration, that is, to ensure that the updates x^k follow the implicit curve h(x) = 0. For the toy problem we are considering here this is relatively easy. Assume we start from a point x^0 that satisfies h(x^0) = 0, that is, it satisfies the constraint.

The algorithm can be summarized as follows:

  1. Compute the gradient \nabla L(x^k) (observe that we compute the gradient of the Lagrangian with respect to x).
  2. Compute an estimate of \lambda as the value of \lambda that minimizes \|\nabla L(x^k)\|^2.
  3. The update is x^{k+1} = x^k - \alpha^k \nabla L(x^k). For each candidate update x^{k+1}, project it onto the constraint h(x) = 0. Find the \alpha^k value that decreases L(x^{k+1}) with respect to L(x^k).
  4. Goto step 1 and repeat until convergence.

Let’s now implement the KKT conditions to see if we are able to obtain the same result as the one obtained with the closed solution. We will use the projected gradient descent to obtain the solution.

Let A be our covariance matrix.

# A is the covariance matrix of the considered data
A = np.cov(c1.T)

Now we set up our initial values

# Tolerance (the original value was lost; 1e-8 is an assumed, reasonable choice)
tol = 1e-8

# Initial alpha value (line search); the loop resets it to 1.0 after each accepted step
alpha = 1.0

# Iteration counter
cont = 0

# Initial values of w. DO NOT CHOOSE w=(0,0)
w = np.array([1., 0.])

Now we compute the eigenvalues and eigenvectors

# let's see now the eigvals and eigvects

eig_vals, eig_vecs = np.linalg.eig(A)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Now we normalize w so that it satisfies the constraint w^T w = 1.

# project w onto the constraint w.T*w = 1
den = np.sqrt(np.dot(w.T, w))
w = w / den

The next step is to compute lambda and the initial value of the Lagrangian. Note that we minimize the negative of the variance, so that the minimum of the Lagrangian corresponds to the direction of maximum variance.

# now we calculate lambda and the initial Lagrangian
# (we minimize L(w, lam) = -w'Aw - lam*(w'w - 1), i.e. we maximize the variance)
lam = -np.dot(np.dot(w.T, (A + A.T)), w) / (2 * np.dot(w.T, w))
lag = -np.dot(w.T, np.dot(A, w)) - lam * (np.dot(w.T, w) - 1)

Let’s review our initial values

print("Initial values")
print("Lagrangian value =", lag)
print(" w =", w)
print(" x =", m1)
print(" y =", s1)
Initial values
Lagrangian value = -0.858313040377
 w = [ 1.  0.]
 x = [4.0, -1.0]
 y = [[1, 0.9], [0.9, 1]]

Let’s now run the projected gradient descent.

# let's now run the projected gradient descent loop

while ((alpha > tol) and (cont < 100000)):
    cont = cont + 1
    # Gradient of the Lagrangian with respect to w
    grw = -np.dot((A + A.T), w) - 2 * lam * w
    # Used to know if we finished line search
    finished = 0
    while ((finished == 0) and (alpha > tol)):
        # Update
        aux_w = w - alpha * grw
        # Project the candidate back onto the constraint w.T*w = 1
        den = np.sqrt(np.dot(aux_w.T, aux_w))
        aux_w = aux_w / den

        # Compute the new value of lambda and of the Lagrangian
        aux_lam = -np.dot(np.dot(aux_w.T, (A + A.T)), aux_w) / (2 * np.dot(aux_w.T, aux_w))
        aux_lag = -np.dot(aux_w.T, np.dot(A, aux_w)) - aux_lam * (np.dot(aux_w.T, aux_w) - 1)
        # Check if this is a descent; if so accept the step, otherwise halve alpha
        if aux_lag < lag:
            w = aux_w
            lam = aux_lam
            lag = aux_lag
            alpha = 1.0
            finished = 1
        else:
            alpha = alpha / 2.0

Let’s now review our final values

# Let's now review our final values!
print(" Our Final Values")
print("  Number of iterations", cont)
print("  Obtained values are w =", w)
print("  Correct values are  w =", veps[:, idx])  # eigenvectors are the columns of veps
print("  Eigenvectors are =", eig_vecs)
 Our Final Values
  Number of iterations 22
  Obtained values are w = [ 0.71916397  0.69484041]
  Correct values are  w = [ 0.71916398  0.6948404 ]
  Eigenvectors are = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]]

Let’s compare our new values vs the ones obtained by the closed solution

# Full comparison
print("  Gradient Descent values   w =", w)
print("  PCA analysis approach     w =", matrix_w)
print("  Closed Solution           w =", veps[:, idx])
print("  Closed Solution           w =", veps, vaps)
  Gradient Descent values   w = [ 0.71916397  0.69484041]
  PCA analysis approach     w = [[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
  Closed Solution           w = [ 0.71916398  0.6948404 ]
  Closed Solution           w = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]] [ 1.56340502  0.10299214]

Very close! Let’s plot it to visualize the new values versus the ones obtained with scikit-learn.

import seaborn as sns 
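The final plotting code is missing from the post; here is a minimal sketch of the intended comparison figure. The gradient-descent result is hard-coded to the value reported above, and the closed solution is recomputed from fresh toy data (the seed is an assumption), so the directions may differ slightly from the printed run.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# Re-create the toy data (the seed is an assumption)
np.random.seed(0)
c1 = np.random.multivariate_normal([4., -1.], [[1, 0.9], [0.9, 1]], 100)

vaps, veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)
w_closed = veps[:, idx]                    # closed-form leading eigenvector
w_gd = np.array([0.71916397, 0.69484041])  # value reported by gradient descent above

# Data plus the two estimated principal directions, drawn from the sample mean
plt.scatter(c1[:, 0], c1[:, 1], color='red', s=10)
mean = c1.mean(axis=0)
for w, color, label in [(w_closed, 'blue', 'closed solution'),
                        (w_gd, 'green', 'gradient descent')]:
    plt.plot([mean[0], mean[0] + w[0]], [mean[1], mean[1] + w[1]],
             color=color, label=label, linewidth=2)
plt.legend()
plt.axis('equal')
plt.savefig('pca_comparison.png')
```

The two lines should be nearly indistinguishable, confirming that the projected gradient descent recovered the leading eigenvector.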


The post Principal Component Analysis (PCA): A Practical Example appeared first on 3Blades.

Source: 3blades – Principal Component Analysis (PCA): A Practical Example

Data Science Venn Diagram V 2.0

Drew Conway’s Data Science Venn Diagram, created in 2010, has proven to still be current. We did a reinterpretation of it with only slight updates to the terminology he first used to determine the combination of skills and expertise a Data Scientist requires.

Conway’s “Data Science Venn Diagram” characterizes Data scientists as people with a combination of skills and know-how in three categories:

1) Hacking

2) Math and statistics

3) Domain knowledge

We’ve updated this to be more specific:

1)   Computer science

2)   Math and statistics (no way around this one!)

3)   Subject matter expertise

The difficulty in defining these skills is that the split between substance and methodology is vague, and so it is not clear how to differentiate among hackers or computer science experts, statisticians, subject matter experts, their intersections and where data science fits.

What is clear from Conway’s or this updated diagram is that a Data scientist needs to learn a lot as he aspires to become a well-rounded, competent professional.

It is important to note that each of these skills is super valuable on its own, but when combined with only one other they do not make data science, and at worst can even be hazardous.

Conway has very specific thoughts on data science that are very different from what has already been discussed on the topic. In a recent interview, he said “To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods.”

On that matter, in Conway’s original Venn diagram, he came up with a Danger Zone (which we are calling Traditional Software in order to not sound as ominous).

On that area he placed people who know enough to be dangerous, and he saw it as the most problematic area of the diagram. In this area people are perfectly capable of extracting and structuring data, but they lack any understanding of what those coefficients mean. Either through ignorance or spite, this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there.

We can conclude that the Data Science Venn Diagram can help pinpoint these threatening combinations of skills and therefore, detect professionals who are not properly prepared for the challenge of doing excellent Data Science. It is indeed a great tool for understanding who could end up being great in their job, or just a threat in the road to the accomplishment of a specific task or to a company.


Source: 3blades – Data Science Venn Diagram V 2.0

What Will Be The Key Deep Learning Breakthrough in 2016?

Google’s work in artificial intelligence is impressive. It includes networks of hardware and software that are very similar to the system of neurons in the human brain. By analyzing huge amounts of data, the neural nets can learn all sorts of tasks, and, in some cases like with AlphaGo, they can learn a task so well that they beat humans. They can also do it better and in a bigger scale.

AI seems to be the future of Google Search and of the technology world in general. This specific method, called deep learning, is reinventing many of the Internet’s most popular and interesting services.

Google, during its conception and growth, has relied predominantly on algorithms that followed exact rules set by programmers (think ‘if this then that’ rules). Even with that apparent reassurance of human control, there’s still some concerns about the world of machine learning because even the experts don’t fully understand how neural networks work. However, in recent years great strides have been made in understanding the human brain and thus how neural networks could be wired. If you feed enough photos of a dog into a neural net, it is able to learn to identify a dog. In some cases, a neural net can handle queries better than algorithms hand-coded by humans. Artificial intelligence is the future of Google Search and that means it’s probably a big influencer of everything else.

Based on Google Search advances and AI like AlphaGo, experts expect to see:

  • More radical deep learning architectures
  • Better integration of symbolic and subsymbolic systems
  • Expert dialogue systems

And with AI finally dominating the game of Go:

  • Deep learning for more intricate robotic planning and motor control
  • High-quality video summarization
  • More creative and higher-resolution dreaming

Experts believe these methods can accelerate scientific research itself. The idea of having scientists work alongside artificially intelligent systems that can hone in on areas of research is no longer farfetched. It might happen soon, and 2016 looks like a good year for it.


Source: 3blades – What Will Be The Key Deep Learning Breakthrough in 2016?

30 Must Read Books in Analytics / Data Science

So many pages have been dedicated to Data Science that it can be hard to pinpoint the best books among the sea of available content. However, we have compiled our own list and perhaps it would be a good source of reference for you too.

This is not a definitive list of all the books that you would probably have to read during your career as a Data Scientist, but it definitely includes classics, beginners books, specialist books (more related to the business of data science or team-building) and of course, some good ones that explain the complexities of certain programs, languages or processes.

So, bring it on! Find yourself a comfortable reclining chair or a desk, good reading glasses (if needed) and a peaceful mindset to cultivate your data-driven mind.


Source: 3blades – 30 Must Read Books in Analytics / Data Science

What Industries Will Be Next to Adopting Data Science?

It’s no surprise that data science will surely spread to more industries in the next couple of years. So, which of them would probably be the next ones to hire more data scientists and benefit from big data? 

We looked at five very different businesses that are starting to benefit, or could benefit, from data science, and how exactly big data can help them achieve success in their fields.

Data Science in Sports

1) Sports

If you saw the movie Moneyball you might know why big data is important to baseball and sports in general. Nowadays, for instance, many NBA teams collect millions of data records per game using cameras installed in the courts. The ultimate goal for all these sports teams is to improve health and safety, and thus performance of the team and individual athletes. In the same way that businesses seek to use data to customize their operations, it’s easy to see how these two worlds can crossover to benefit the sports world.

Data Science in On-Demand Services

2) On-demand services

Uber gets attention for growth and success that came mainly because of how the company uses data. The Uber experience relies on data science and algorithms, so this is a clear example of how on-demand services can benefit from big data. Uber continues to succeed because of the convenience that its data-driven product provides. Other on-demand services should look to Uber’s example for their own good and rely more on data science.

Data Science in Entertainment Industry

3) Entertainment industry

In this era of connected consumers, media and entertainment businesses must do more than simply being digital to compete. Data science already allows some organizations to understand their audience.

A once content-centric model is turning into a consumer-centric one. The entertainment industry is prepared to capitalize on this trend by converting information into insight that boosts production and cross-channel distribution. From now on it can be expected that those who provide a unique audience experience will be the only ones to achieve growth.

Data Science in Real Estate

4) Real estate agents

We continue hearing that the housing market is unpredictable, however some of the top real estate agents claim they saw the housing bubble burst coming way back (think again of movies, exactly like in The Big Short). It’s easy to obtain this information from following data and trend spotting. This is a great way for this volatile industry to be prepared for market shifts.

Data Science in Food Industry

5 ) Restaurant owners 

This business field is the epitome of how important it is to be able to tell what customers want. According to the Washington, D.C.-based National Restaurant Association, restaurants face another big obstacle besides rent, licensing and personnel: critics, not only professionals but amateurs who offer their opinions on social media. The importance of quality is the reason why restaurants are beginning to use big data to understand customer preferences and to improve their food and service.


Source: 3blades – What Industries Will Be Next to Adopting Data Science?

What is needed to build a data science team from the ground up?

What specific roles would a data science team need to have to be successful? Some will depend on the organization’s objectives, but there’s a consensus that the following positions are key.

  1. Data scientist. This role should be held by someone who can work on large datasets (on Hadoop/Spark) with machine learning algorithms, who can also create predictive models and interpret and explain model behavior in layman’s terms. This position requires excellent knowledge of SQL and an understanding of at least one programming language for predictive data analysis, such as R and/or Python.
  2. Data engineer / Data software developer. Requires great knowledge of distributed programming, including infrastructure and architecture. The person hired for this position should be very comfortable with installation of distributed programming frameworks like Hadoop MapReduce/Spark clusters, should be able to code in more than one programming language like Scala/Python/Java, and should know Unix scripting and SQL. This role can also evolve into one of two specialized roles:
    1. Data solutions architect.  Basically a data engineer with an ample range of experience across several technologies and who has great understanding of service-oriented architecture concepts and web applications.
    2. Data platform administrator. This position requires extensive experience managing clusters including production environments and good knowledge of cloud computing.
  3. Designer. This position should be occupied by an expert who has deep knowledge of user experience (UX) and interface design, primarily for web and mobile applications, as well as knowledge of data visualization and ideally some UI coding expertise.
  4. Product manager. This is an optional role required only for teams focused on building data products. This person will be defining the product vision, translating business problems into user stories, and focusing on helping the development team build data products based on the user stories.


Source: 3blades – What is needed to build a data science team from the ground up?

What is the best way to sync data science teams?

A well-defined workflow will help a data science team reach its goals. In order to sync data science teams and their members, it’s important to first know each of the phases needed to get data-based results.

When dealing with big data or any type of data-driven goals it helps to have a defined workflow. Whether we want to perform an analysis with the intent of telling a story (Data Visualization) or building a system that relies on data, like data mining, the process always matters. If a methodology is defined before starting any task, teams will be in sync and it will be easy to avoid losing time figuring out what’s next. This will allow a faster production rhythm of course and an overall understanding of what everyone is bringing into the team.

Here are the four main parts of the workflow that every team member should know in order to sync data science teams.

1) Preliminary analysis. When data is brand new, this step has to be performed first. In order to produce results fast you need to get an overview of all data points. In this phase, the focus is to make the data usable as quickly as possible and get quick and interesting insights.

2) Exploratory analysis. This is the part of the workflow where questions will be asked over and over again, and where the data will be cleaned and ordered to help answer those same questions. Some teams end the process here, but that is rarely ideal; it all depends on what we want to do with the data. In most cases, two further phases should be considered.

3) Data visualization. This step is imperative if we want to show the results of the exploratory analysis. It’s the part where actual storytelling takes place and where we will be able to translate our technical results into something that can be understood by a wider audience. The focus is turned to how to best present the results. The main goal data science teams should aim for in this phase is to create data visualizations that mesmerize users while telling them all the valuable information discovered in the original data sets.

4) Knowledge. If we want to study the patterns in the data to build reliable models, we turn to this phase, in which the focus of the team is producing a model that best explains the data, by engineering it and then testing different algorithms to find the best performance possible.

These are the key phases around which a data science team should sync up in order to have a finished, replicable and understandable product based on data analysis.


Source: 3blades – What is the best way to sync data science teams?

How Can Businesses Adopt a Data-Driven Culture?

There are small steps that any business can adopt in order to start incorporating a data-driven philosophy into their business. An Economist Intelligence Unit survey sponsored by Tableau Software highlights best practices.

A survey made by Economist Intelligence Unit, an independent business within The Economist Group providing forecasting and advisory services, sponsored by Tableau Software, highlighted best practices to adopt a data-driven culture among other information relevant to the field of data science. To ensure a seamless and successful transition to a data-driven culture, here are some of the top approaches your business should apply:

Share data and prosper

Appreciating the power of data is only the first step on the road to a data-driven philosophy. Older companies can have a hard time transitioning to a data-driven culture, especially if they have achieved success with minimum use of data in the past. However, times are changing and any type of company can benefit from this type of information. More than half of respondents from the survey (from top-performing companies) said that promotion of data-sharing has helped create a data-driven culture in their organization.

Increased availability of training

Around one in three respondents said it was important to have partnerships or courses in house to make employees more data-savvy.

Hire a chief data officer (CDO)

This position is key to convert data into insight so that it provides maximum impact. This task is not easy, quite the contrary, it can turn out to be very specialized and businesses shouldn’t expect their CIO or CMO to perform the job. A corporate officer is needed who is wholly dedicated to acquiring and using data to improve productivity. You may already have someone who can be promoted to a CDO at your company: someone who understands the value of data and owns it.

Create policies and guidelines

After the CDO runs a data audit internally, it is relevant that company guidelines are crafted around data analysis. This is how all employees will be equipped with replicable strategies focused on improving business challenges.

Encourage employees to seek data

Once new company policies are in place and running, the next step is to motivate employees to seek answers in data. One of the best ways to do this is offering incentives (you pick what type). Employees will then feel encouraged to use (or even create) tools and find solutions on their own without depending on the IT guys.


Source: 3blades – How Can Businesses Adopt a Data-Driven Culture?

Which College Should I Choose? (R Script)

Like many of you, I like to learn by trial and error, so I decided to run my first R script, starting from Michael L. Thompson’s script and adapting it for this new experiment.

This script will tell you which college is best for you based on your ethnicity, age, region and SAT score. Although this is a generic example, it can be easily replicated using 3Blades’s Jupyter Notebook (R) version. Simply select this option when launching a new workspace from your project.

Who Are You?

Let’s pretend you are a foreign citizen who came here from Latin America on an H1-B and is looking for a B.S. in Engineering. You are about to turn 30, got married not long ago, and a kid is on the way. You are still earning under $75K, but you are certain that with this career change you will jump into the $100K club. So what are your best choices to achieve that on the West Coast?

studentProfile = list(
    dependent = TRUE,           
    ethnicity = 'hispanic',     
    gender    = 'male',         
    age       = 'gt24',         
    income    = 'gt30Kle110K',  
    earnings  = 'gt48Kle75K',   
    sat       = 'gt1400', 
    fasfa     = 'fsend_5',     
    discipline= 'Engineering', 
    region    = 'FarWest',      
    locale    = 'CityLarge',    
    traits    = list(          
      Risk      = 'M',
      Vision    = 'M',
      Breadth   = 'M',
      Challenge = 'M')
)

Setup the Data & Model

This code loads the college database and defines necessary data structures and functions to implement the model.

## Loading college database ...Done.

Now, here are the top colleges so you can make a wise decision

# ENTER N, for top-N colleges:
ntop <- 10
studentProfile$beta <- getParameters(studentProfile,propertyMap)
caseResult          <- studentCaseStudy(studentBF,studentProfile,propertyMap,verbose=FALSE,ntop=ntop)
# This code will display the results
gplt <- plotTopN(caseResult, = TRUE)

R Script choose college

Now let’s tweak it a bit and see your best options based on your SAT scores.

R script: which college is best for you (plots for the tweaked SAT scores)

Full credit goes to Michael for this amazing job; I just tweaked it a bit to use as a brief example of the great things you can do with R.


Source: 3blades – Which College Should I Choose? (R Script)