Principal Component Analysis (PCA): A Practical Example

Let’s first define what PCA is.

Principal Component Analysis or PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

In other words, Principal Component Analysis (PCA) is a technique to detect the main components of a data set in order to reduce it to fewer dimensions while retaining the relevant information.

As an example, let X \in \mathbb{R}^{m \times n} be a data set with zero mean, that is, the matrix formed by n observations of m variables, where the elements of X are denoted as usual by x_{ij}, meaning the value of observable i in the j-th observation experiment.

A principal component is a linear combination of the variables that maximizes the variance.
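This maximization can be checked numerically: no unit-length direction yields a larger projected variance than the leading eigenvector of the covariance matrix. A quick sketch (the data here is illustrative, not the toy set used below):

```python
import numpy as np

rng = np.random.RandomState(0)
# Illustrative correlated 2-D data with zero mean
X = rng.multivariate_normal([0., 0.], [[1, 0.9], [0.9, 1]], 500)

# Leading eigenvector of the covariance matrix
vals, vecs = np.linalg.eig(np.cov(X.T))
w_best = vecs[:, np.argmax(vals)]

def proj_var(w):
    # Variance of the data projected onto direction w
    return np.var(X.dot(w))

# No random unit direction beats the leading eigenvector
for _ in range(1000):
    w = rng.randn(2)
    w /= np.linalg.norm(w)
    assert proj_var(w) <= proj_var(w_best) + 1e-9
```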

Let’s now walk through a PCA example step by step.

1. Create a random toy data set

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

m1 = [4.,-1.]
s1 = [[1,0.9],[0.9,1]]
c1 = np.random.multivariate_normal(m1,s1,100)

Let’s plot the data set and compute the PCA. The red dots in the figure below show the considered data; the blue arrow shows the eigenvector with the maximum eigenvalue.

vaps,veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)
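The plotting code itself is omitted above; a minimal sketch of what it might look like (the non-interactive Agg backend and the output file name pca_data.png are assumptions; in the notebook, %matplotlib inline renders the figure directly):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen instead of inline
import matplotlib.pyplot as plt

np.random.seed(0)
m1 = [4., -1.]
s1 = [[1, 0.9], [0.9, 1]]
c1 = np.random.multivariate_normal(m1, s1, 100)

vaps, veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)

# Red dots: the data; blue arrow: eigenvector of maximum eigenvalue
plt.scatter(c1[:, 0], c1[:, 1], color='red', s=10)
plt.quiver(np.mean(c1[:, 0]), np.mean(c1[:, 1]),
           veps[0, idx], veps[1, idx], color='blue', scale=4)
plt.savefig('pca_data.png')  # hypothetical file name
```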


PCA Closed Solution

Now that we have visualized it, let’s code the closed-form solution for PCA.

The first step is to standardize the data. We are going to use the scikit-learn library.

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(c1)
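StandardScaler rescales each column to zero mean and unit variance. Writing the same transform in plain NumPy makes this explicit (a sketch with illustrative data):

```python
import numpy as np

np.random.seed(1)
c1 = np.random.multivariate_normal([4., -1.], [[1, 0.9], [0.9, 1]], 100)

# Equivalent of StandardScaler().fit_transform(c1)
X_std = (c1 - c1.mean(axis=0)) / c1.std(axis=0)

print(np.allclose(X_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_std.std(axis=0), 1.0))   # True
```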

Eigendecomposition – Computing Eigenvectors and Eigenvalues

The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0]-1)
print('Covariance Matrix \n%s' %cov_mat)
Covariance Matrix 
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]

Let’s now verify it against NumPy’s built-in covariance function

#Let's verify against NumPy's covariance function
print('NumPy Covariance Matrix: \n%s' %np.cov(X_std.T))
NumPy Covariance Matrix: 
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]

Now we perform an eigendecomposition on the covariance matrix

cov_mat = np.cov(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
Eigenvectors 
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

Eigenvalues 
[ 1.89522132  0.1249807 ]
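The eigenvalues also tell us how much of the total variance each component explains; dividing each by their sum gives the explained-variance ratio:

```python
# Eigenvalues obtained above
eig_vals = [1.89522132, 0.1249807]

total = sum(eig_vals)
ratios = [v / total for v in eig_vals]
print(ratios)  # the first component explains about 94% of the variance
```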

Let’s check that the eigenvectors all have unit length, to see if everything is ok

# let's check that the eigenvectors have unit length
for ev in eig_vecs.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')
Everything ok!

Now we need to make a list of the (eigenvalue, eigenvector) tuples and sort them from high to low.

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
Eigenvalues in descending order:
1.89522132
0.1249807

Building a Projection Matrix

# Choose the "top 2" eigenvectors with the highest eigenvalues;
# we are going to use these values to build the matrix W.
matrix_w = np.hstack((eig_pairs[0][1].reshape(2,1),
                      eig_pairs[1][1].reshape(2,1)))

print('Matrix W:\n', matrix_w)
('Matrix W:\n', array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]]))

We will use this data to plot our output later so we can compare with a custom gradient descent approach.
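With matrix W in hand, the actual projection is a single matrix product, Y = X·W. A self-contained sketch (the toy data is regenerated here, so the numbers will differ from the run above):

```python
import numpy as np

np.random.seed(2)
c1 = np.random.multivariate_normal([4., -1.], [[1, 0.9], [0.9, 1]], 100)
X_std = (c1 - c1.mean(axis=0)) / c1.std(axis=0)

eig_vals, eig_vecs = np.linalg.eig(np.cov(X_std.T))
order = np.argsort(eig_vals)[::-1]
matrix_w = eig_vecs[:, order]  # columns sorted by decreasing eigenvalue

# Project the standardized data onto the principal components
Y = X_std.dot(matrix_w)
print(Y.shape)
# The projected components are uncorrelated: off-diagonal covariance ~ 0
print(np.cov(Y.T))
```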

There are several numerical techniques that allow us to find a point x^* that satisfies \nabla_{x, \lambda} L(x^*, \lambda^*) = 0, the saddle point. One way to tackle the problem is to “construct a new function, related to the Lagrangian, that (ideally) has a minimum at (x^*, \lambda^*).

This new function can be considered as ’distorting’ the Lagrangian at infeasible points so as to create a minimum at (x^*, \lambda ^*) . Unconstrained minimization techniques can then be applied to the new function. This approach can make it easier to guarantee convergence to a local solution, but there is the danger that the local convergence properties of the method can be damaged.

The ’distortion’ of the Lagrangian function can lead to a ’distortion’ in the Newton equations for the method. Hence the behavior of the method near the solution may be poor unless care is taken.” Another way to tackle the condition \nabla_{x, \lambda} L(x, \lambda) = 0 is to maintain feasibility at every iteration, that is, to ensure that the updates x^k follow the implicit curve h(x) = 0. For the toy problem we are considering here this is relatively easy. Assume we start from a point x^0 that satisfies h(x^0) = 0, that is, it satisfies the constraint.

The algorithm can be summarized as follows:

  1. Compute the gradient \nabla L(x^k) (observe that we compute the gradient of the Lagrangian with respect to x ).
  2. Compute an estimate of \lambda by computing the value of \lambda that minimizes \| \nabla L(x^k) \|^2 .
  3. Assume that the update is x^{k+1} = x^k - \alpha^k \nabla L(x^k) . For each candidate update x^{k+1} , project it onto the constraint h(x) = 0 . Find the \alpha^k value that decreases L(x^{k+1}) with respect to L(x^k) .
  4. Goto step 1 and repeat until convergence.

Let’s now implement the KKT conditions to see if we are able to obtain the same result as the one obtained with the closed solution. We will use the projected gradient descent to obtain the solution.

Let A be our covariance matrix

# A is the covariance matrix of the considered data
A = np.cov(c1.T)

Now we set up our initial values

# Tolerance (illustrative value; the original is not shown)
tol = 1e-08

# Initial alpha value (line search)
alpha = 1.0

# Iteration counter
cont = 0

# Initial values of w. DO NOT CHOOSE w=(0,0)
w = np.array([1., 0.])

Now we compute the eigenvalues and eigenvectors

# let's see now the eigvals and eigvects

eig_vals, eig_vecs = np.linalg.eig(A)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Now we project w onto the constraint set, the unit circle defined by w.T*w = 1

# let's project w onto the constraint w.T*w = 1
den = np.sqrt(np.dot(w.T, w))
w = w / den

The next step is to compute lambda and the initial value of the Lagrangian

# now we calculate lambda (the multiplier that minimizes the gradient norm)
lam = np.dot(np.dot(w.T, A + A.T), w) / (2 * np.dot(w.T, w))
# initial value of the Lagrangian (sign chosen so that descent maximizes variance)
lag = -np.dot(w.T, np.dot(A, w)) + lam * (np.dot(w.T, w) - 1)

Let’s review our initial values

print "Initial values"
print "Lagrangian value =", lag
print " w =", w
print " x =", m1
print " y =", s1
Initial values
Lagrangian value = -0.858313040377
 w = [ 1.  0.]
 x = [4.0, -1.0]
 y = [[1, 0.9], [0.9, 1]]

Let’s now compute our function using gradient descent

# let's now run the projected gradient descent loop

while ((alpha > tol) and (cont < 100000)):
    cont = cont + 1
    # Gradient of the Lagrangian with respect to w
    # (signs reconstructed so that the descent maximizes the variance w.T*A*w)
    grw = -np.dot(A + A.T, w) + 2 * lam * w
    # Used to know if we finished line search
    finished = 0
    while ((finished == 0) and (alpha > tol)):
        # Candidate update
        aux_w = w - alpha * grw
        # Project the candidate back onto the constraint w.T*w = 1
        den = np.sqrt(np.dot(aux_w.T, aux_w))
        aux_w = aux_w / den

        # Compute the new values of lambda and the Lagrangian
        aux_lam = np.dot(np.dot(aux_w.T, A + A.T), aux_w) / (2 * np.dot(aux_w.T, aux_w))
        aux_lag = -np.dot(aux_w.T, np.dot(A, aux_w)) + aux_lam * (np.dot(aux_w.T, aux_w) - 1)
        # Check if this is a descent step
        if aux_lag < lag:
            w = aux_w
            lam = aux_lam
            lag = aux_lag
            alpha = 1.0
            finished = 1
        else:
            # Otherwise shrink the step size and retry (line search)
            alpha = alpha / 2.0

Let’s now review our final values

# Let's now review our final values!
print " Our Final Values"
print "  Number of iterations", cont
print "  Obtained values are w =", w
print "  Correct values are  w =", veps[:,idx]
print "  Eigenvectors are =", eig_vecs
 Our Final Values
  Number of iterations 22
  Obtained values are w = [ 0.71916397  0.69484041]
  Correct values are  w = [ 0.71916398 -0.6948404 ]
  Eigenvectors are = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]]

Let’s compare our new values vs the ones obtained by the closed solution

# Full comparison
print "  Gradient Descent values   w =", w
print "  PCA analysis approach     w =", matrix_w
print "  Closed Solution           w =", veps[:,idx]
print "  Closed Solution           w =", veps,vaps
  Gradient Descent values   w = [ 0.71916397  0.69484041]
  PCA analysis approach     w = [[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
  Closed Solution           w = [ 0.71916398 -0.6948404 ]
  Closed Solution           w = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]] [ 1.56340502  0.10299214]

Very close! Let’s plot it to visualize the new values versus the ones obtained with scikit-learn.

import seaborn as sns 


The post Principal Component Analysis (PCA): A Practical Example appeared first on 3Blades.

Source: 3blades – Principal Component Analysis (PCA): A Practical Example

Data Science Venn Diagram V 2.0

Drew Conway’s Data Science Venn Diagram, created in 2010, has proven to still be current. We did a reinterpretation of it with only slight updates to the terminology he first used to determine the combination of skills and expertise a Data Scientist requires.

Conway’s “Data Science Venn Diagram” characterizes Data scientists as people with a combination of skills and know-how in three categories:

1) Hacking

2) Math and statistics

3) Domain knowledge

We’ve updated this to be more specific:

1)   Computer science

2)   Math and statistics (no way around this one!)

3)   Subject matter expertise

The difficulty in defining these skills is that the split between substance and methodology is vague, and so it is not clear how to differentiate among hackers or computer science experts, statisticians, subject matter experts, their intersections and where data science fits.

What is clear from Conway’s or this updated diagram is that a Data scientist needs to learn a lot as they aspire to become a well-rounded, competent professional.

It is important to note that each of these skills is super valuable on its own, but any combination of only two of them is not data science, and at worst can even be hazardous.

Conway has very specific thoughts on data science that are very different from what has already been discussed on the topic. In a recent interview, he said “To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods.”

On that matter, in Conway’s original Venn diagram he came up with a Danger Zone (which we are calling Traditional Software in order not to sound as ominous).

In that area he placed people who know enough to be dangerous, and he saw it as the most problematic area of the diagram. In this area people are perfectly capable of extracting and structuring data, but they lack any understanding of what the resulting model coefficients mean. Either through ignorance or spite, this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there.

We can conclude that the Data Science Venn Diagram can help pinpoint these threatening combinations of skills and therefore, detect professionals who are not properly prepared for the challenge of doing excellent Data Science. It is indeed a great tool for understanding who could end up being great in their job, or just a threat in the road to the accomplishment of a specific task or to a company.


Source: 3blades – Data Science Venn Diagram V 2.0

How to use Jupyter Notebook with Apache Spark

Jupyter Notebook (formerly known as IPython Notebook) is an interactive notebook environment that supports various programming languages and allows you to interact with your data, combine code with markdown text, and perform simple visualizations.

Here are just a couple of reasons why using Jupyter Notebook with Spark is the best choice for users that wish to present their work to other team members or to the public in general:

  • Jupyter notebooks support tab autocompletion on class names, functions, methods and variables.
  • It offers more explicit and colour-highlighted error messages than the command line IPython console.
  • Jupyter notebook integrates with many common GUI modules like PyQt, PyGTK, tkinter and with a wide variety of data science packages.
  • Through the use of kernels, multiple languages are supported.

Using Jupyter notebook with Apache Spark is sometimes difficult to configure, particularly when dealing with different development environments. Apache Toree is our solution of choice for configuring Jupyter notebooks to run with Apache Spark; it really helps simplify the installation and configuration steps. The default Toree installation works with Scala, although Toree does offer support for multiple kernels, including PySpark. We will go over both configurations.

Note: 3Blades offers a pre-built Jupyter Notebook image already configured with PySpark. This tutorial is based in part on the work of Uri Laserson.

Apache Toree with Jupyter Notebook

This should be performed on the machine where the Jupyter Notebook will be executed. If it’s not run on a Hadoop node, then the Jupyter Notebook instance should have SSH access to the Hadoop node.

This guide is based on:

  • IPython 5.1.0
  • Jupyter 4.1.1
  • Apache Spark 2.0.0

The difference between ‘IPython’ and ‘Jupyter’ can be confusing. Basically, the Jupyter team has renamed ‘IPython Notebook’ as ‘Jupyter Notebook’, however the interactive shell is still known as ‘IPython’. Jupyter Notebook ships with IPython out of the box and as such IPython provides a native kernel spec for Jupyter Notebooks.

In this case, we are adding a new kernel spec, known as PySpark.

IPython, Toree and Jupyter Notebook

1) We recommend running Jupyter Notebooks within a virtual environment. This avoids breaking things on your host system. Assuming python / python3 are installed on your local environment, run:

$ pip install virtualenv

2) Then create a virtualenv folder and activate a session, like so:

$ virtualenv -p python3 env3
$ source env3/bin/activate

3) Install Apache Spark and set environment variable for SPARK_HOME. (The official Spark site has options to install bootstrapped versions of Spark for testing.) We recommend downloading the latest version, which as of this writing is Spark version 2.0.0 with Hadoop 2.7.

$ export SPARK_HOME="/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7"

4) (Optional) Add additional ENV variables to your bash profile, or set them manually with exports:

$ echo "PATH=$PATH:/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7/bin" >> .bash_profile
$ echo "SPARK_HOME=/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7" >> .bash_profile

  • Note: substitute .bash_profile for .bashrc or .profile if using Linux.

5) Install Jupyter Notebook, which will also confirm and install needed IPython dependencies:

$ pip install jupyter

6) Install Apache Toree:

$ pip install toree

7) Configure Apache Toree installation with Jupyter:

$ jupyter toree install --spark_home=$SPARK_HOME

8) Confirm installation:

$ jupyter kernelspec list
Available kernels:
python3 /Users/myuser/Library/Jupyter/kernels/python3
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala

Launch Jupyter Notebook:

$ jupyter notebook


Direct your browser to http://localhost:8888/, which should show the main access point to the notebooks. You should see an Apache Toree – Scala kernel option:



Example IPython Notebook running with PySpark

IPython, Toree and Jupyter Notebook + PySpark 

Apache Toree supports multiple IPython kernels, including Python via PySpark. The beauty of Apache Toree is that it greatly simplifies adding new kernels with the --interpreters argument.

$ jupyter toree install --interpreters=PySpark

The SPARK_HOME ENV var will take precedence over the --spark_home argument. Otherwise you can use --spark_home to set your Spark location path:

$ jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark

Verify install. Notice how the second to last line confirms PySpark installation:

$ jupyter kernelspec list
Available kernels:
python3 /Users/myuser/Library/Jupyter/kernels/python3
apache_toree_pyspark /usr/local/share/jupyter/kernels/apache_toree_pyspark
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala


Now, when running Jupyter notebook, you will be able to use Python within a Spark context:



Source: 3blades – How to use Jupyter Notebook with Apache Spark

What Will Be The Key Deep Learning Breakthrough in 2016?

Google’s work in artificial intelligence is impressive. It includes networks of hardware and software that are very similar to the system of neurons in the human brain. By analyzing huge amounts of data, the neural nets can learn all sorts of tasks, and, in some cases like with AlphaGo, they can learn a task so well that they beat humans. They can also do it better and in a bigger scale.

AI seems to be the future of Google Search and of the technology world in general. This specific method, called deep learning, is reinventing many of the Internet’s most popular and interesting services.

Google, during its conception and growth, has relied predominantly on algorithms that followed exact rules set by programmers (think ‘if this then that’ rules). Even with that apparent reassurance of human control, there are still some concerns about the world of machine learning, because even the experts don’t fully understand how neural networks work. However, in recent years great strides have been made in understanding the human brain and thus how neural networks could be wired. If you feed enough photos of a dog into a neural net, it is able to learn to identify a dog. In some cases, a neural net can handle queries better than algorithms hand-coded by humans. Artificial intelligence is the future of Google Search and that means it’s probably a big influencer of everything else.

Based on Google Search advances and AI like AlphaGo, experts expect to see:

  • More radical deep learning architectures
  • Better integration of symbolic and subsymbolic systems
  • Expert dialogue systems

And with AI finally dominating the game of Go:

  • Deep learning for more intricate robotic planning and motor control
  • High-quality video summarization
  • More creative and higher-resolution dreaming

Experts believe these methods can accelerate scientific research per se. The idea of having scientists work alongside artificially intelligent systems that can home in on areas of research is no longer farfetched. It might happen soon, and 2016 looks like a good year for it.


Source: 3blades – What Will Be The Key Deep Learning Breakthrough in 2016?

30 Must Read Books in Analytics / Data Science

So many pages have been dedicated to Data Science that it can be hard to pinpoint the best books among the sea of available content. However, we have compiled our own list and perhaps it would be a good source of reference for you too.

This is not a definitive list of all the books that you would probably have to read during your career as a Data Scientist, but it definitely includes classics, beginners books, specialist books (more related to the business of data science or team-building) and of course, some good ones that explain the complexities of certain programs, languages or processes.

So, bring it on! Find yourself a comfortable reclining chair or a desk, good reading glasses (if needed) and a peaceful mindset to cultivate your data-driven mind.


Source: 3blades – 30 Must Read Books in Analytics / Data Science

What Industries Will Be Next to Adopting Data Science?

It’s no surprise that data science will spread to more industries in the next couple of years. So, which of them will be the next to hire more data scientists and benefit from big data?

We looked at five very different businesses that are starting to benefit or could benefit from data science, and how exactly big data can help them achieve success in their fields.

Data Science in Sports

1) Sports

If you saw the movie Moneyball you might know why big data is important to baseball and sports in general. Nowadays, for instance, many NBA teams collect millions of data records per game using cameras installed in the courts. The ultimate goal for all these sports teams is to improve health and safety, and thus the performance of the team and individual athletes. In the same way that businesses seek to use data to customize their operations, it’s easy to see how these two worlds can cross over to benefit the sports world.

Data Science in On-Demand Services

2) On-demand services

Uber gets attention for its growth and success, which came mainly because of how the company uses data. The Uber experience relies on data science and algorithms, so this is a clear example of how on-demand services can benefit from big data. Uber continues to succeed because of the convenience that its data-driven product provides. Other on-demand services should look to Uber’s example for their own good and rely more on data science.

Data Science in Entertainment Industry

3) Entertainment industry

In this era of connected consumers, media and entertainment businesses must do more than simply being digital to compete. Data science already allows some organizations to understand their audience.

A once content-centric model is turning into a consumer-centric one. The entertainment industry is prepared to capitalize on this trend by converting information into insight that boosts production and cross-channel distribution. From now on it can be expected that those who provide a unique audience experience will be the only ones to achieve growth.

Data Science in Real Estate

4) Real estate agents

We continue hearing that the housing market is unpredictable; however, some of the top real estate agents claim they saw the housing bubble burst coming way back (think again of movies, exactly like in The Big Short). It’s easy to obtain this information by following data and spotting trends. This is a great way for this volatile industry to be prepared for market shifts.

Data Science in Food Industry

5) Restaurant owners

This business field is the epitome of how important it is to be able to tell what customers want. According to the Washington, D.C.-based National Restaurant Association, restaurants face another big obstacle besides rent, licensing and personnel: critics, not only professionals but also amateurs who offer their opinions on social media. The importance of quality is the reason why restaurants are beginning to use big data to understand customer preferences and to improve their food and service.


Source: 3blades – What Industries Will Be Next to Adopting Data Science?

What is needed to build a data science team from the ground up?

What specific roles would a data science team need to have to be successful? Some will depend on the organization’s objectives, but there’s a consensus that the following positions are key.

  1. Data scientist. This role should be held by someone who can work on large datasets (on Hadoop/Spark) with machine learning algorithms, who can also create predictive models and interpret and explain model behavior in layman's terms. This position requires excellent knowledge of SQL and understanding of at least one programming language for predictive data analysis, such as R and/or Python.
  2. Data engineer / Data software developer. Requires great knowledge of distributed programming, including infrastructure and architecture. The person hired for this position should be very comfortable with installing distributed programming frameworks like Hadoop MapReduce/Spark clusters, should be able to code in more than one programming language like Scala/Python/Java, and should know Unix scripting and SQL. This role can also evolve into one of two specialized roles:
    1. Data solutions architect.  Basically a data engineer with an ample range of experience across several technologies and who has great understanding of service-oriented architecture concepts and web applications.
    2. Data platform administrator. This position requires extensive experience managing clusters including production environments and good knowledge of cloud computing.
  3. Designer. This position should be occupied by an expert who has deep knowledge of user experience (UX) and interface design, primarily for web and mobile applications, as well as knowledge of data visualization and ideally some UI coding expertise.
  4. Product manager. This is an optional role required only for teams focused on building data products. This person will be defining the product vision, translating business problems into user stories, and focusing on helping the development team build data products based on the user stories.


Source: 3blades – What is needed to build a data science team from the ground up?

What is the best way to sync data science teams?

A well-defined workflow will help a data science team reach its goals. In order to sync data science teams and their members, it’s important to first know each of the phases needed to get data-based results.

When dealing with big data or any type of data-driven goals it helps to have a defined workflow. Whether we want to perform an analysis with the intent of telling a story (Data Visualization) or building a system that relies on data, like data mining, the process always matters. If a methodology is defined before starting any task, teams will be in sync and it will be easy to avoid losing time figuring out what’s next. This will allow a faster production rhythm of course and an overall understanding of what everyone is bringing into the team.

Here are the four main parts of the workflow that every team member should know in order to sync data science teams.

1) Preliminary analysis. When data is brand new, this step has to be performed first; it’s a no-brainer. In order to produce results fast you need to get an overview of all data points. In this phase, the focus is to make the data usable as quickly as possible and to get quick and interesting insights.

2) Exploratory analysis. This is the part of the workflow where questions will be asked over and over again, and where the data will be cleaned and ordered to help answer those same questions. Some teams would end the process here, but that’s not ideal; it all depends on what we want to do with the data. So there are two more phases that should usually be considered.

3) Data visualization. This step is imperative if we want to show the results of the exploratory analysis. It’s the part where actual storytelling takes place and where we will be able to translate our technical results into something that can be understood by a wider audience. The focus is turned to how to best present the results. The main goal data science teams should aim for in this phase is to create data visualizations that mesmerize users while telling them all the valuable information discovered in the original data sets.

4) Knowledge. If we want to study the patterns in the data to build reliable models, we turn to this phase, in which the focus of the team is producing a model that best explains the data, by engineering features and then testing different algorithms to find the best performance possible.

These are the key phases around which a data science team should sync up in order to have a finished, replicable and understandable product based on data analysis.


Source: 3blades – What is the best way to sync data science teams?

How Can Businesses Adopt a Data-Driven Culture?

There are small steps that any business can adopt in order to start incorporating a data-driven philosophy into their business. An Economist Intelligence Unit survey sponsored by Tableau Software highlights best practices.

A survey conducted by the Economist Intelligence Unit, an independent business within The Economist Group providing forecasting and advisory services, sponsored by Tableau Software, highlighted best practices for adopting a data-driven culture, among other information relevant to the field of data science. To ensure a seamless and successful transition to a data-driven culture, here are some of the top approaches your business should apply:

Share data and prosper

Appreciating the power of data is only the first step on the road to a data-driven philosophy. Older companies can have a hard time transitioning to a data-driven culture, especially if they have achieved success with minimum use of data in the past. However, times are changing and any type of company can benefit from this type of information. More than half of respondents from the survey (from top-performing companies) said that promotion of data-sharing has helped create a data-driven culture in their organization.

Increased availability of training

Around one in three respondents said it was important to have partnerships or in-house courses to make employees more data-savvy.

Hire a chief data officer (CDO)

This position is key to converting data into insight so that it provides maximum impact. This task is not easy; quite the contrary, it can turn out to be very specialized, and businesses shouldn’t expect their CIO or CMO to perform the job. A corporate officer is needed who is wholly dedicated to acquiring and using data to improve productivity. You may already have someone who can be promoted to a CDO at your company: someone who understands the value of data and owns it.

Create policies and guidelines

After the CDO runs a data audit internally, it is relevant that company guidelines are crafted around data analysis. This is how all employees will be equipped with replicable strategies focused on improving business challenges.

Encourage employees to seek data

Once new company policies are in place and running, the next step is to motivate employees to seek answers in data. One of the best ways to do this is offering incentives (you pick what type). Employees will then feel encouraged to use (or even create) tools and find solutions on their own without depending on the IT guys.


Source: 3blades – How Can Businesses Adopt a Data-Driven Culture?

App and Workspace Discovery Demo

Whether you are selling toothpaste or software, reducing operations costs while improving efficiency is every business’s goal. In this post, we will go over some general concepts that we have encountered while setting up a modern micro services based architecture based on the Python stack, Docker, Consul, Registrator and AWS.

This example application demonstrates the simplest service discovery setup for a load balanced application and multiple named sub-applications. This is a typical setup for micro services based architectures. For example, requests under /authentication could route to one micro service and requests under /forum to another. Additionally, /authentication and /forum could each be backed by any number of instances of the micro service, so service discovery becomes necessary due to dynamic instance updates.

This setup uses Consul and Registrator to dynamically register new containers in a way that can be used by consul-template to reconfigure Nginx at runtime. This allows new instances of the app to be started and immediately added to the load balancer. It also allows new instances of workspaces to become accessible at their own path based on their name.

This guide is in large part the result of a consulting engagement with Glider Labs. We also used this post from the good people at Real Python as a guide. The source code for this post can be found here.


This demo assumes these versions of Docker tools, which are included in the Docker Toolbox. Older versions may work as well.

  • docker 1.9.1
  • docker-machine 0.5.1
  • docker-compose 1.5.2

You can install Docker Machine from here and Docker Compose from here. Then verify versions like so:

$ docker-machine --version
docker-machine version 0.5.1 (04cfa58)
$ docker-compose --version
docker-compose version: 1.5.2

Configure Docker

Configure Docker Machine

Change directory (cd) into repository root and then run:

$ cd myrepo
$ docker-machine create -d virtualbox dev;
Running pre-create checks...
Creating machine...
Waiting for machine to be running, this may take a few minutes...
Machine is running, waiting for SSH to be available...
Detecting operating system of created instance...
Provisioning created instance...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
To see how to connect Docker to this machine, run: docker-machine env dev

The above step creates a VirtualBox image named dev. Then configure Docker to point to the local dev environment:

$ eval "$(docker-machine env dev)"
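What `eval` consumes here is a set of export statements printed by `docker-machine env dev`. A typical set looks like the following (the IP address and certificate path are illustrative and will differ on your machine):

```shell
# Typical output of `docker-machine env dev`; eval'ing it exports
# these variables so the docker CLI talks to the dev VM:
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.100:2376"   # VM IP varies
export DOCKER_CERT_PATH="$HOME/.docker/machine/machines/dev"
export DOCKER_MACHINE_NAME="dev"
```

Any `docker` command run in this shell from now on is executed against the dev machine rather than a local daemon.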

To view running machines, type:

$ docker-machine ls
NAME      ACTIVE   DRIVER       STATE     URL                         SWARM
dev       *        virtualbox   Running   tcp://


Launch locally with Docker Compose

This demo setup is for a single host, such as your laptop, but can work across hosts with minimal changes. There are notes later for running with AWS ECS.

First, we’re going to build the containers upfront. You can view the Dockerfiles in each directory to see what they’re building: an Nginx container and two simple Flask apps (our primary app and our workspace app). Consul is pulled in from Docker Hub.

Let’s take a look at docker-compose.yml. Some images are pulled from DockerHub, while others are built from the included Dockerfiles in the repo:

consul-server:
  image: gliderlabs/consul-server
  container_name: consul-server
  net: "host"
  command: -advertise -bootstrap

registrator:
  image: gliderlabs/registrator
  container_name: registrator
  net: "host"
  volumes:
    - /var/run/docker.sock:/tmp/docker.sock
  command: -ip= consul://

app:
  build: app
  ports:
    - "8080"

workspace:
  build: workspace
  environment:
    SERVICE_NAME: workspace
    SERVICE_TAGS: workspace1
    WORKSPACE_ID: workspace1
  ports:
    - "8080"

nginx:
  build: nginx
  container_name: nginx
  ports:
    - "80:80"
  command: -consul

To build all images, run:

$ docker-compose build

Now we can start all our services (-d runs them in detached mode):

$ docker-compose up -d

This gives us a single app instance and a workspace called workspace1. We can check them out in another terminal session, hitting the IP of the Docker VM created by docker-machine:

$ curl
App from container: 6294fb10b701

This shows us the hostname of the container, which happens to be the container ID. We’ll use this later to see load balancing come into effect. Before that, let’s check our workspace app running as workspace1:

$ curl
Workspace [workspace1] from container: 68021fff0419

Each workspace instance is given a name that is made available as a subpath. The app itself also prints the hostname of the container, along with its workspace name.

Now let’s add another app instance by manually starting one with Docker:

$ docker run -d -P -e "SERVICE_NAME=app" 3blades_app

We’re using the image produced by docker-compose and providing a service name in its environment to be picked up by Registrator. We also publish all exposed ports with -P, which is important for Registrator as well.
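Since Registrator only registers published ports, it can help to verify what Docker actually published. A quick way to check (the capture-then-inspect pattern below is just one convenient approach):

```shell
# Capture the ID of an app container started with -P:
CID=$(docker run -d -P -e "SERVICE_NAME=app" 3blades_app)

# Show the container-port -> host-port mappings that Registrator reads;
# the host port is chosen by Docker and varies per run:
docker port "$CID"
```

Each published mapping becomes a service instance in Consul, which is why `-P` matters here.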

Now we can run curl a few times to see Nginx balancing across multiple instances of our app:

$ curl
App from container: 6294fb10b701
$ curl
App from container: 044b1f584475

You should see the hostname change, representing a different container serving the request. No re-configuration was necessary; we just ran a container and it was picked up by service discovery.

Similarly, we can run a new workspace. Here we’ll start a workspace called workspace2. The service name is workspace but we provide workspace2 as an environment variable for the workspace app, and as a service tag used by Nginx:

$ docker run -d -P -e "SERVICE_NAME=workspace" -e "SERVICE_TAGS=workspace2" -e "WORKSPACE_ID=workspace2" 3blades_workspace

Now we can access this workspace via curl at /workspace2:

$ curl
Workspace [workspace2] from container: 8067ad9cfaf3

You can also try running that same docker command again to create a second workspace2 instance. Nginx will load balance across them just like the app instances.

You can also try stopping any of these instances and see that they’ll be taken out of the load balancer.

To view logs, type:

$ docker-compose logs


Launch to AWS with Docker Compose

Docker Machine has various drivers to seamlessly deploy your Docker stack to several cloud providers. We used the VirtualBox driver when deploying locally; now we will use the AWS driver.

For AWS, you will need your access key, secret key and VPC ID:

$ docker-machine create --driver amazonec2 --amazonec2-access-key AKI******* --amazonec2-secret-key 8T93C********* --amazonec2-vpc-id vpc-****** production

This will set up a new Docker Machine on AWS called production:

Running pre-create checks...
Creating machine...
Waiting for machine to be running, this may take a few minutes...
Machine is running, waiting for SSH to be available...
Detecting operating system of created instance...
Provisioning created instance...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
To see how to connect Docker to this machine, run: docker-machine env production

Now we have two Machines running, one locally and one on AWS:

$ docker-machine ls
NAME         ACTIVE   DRIVER         STATE     URL                         SWARM
dev          *        virtualbox     Running   tcp://
production   -        amazonec2      Running   tcp://<awspublicip>:2376

Then switch to production on AWS as the active machine:

$ eval "$(docker-machine env production)"

Finally, let’s build the Flask app again in the cloud:

$ docker-compose build
$ docker-compose up -d

How it works


The heart of our service discovery setup is Consul. It provides DNS and HTTP interfaces to register and lookup services. In this example, the -bootstrap option was used, which effectively combines the consul agent and server. The only problem is getting services into Consul. That’s where Registrator comes in.
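To see what Consul knows at any point, you can query its standard HTTP and DNS interfaces directly. Assuming the default ports (8500 for HTTP, 8600 for DNS) and an agent reachable on localhost:

```shell
# List every registered service and its tags via the HTTP catalog API:
curl http://localhost:8500/v1/catalog/services

# Look up the instances (address and port) behind the "app" service:
curl http://localhost:8500/v1/catalog/service/app

# The same lookup through Consul's DNS interface, as SRV records:
dig @localhost -p 8600 app.service.consul SRV
```

These are the same interfaces consul-template uses under the hood to watch for changes.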


Registrator is designed to run on all hosts and listens to each host’s Docker event stream. As new containers are started, it inspects them for metadata to help define one or more service definitions. A service is created for each published port. By default the service name is based on the image name, but in our demo we override that with environment variables. We also use environment variables to tell Registrator to tag our workspace services with specific names.

Nginx + Consul Template

In this demo we have a very simple Nginx container with a very simple configuration managed by consul-template. Although there are a number of ways to achieve this, each with its own limitations and drawbacks, this is the simplest mechanism for our case.

In short, our configuration template creates upstream backends for every app and every tag of the workspace app. Each instance is used as a backend server for the upstream. It then maps the / location to our app upstream, and creates a location for each tag of our workspace apps mapping to their upstream.
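As a rough illustration, a consul-template template along these lines could produce such a config. This is a simplified sketch, not the repo’s actual template; the service and path names are examples:

```shell
# Write an illustrative consul-template template for Nginx.
# {{range service "..."}} iterates over healthy instances of a service;
# "workspace1.workspace" means: the "workspace" service, tag "workspace1".
cat > nginx.ctmpl <<'EOF'
upstream app {
  {{range service "app"}}server {{.Address}}:{{.Port}};
  {{end}}
}

upstream workspace1 {
  {{range service "workspace1.workspace"}}server {{.Address}}:{{.Port}};
  {{end}}
}

server {
  listen 80;
  location / { proxy_pass http://app; }
  location /workspace1/ { proxy_pass http://workspace1/; }
}
EOF
```

consul-template re-renders this file whenever the service catalog changes, then triggers an Nginx reload.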

Consul-template ensures that as services change, this configuration is re-rendered and reloaded in Nginx without downtime.

With ECS

There is an Amazon article on using Registrator and Consul with ECS. In short, it means you have to manage your own ECS instances to make sure Consul and Registrator are started on them. This could be done via infrastructure configuration management like CloudFormation or Terraform, or many other means. Or they could be baked into an AMI, requiring much less configuration and faster boot times. Tools like Packer make this easy.

In a production deployment, you’ll run Registrator on every host, along with Consul in client mode. In our example, we’re running a single Consul instance in server bootstrap mode; a production Consul cluster should not be run this way. Instead, run a cluster of Consul servers (usually 3 or 5, enough to maintain a quorum of N/2 + 1) behind an ELB or joined to Hashicorp’s Atlas service. In other words, Consul server instances should probably not be run on ECS, and instead on dedicated instances.
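As a hedged sketch, a production-style cluster of this kind might be started with commands along these lines. The addresses, data directory, and node names below are made up for illustration:

```shell
# On each of the 3 server instances: run in server mode, bound to the
# host's private IP, waiting for 3 peers before electing a leader.
consul agent -server -bootstrap-expect 3 \
  -bind 10.0.1.10 -data-dir /var/consul -node consul1

# On every app/ECS host: run the agent in client mode (alongside
# Registrator) and join it to the server cluster.
consul agent -bind 10.0.1.50 -data-dir /var/consul \
  -join 10.0.1.10
```

The client agents forward queries to the servers, so Registrator and consul-template on each host only ever talk to localhost.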

A production deployment will also require more thought about IPs. In our demo, we use a single local IP. In production, we’d want to bind Consul client to the private IP of the hosts. In fact, all but port 80 of Nginx should be on private IPs unless ELB is used. This leaves the IP to use to connect to the Consul server cluster, which is most easily provided with an elastic IP to one of them or an internal ELB.


Star and Watch our GitHub repo for future example updates, such as using AWS CLI Tools with Docker Compose, example configurations with other cloud providers, and more.


The post App and Workspace Discovery Demo appeared first on 3Blades.

Source: 3blades – App and Workspace Discovery Demo