How to use Jupyter Notebook with Apache Spark

Jupyter Notebook (formerly known as IPython Notebook) is an interactive notebook environment that supports multiple programming languages, allowing you to interact with your data, combine code with Markdown text and perform simple visualizations.

Here are just a couple of reasons why using Jupyter Notebook with Spark is a great choice for users who wish to present their work to other team members or to the public in general:

  • Jupyter Notebook supports tab autocompletion on class names, functions, methods and variables.
  • It offers more explicit and colour-highlighted error messages than the command-line IPython console.
  • It integrates with many common GUI modules like PyQt, PyGTK and tkinter, as well as with a wide variety of data science packages.
  • Through the use of kernels, multiple languages are supported.

Using Jupyter Notebook with Apache Spark can be difficult to configure, particularly when dealing with different development environments. Apache Toree is our solution of choice for configuring Jupyter Notebook to run with Apache Spark; it really helps simplify the installation and configuration steps. The default Toree installation works with Scala, although Toree also supports multiple kernels, including PySpark. We will go over both configurations.

Note: 3Blades offers a pre-built Jupyter Notebook image already configured with PySpark. This tutorial is based in part on the work of Uri Laserson.

Apache Toree with Jupyter Notebook

This should be performed on the machine where the Jupyter Notebook will be executed. If it’s not run on a Hadoop node, then the Jupyter Notebook instance should have SSH access to the Hadoop node.

This guide is based on:

  • IPython 5.1.0
  • Jupyter 4.1.1
  • Apache Spark 2.0.0

The difference between ‘IPython’ and ‘Jupyter’ can be confusing. In short, the Jupyter team renamed ‘IPython Notebook’ to ‘Jupyter Notebook’; however, the interactive shell is still known as ‘IPython’. Jupyter Notebook ships with IPython out of the box, and IPython provides a native kernel spec for Jupyter Notebook.

In this case, we are adding a new kernel spec, known as PySpark.

IPython, Toree and Jupyter Notebook

1) We recommend running Jupyter Notebook within a virtual environment. This avoids breaking things on your host system. Assuming python / python3 is installed in your local environment, run:

$ pip install virtualenv

2) Then create a virtualenv folder and activate a session, like so:

$ virtualenv -p python3 env3
$ source env3/bin/activate

3) Install Apache Spark and set the SPARK_HOME environment variable. (The official Spark site has options to download pre-built versions of Spark for testing.) We recommend downloading the latest version, which as of this writing is Spark 2.0.0 with Hadoop 2.7. Note that SPARK_HOME should point to the root of the Spark distribution, not its bin subdirectory:

$ export SPARK_HOME="/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7"
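You can sanity-check the download by asking Spark to print its version (using the same placeholder path as above):

```
$ /path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7/bin/spark-shell --version
```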

4) (Optional) Add additional environment variables to your bash profile, or set them manually with exports:

$ echo "export PATH=$PATH:/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7/bin" >> ~/.bash_profile
$ echo "export SPARK_HOME=/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7" >> ~/.bash_profile
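To confirm the variables will be picked up by a new shell, you can set them and echo them back; this is a quick sketch using the tutorial's placeholder path, not a real Spark install:

```shell
# Simulate what the profile entries above do, then verify they are set.
# The path is the tutorial's placeholder, not a real Spark install.
export SPARK_HOME="/path_to_downloaded_spark/spark-2.0.0-bin-hadoop2.7"
export PATH="$PATH:$SPARK_HOME/bin"

echo "$SPARK_HOME"
```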

  • Note: use .bashrc or .profile instead of .bash_profile if you are on Linux.

5) Install Jupyter Notebook, which will also confirm and install needed IPython dependencies:

$ pip install jupyter

6) Install Apache Toree:

$ pip install toree

7) Configure the Apache Toree installation with Jupyter:

$ jupyter toree install --spark_home=$SPARK_HOME

8) Confirm the installation:

$ jupyter kernelspec list
Available kernels:
python3 /Users/myuser/Library/Jupyter/kernels/python3
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala

Launch Jupyter Notebook:

$ jupyter notebook


Direct your browser to http://localhost:8888/, which should show the main access point to the notebooks. You should see an Apache Toree – Scala kernel option:



Example IPython Notebook running with PySpark

IPython, Toree and Jupyter Notebook + PySpark 

Apache Toree supports multiple Jupyter kernels, including Python via PySpark. The beauty of Apache Toree is that it greatly simplifies adding new kernels with the --interpreters argument.

$ jupyter toree install --interpreters=PySpark

The SPARK_HOME environment variable takes precedence over the --spark_home argument. If SPARK_HOME is not set, you can use --spark_home to specify your Spark location path:

$ jupyter toree install --spark_home=/path/to/your/spark_directory --interpreters=PySpark

Verify the installation. Notice how the second-to-last line confirms the PySpark installation:

$ jupyter kernelspec list
Available kernels:
python3 /Users/myuser/Library/Jupyter/kernels/python3
apache_toree_pyspark /usr/local/share/jupyter/kernels/apache_toree_pyspark
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala


Now, when running Jupyter notebook, you will be able to use Python within a Spark context:


The post How to use Jupyter Notebook with Apache Spark appeared first on 3Blades.


App and Workspace Discovery Demo

Whether you are selling toothpaste or software, reducing operations costs while improving efficiency is every business's goal. In this post, we will go over some general concepts that we have encountered while setting up a modern microservices-based architecture built on the Python stack, Docker, Consul, Registrator and AWS.

This example application demonstrates the simplest service discovery setup for a load-balanced application and multiple named sub-applications. This is a typical setup for microservices-based architectures. For example, ../authentication could route to one microservice and ../forum could route to another. Additionally, ../authentication and ../forum could each be backed by any number of instances of their microservice, so service discovery becomes necessary to handle dynamic instance updates.

This setup uses Consul and Registrator to dynamically register new containers in a way that can be used by consul-template to reconfigure Nginx at runtime. This allows new instances of the app to be started and immediately added to the load balancer. It also allows new instances of workspaces to become accessible at their own path based on their name.

This guide is in large part the result of a consulting engagement with Glider Labs. We also used this post from the good people at Real Python as a guide. The source code for this post can be found here.


This demo assumes these versions of Docker tools, which are included in the Docker Toolbox. Older versions may work as well.

  • docker 1.9.1
  • docker-machine 0.5.1
  • docker-compose 1.5.2

You can install Docker Machine from here and Docker Compose from here. Then verify versions like so:

$ docker-machine --version
docker-machine version 0.5.1 (04cfa58)
$ docker-compose --version
docker-compose version: 1.5.2

Configure Docker

Configure Docker Machine

Change directory (cd) into repository root and then run:

$ cd myrepo
$ docker-machine create -d virtualbox dev;
Running pre-create checks...
Creating machine...
Waiting for machine to be running, this may take a few minutes...
Machine is running, waiting for SSH to be available...
Detecting operating system of created instance...
Provisioning created instance...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
To see how to connect Docker to this machine, run: docker-machine env dev

The above step creates a virtualbox image named dev. Then configure docker to point to local dev environment:

$ eval "$(docker-machine env dev)"

To view running machines, type:

$ docker-machine ls
NAME      ACTIVE   DRIVER       STATE     URL                         SWARM
dev       *        virtualbox   Running   tcp://


Launch locally with Docker Compose

This demo setup is for a single host, such as your laptop, but can work across hosts with minimal changes. There are notes later for running with AWS ECS.

First, we’re going to build the containers upfront. You can view the Dockerfiles in each directory to see what they’re building: an Nginx container and two simple Flask apps (our primary app and our workspace app). Consul is pulled in from Docker Hub.

Let’s take a look at docker-compose.yml. Some images are pulled from Docker Hub, while others are built from the Dockerfiles included in the repo:

consul-server:
  image: gliderlabs/consul-server
  container_name: consul-server
  net: "host"
  command: -advertise -bootstrap

registrator:
  image: gliderlabs/registrator
  container_name: registrator
  net: "host"
  volumes:
    - /var/run/docker.sock:/tmp/docker.sock
  command: -ip= consul://

app:
  build: app
  ports:
    - "8080"

workspace:
  build: workspace
  environment:
    SERVICE_NAME: workspace
    SERVICE_TAGS: workspace1
    WORKSPACE_ID: workspace1
  ports:
    - "8080"

nginx:
  build: nginx
  container_name: nginx
  ports:
    - "80:80"
  command: -consul

To build all images, run:

$ docker-compose build

Now we can start all our services (-d is used to run as daemon):

$ docker-compose up -d

This gives us a single app instance and a workspace called workspace1. We can check them out in another terminal session, hitting the IP of the Docker VM created by docker-machine:

$ curl
App from container: 6294fb10b701

This shows us the hostname of the container, which happens to be the container ID. We’ll use this later to see load balancing come into effect. Before that, let’s check our workspace app running as workspace1:

$ curl
Workspace [workspace1] from container: 68021fff0419

Each workspace instance is given a name that is made available as a subpath. The app itself also spits out the hostname of the container, as well as mentioning what its name is.

Now let’s add another app instance by manually starting one with Docker:

$ docker run -d -P -e "SERVICE_NAME=app" 3blades_app

We’re using the image produced by docker-compose and providing a service name in its environment to be picked up by Registrator. We also publish all exposed ports with -P, which is important for Registrator as well.

Now we can run curl a few times to see Nginx balancing across multiple instances of our app:

$ curl
App from container: 6294fb10b701
$ curl
App from container: 044b1f584475

You should see the hostname changed, representing a different container serving the request. No re-configuration necessary, we just ran a container to be picked up by service discovery.
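A quick way to exercise the round-robin behaviour is a small loop; this sketch assumes the VM created earlier is named dev and Nginx is listening on port 80:

```
$ for i in $(seq 1 10); do curl -s "http://$(docker-machine ip dev)/"; done | sort | uniq -c
```

The counts show how many of the ten requests each backend container served.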

Similarly, we can run a new workspace. Here we’ll start a workspace called workspace2. The service name is workspace but we provide workspace2 as an environment variable for the workspace app, and as a service tag used by Nginx:

$ docker run -d -P -e "SERVICE_NAME=workspace" -e "SERVICE_TAGS=workspace2" -e "WORKSPACE_ID=workspace2" 3blades_workspace

Now we can access this workspace via curl at /workspace2:

$ curl
Workspace [workspace2] from container: 8067ad9cfaf3

You can also try running that same docker command again to create a second workspace2 instance. Nginx will load balance across them just like the app instances.

You can also try stopping any of these instances and see that they’ll be taken out of the load balancer.

To view logs, type:

$ docker-compose logs


Launch to AWS with Docker Compose

Docker Machine has various drivers to seamlessly deploy your docker stack to several cloud providers. We essentially used the VirtualBox driver when deploying locally. Now we will use the AWS driver.

For AWS, you will need your access key, secret key and VPC ID:

$ docker-machine create --driver amazonec2 --amazonec2-access-key AKI******* --amazonec2-secret-key 8T93C********* --amazonec2-vpc-id vpc-****** production

This will set up a new Docker Machine on AWS called production:

Running pre-create checks...
Creating machine...
Waiting for machine to be running, this may take a few minutes...
Machine is running, waiting for SSH to be available...
Detecting operating system of created instance...
Provisioning created instance...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
To see how to connect Docker to this machine, run: docker-machine env production

Now we have two Machines running, one locally and one on AWS:

$ docker-machine ls
NAME         ACTIVE   DRIVER         STATE     URL                         SWARM
dev          *        virtualbox     Running   tcp://
production   -        amazonec2      Running   tcp://<awspublicip>:2376

Then switch to production on AWS as the active machine:

$ eval "$(docker-machine env production)"

Finally, let’s build the Flask app again in the cloud:

$ docker-compose build
$ docker-compose up -d


How it works


The heart of our service discovery setup is Consul. It provides DNS and HTTP interfaces to register and lookup services. In this example, the -bootstrap option was used, which effectively combines the consul agent and server. The only problem is getting services into Consul. That’s where Registrator comes in.
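For example, once the stack is up you can ask Consul directly which instances back a service. This sketch assumes an agent reachable on the default ports (8500 for HTTP, 8600 for DNS):

```
$ curl http://localhost:8500/v1/catalog/service/app
$ dig @localhost -p 8600 app.service.consul
```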


Registrator is designed to run on all hosts and listens to each host’s Docker event stream. As new containers are started, it inspects them for metadata to help define one or more service definitions. A service is created for each published port. By default the service name is based on the image name, but in our demo we override that with environment variables. We also use environment variables to tell Registrator to tag our workspace services with specific names.

Nginx + Consul Template

In this demo we have a very simple Nginx container with a very simple configuration managed by consul-template. Although there are a number of ways to achieve this with various kinds of limitations and drawbacks, this is the simplest mechanism for this case.

In short, our configuration template creates upstream backends for every app and every tag of the workspace app. Each instance is used as a backend server for the upstream. It then maps the / location to our app upstream, and creates a location for each tag of our workspace apps mapping to their upstream.
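As an illustration, the app portion of such a consul-template template might look something like the following; this is a hypothetical sketch, not the exact template from the repo:

```
upstream app {
  {{range service "app"}}server {{.Address}}:{{.Port}};
  {{end}}
}

server {
  listen 80;

  location / {
    proxy_pass http://app;
  }
}
```

The `range service "app"` block expands to one `server` line per healthy instance registered in Consul, which is what keeps the upstream in sync as containers come and go.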

Consul-template ensures that as services change, this configuration is re-rendered and reloaded in Nginx without downtime.

With ECS

There is an Amazon article on using Registrator and Consul with ECS. In short, it means you have to manage your own ECS instances to make sure Consul and Registrator are started on them. This could be done via infrastructure configuration management like CloudFormation or Terraform, or many other means. Or they could be baked into an AMI, requiring much less configuration and faster boot times. Tools like Packer make this easy.

In a production deployment, you’ll run Registrator on every host, along with Consul in client mode. In our example, we’re running a single Consul instance in server bootstrap mode. A production Consul cluster should not be run this way. In a production deployment, Consul server should be run with N/2 + 1 servers (usually 3 or 5) behind an ELB or joined to Hashicorp’s Atlas service. In other words, Consul server instances should probably not be run on ECS, and instead on dedicated instances.

A production deployment will also require more thought about IPs. In our demo, we use a single local IP. In production, we’d want to bind the Consul client to the private IP of each host. In fact, everything except port 80 of Nginx should be on private IPs unless an ELB is used. This leaves the question of which IP to use to connect to the Consul server cluster, which is most easily answered with an elastic IP on one of the servers or an internal ELB.
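In that kind of deployment, each host's Consul client would be started with flags along these lines (the addresses are placeholders):

```
$ consul agent -bind=<private-ip> -retry-join=<consul-server-address> -data-dir=/var/lib/consul
```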


Star and watch our GitHub repo for future example updates, such as using AWS CLI Tools with Docker Compose, example configurations with other cloud providers, and more.


The post App and Workspace Discovery Demo appeared first on 3Blades.
