Beware the bandwidth gap – speeding up optimization

Disks are slow and RAM is fast. Everyone knows that. But many optimization algorithms don’t take advantage of this. More to the point, disks currently stream at about 100-200 MB/s, solid state drives stream at over 500 MB/s with 1000x lower latency than disks, and main memory reigns supreme at about 10-100 GB/s bandwidth (depending on how many memory banks you have). This means that it is roughly 100 times more expensive to retrieve instances from disk than to recycle them once they’re already in memory. CPU caches are faster yet, with 100-1000 GB/s of bandwidth. Everyone knows this. If not, read Jeff Dean’s slides. Page 13 is pure gold.

Ok, so what does this mean for machine learning? If you can keep things in memory, you can do things way faster. This is the main idea behind Spark. It’s a wonderful alternative to Hadoop. In other words, if your data fits into memory, you’re safe and you can process data way faster. A lot of datasets that are considered big in academia fit this bill. But what about real big data? Essentially you have two options – have the systems designer do the hard work or change your algorithm. This post is about the latter. And yes, there’s a good case to be made about who should do the work: the machine learners or the folks designing the computational infrastructure (I think it’s both).

So here’s the problem: Many online algorithms load data from disk, stream it through memory as efficiently as possible and discard it after seeing it once, only to pick it up later for another pass through the data. That is, these algorithms are disk bound rather than CPU bound. Several solvers try to address this by making the disk representation more efficient, e.g. Liblinear or VowpalWabbit, both of which use their own internal representation for efficiency. While this still makes for quite efficient code that can process up to 3TB of data per hour in any given pass, main memory is still much faster. This has led to the misconception that many machine learning algorithms are disk bound. But, they aren’t …

What if we could re-use data that’s already in memory? For instance, use a ringbuffer that the disk writes into (slowly) and that the CPU reads from (about 100 times faster). The problem is what to do with an observation that we’ve already processed. A naive strategy would be to pretend that it is a new instance, i.e. we could simply update on it more than once. But this is messy, since we need to keep track of how many times we’ve seen the instance before, and it creates nonstationarity in the training set.
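To make the mechanics concrete, here is a minimal sketch in Python (the file name and the update rule are placeholders): a loader thread streams instances from disk into a bounded in-memory buffer, while the trainer keeps sampling from whatever is currently cached, so each instance can be visited many times while the disk catches up.

import random
import threading
from collections import deque

# Bounded ring buffer: the loader appends (and eventually overwrites the
# oldest entries); the trainer re-reads whatever currently sits in memory.
buffer = deque(maxlen=100_000)
done = threading.Event()

def loader(path):
    # Slow producer: streams each instance from disk exactly once.
    with open(path) as f:
        for line in f:
            buffer.append(line)  # parsing into features omitted here
    done.set()

def train(update, num_updates=10_000_000):
    # Fast consumer: updates on cached instances over and over.
    for _ in range(num_updates):
        if buffer:
            update(random.choice(buffer))
        elif done.is_set():
            break

threading.Thread(target=loader, args=("train.txt",), daemon=True).start()
train(lambda instance: None)  # plug in the actual update rule here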

A much cleaner strategy is to switch to dual variables, similar to the updates in the Dualon of Shalev-Shwartz and Singer. This is what Shin Matsushima did in our dual cached loops paper. Have a look at StreamSVM here. Essentially, it keeps data in memory in a ringbuffer and updates the dual variables. This way, we’re guaranteed to make progress at each step, even if we’re revisiting the same observation more than once. To see what happens have a look at the graph below:

It’s just as fast as LibLinear provided that it’s all in memory. Algorithmically, what happens in the SVM case is that one updates the Lagrange multipliers (alpha_i), while simultaneously keeping an estimate of the parameter vector (w) available.
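As a rough illustration, here is a sketch of the standard dual coordinate ascent step for the hinge loss (the same flavor of update used by Liblinear), not the actual StreamSVM code: each visit to a cached instance re-optimizes its own dual variable and keeps the primal vector in sync, so revisiting the same observation can only improve the dual objective.

import numpy as np

def dual_step(x_i, y_i, alpha_i, w, C=1.0):
    # One coordinate ascent step on the SVM dual for instance (x_i, y_i).
    # It is safe to apply this to the same cached instance arbitrarily often.
    grad = 1.0 - y_i * w.dot(x_i)            # dual gradient in alpha_i
    q_ii = x_i.dot(x_i)                      # curvature term
    alpha_new = np.clip(alpha_i + grad / q_ii, 0.0, C)
    w += (alpha_new - alpha_i) * y_i * x_i   # maintain w = sum_i alpha_i y_i x_i
    return alpha_new, w

# Tiny usage example on random data:
rng = np.random.default_rng(0)
x, y, w = rng.normal(size=5), 1.0, np.zeros(5)
alpha, w = dual_step(x, y, 0.0, w)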

That said, this strategy is more general: reuse data several times for optimization while it is in memory. If possible, perform successive updates by changing variables of an optimization that is well-defined regardless of the order in which (and how frequently) data is seen.

Source: Adventures in Data Land

 

Distributing Data in a Parameter Server

 

One of the key features of a parameter server is that it, well, serves parameters. In particular, it serves more parameters than a single machine can typically hold and provides more bandwidth than what a single machine offers.

A sensible strategy to increase both aspects is to arrange data in the form of a bipartite graph with clients on one side and the server machines on the other. This way bandwidth and storage increase linearly with the number of machines involved. This is well understood. For instance, distributed (key,value) stores such as memcached or Basho Riak use it. It dates back to the ideas put forward e.g. in the STOC 1997 paper by David Karger et al. on Consistent Hashing and Random Trees.

A key problem is that we obviously cannot store a mapping table from keys to machines. It would require a database of the same size as the set of keys, one that would need to be maintained and updated on each client. One way around this is the argmin hash mapping. That is, given a machine pool (M), we assign a given (key,value) pair to the machine that has the smallest hash, i.e.

$$m(k, M) = \mathrm{argmin}_{m \in M} h(m, k)$$

The advantage of this scheme is that it allows for really good load balancing and repair. First off, the load is almost uniformly distributed, short of a small number of heavy hitters. Secondly, if a machine is removed or added to the machine pool, rebalancing affects all other machines uniformly. To see this, notice that the choice of machine with the smallest and second-smallest hash value is uniform.
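In code, the argmin hash (also known as rendezvous or highest-random-weight hashing) is tiny. Here is a minimal Python sketch with made-up server names and md5 standing in for the hash function h:

import hashlib

def h(machine, key):
    # Joint hash of the (machine, key) pair; any well-mixed hash will do.
    return int(hashlib.md5(f"{machine}:{key}".encode()).hexdigest(), 16)

def assign(key, machines):
    # m(k, M) = argmin over m in M of h(m, k)
    return min(machines, key=lambda m: h(m, key))

servers = ["server-0", "server-1", "server-2"]
print(assign("weight:42", servers))      # owner of this key
print(assign("weight:42", servers[1:]))  # dropping a machine only moves the keys it owned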

Unfortunately, this is a stupid way of distributing (key,value) pairs for machine learning. And it is what we did in our 2010 VLDB and 2012 WSDM papers. In our defense, we didn’t know any better. And others copied that approach … after all, how can you improve on such nice rebalancing properties?

So why is it a bad idea? It all comes down to synchronization. Whenever a client attempts to synchronize its keys, it needs to traverse the list of the keys it owns and communicate with the appropriate servers. In the above scheme, this means that we need to talk to a different random server for essentially every key. This is amazingly costly. Probably the best comparison would be a P2P network where each byte is owned by a different machine. Downloads would take forever.

We ‘fixed’ this problem by cleverly reordering the access and then performing a few other steps of randomization. There’s even a nice load balancing lemma in the 2012 WSDM paper. However, a much better solution is to prevent the problem from happening and to borrow from key distribution algorithms such as Chord. In it, servers are inserted into a ring via a hash function. So are keys. This means that each server now owns a contiguous segment of keys. As a result, we can easily determine which keys go to which server, simply by knowing where in the ring the server sits.

[Figure: keys and servers hashed onto a ring; each server owns a contiguous segment of keys]

In the picture above, keys are represented by little red stars. They are randomly assigned via a hash function (h(k)) to the segments ‘owned’ by servers (s), which are inserted in the same way, i.e. via (h(s)). Here, each server ‘owns’ the segment to its left. Also have a look at the Amazon Dynamo paper for a related description.

Obviously, such load balancing isn’t quite as ideal as the argmin hash. For instance, if a machine fails, the next machine inherits its entire segment. However, by inserting each server (log n) times we can ensure a good load balance, and also that when machines are removed, several other machines pick up the work. Moreover, it is now very easy to replicate things (more on this later). If you’re curious about how to do this, have a look at Amar Phanishayee’s excellent thesis. In a nutshell, the machines to the left hold the replicas. More details in the next post.
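Here is a minimal Python sketch of such a ring, with hypothetical server names, md5 as the hash, and each server inserted a few times as virtual nodes. Because each server owns contiguous ranges of keys, a client can now batch its communication per server rather than per key:

import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, copies=4):
        # Insert each server `copies` times (on the order of log n virtual
        # nodes) so that load and failover spread over several machines.
        points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(copies))
        self.hashes = [p for p, _ in points]
        self.owners = [s for _, s in points]

    def owner(self, key):
        # A key belongs to the first server encountered clockwise from h(key).
        i = bisect.bisect(self.hashes, h(key)) % len(self.hashes)
        return self.owners[i]

ring = Ring(["server-0", "server-1", "server-2", "server-3"])
print(ring.owner("weight:42"))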

Source: Adventures in Data Land

 

30 Must Read Books in Analytics / Data Science


So many pages have been dedicated to Data Science that it can be hard to pinpoint the best books among the sea of available content. However, we have compiled our own list, and perhaps it will be a good source of reference for you too.

This is not a definitive list of all the books that you would probably have to read during your career as a Data Scientist, but it definitely includes classics, beginner’s books, specialist books (more related to the business of data science or team-building) and, of course, some good ones that explain the complexities of certain programs, languages or processes.

So, bring it on! Find yourself a comfortable reclining chair or a desk, good reading glasses (if needed) and a peaceful mindset to cultivate your data-driven mind.

The post 30 Must Read Books in Analytics / Data Science appeared first on 3Blades.



Source: 3blades – 30 Must Read Books in Analytics / Data Science

What Industries Will Be Next to Adopt Data Science?


It’s no surprise that data science will spread to more industries in the next couple of years. So, which of them are likely to be next to hire more data scientists and benefit from big data?

We looked at five very different businesses that are starting to benefit, or could benefit, from data science, and at how exactly big data can help them achieve success in their fields.

1) Sports

If you saw the movie Moneyball, you might know why big data is important to baseball and sports in general. Nowadays, for instance, many NBA teams collect millions of data records per game using cameras installed in their courts. The ultimate goal for all these sports teams is to improve health and safety, and thus the performance of the team and of individual athletes. In the same way that businesses seek to use data to customize their operations, it’s easy to see how these two worlds can cross over to benefit the sports world.

2) On-demand services

Uber gets attention for its growth and success, which came mainly from how the company uses data. The Uber experience relies on data science and algorithms, so it is a clear example of how on-demand services can benefit from big data. Uber continues to succeed because of the convenience its data-driven product provides. Other on-demand services would do well to follow Uber’s example and rely more on data science.

3) Entertainment industry

In this era of connected consumers, media and entertainment businesses must do more than simply be digital to compete. Data science already allows some organizations to understand their audience.

A once content-centric model is turning into a consumer-centric one. The entertainment industry is prepared to capitalize on this trend by converting information into insight that boosts production and cross-channel distribution. From now on it can be expected that those who provide a unique audience experience will be the only ones to achieve growth.

4) Real estate agents

We keep hearing that the housing market is unpredictable; however, some of the top real estate agents claim they saw the housing bubble burst coming well in advance (think of movies again, as in The Big Short). This kind of insight comes from following the data and spotting trends, and it is a great way for a volatile industry to be prepared for market shifts.

5) Restaurant owners

This business field is the epitome of how important it is to be able to tell what customers want. According to the Washington, D.C.-based National Restaurant Association, restaurants face another big obstacle besides rent, licensing and personnel: critics, not only professional ones but also amateurs who offer their opinions on social media. The importance of quality is why restaurants are beginning to use big data to understand customer preferences and to improve their food and service.

The post What Industries Will Be Next to Adopting Data Science? appeared first on 3Blades.



Source: 3blades – What Industries Will Be Next to Adopting Data Science?

What is needed to build a data science team from the ground up?


What specific roles would a data science team need to have to be successful? Some will depend on the organization’s objectives, but there’s a consensus that the following positions are key.

  1. Data scientist. This role should be held by someone who can work on large datasets (on Hadoop/Spark) with machine learning algorithms, who can also create predictive models and interpret and explain model behavior in layman’s terms. The position requires excellent knowledge of SQL and an understanding of at least one programming language for predictive data analysis, like R and/or Python.
  2. Data engineer / Data software developer. Requires great knowledge of distributed programming, including infrastructure and architecture. The person hired for this position should be very comfortable installing distributed programming frameworks like Hadoop MapReduce/Spark clusters, should be able to code in more than one programming language like Scala/Python/Java, and should know Unix scripting and SQL. This role can also evolve into one of two specialized roles:
    1. Data solutions architect. Basically a data engineer with a broad range of experience across several technologies and a strong understanding of service-oriented architecture concepts and web applications.
    2. Data platform administrator. This position requires extensive experience managing clusters including production environments and good knowledge of cloud computing.
  3. Designer. This position should be occupied by an expert who has deep knowledge of user experience (UX) and interface design, primarily for web and mobile applications, as well as knowledge of data visualization and ideally some UI coding expertise.
  4. Product manager. This is an optional role required only for teams focused on building data products. This person will be defining the product vision, translating business problems into user stories, and focusing on helping the development team build data products based on the user stories.

The post What is needed to build a data science team from the ground up? appeared first on 3Blades.



Source: 3blades – What is needed to build a data science team from the ground up?

What is the best way to sync data science teams?


A well-defined workflow will help a data science team reach its goals. In order to sync data science teams and their members, it’s important to first understand each of the phases needed to get data-based results.

When dealing with big data, or any type of data-driven goal, it helps to have a defined workflow. Whether we want to perform an analysis with the intent of telling a story (data visualization) or build a system that relies on data, like data mining, the process always matters. If a methodology is defined before starting any task, teams will be in sync and it will be easy to avoid losing time figuring out what’s next. This allows a faster production rhythm, of course, and an overall understanding of what everyone brings to the team.

Here are the four main parts of the workflow that every team member should know in order to sync data science teams.

1) Preliminary analysis. When data is brand new, this step has to be performed first; it’s a no-brainer. In order to produce results fast you need to get an overview of all data points. In this phase, the focus is to make the data usable as quickly as possible and to get quick, interesting insights.

2) Exploratory analysis. This is the part of the workflow where questions will be asked over and over again, and where the data will be cleaned and ordered to help answer those same questions. Some teams end the process here, but that’s usually not ideal; it all depends on what we want to do with the data. That is why the following two phases should be considered in most cases.

3) Data visualization. This step is imperative if we want to show the results of the exploratory analysis. It’s the part where actual storytelling takes place and where we will be able to translate our technical results into something that can be understood by a wider audience. The focus turns to how to best present the results. The main goal data science teams should aim for in this phase is to create data visualizations that mesmerize users while conveying all the valuable information discovered in the original data sets.

4) Knowledge. If we want to study the patterns in the data to build reliable models, we turn to this phase, in which the focus of the team is to produce a model that best explains the data, by engineering features and then testing different algorithms to find the best possible performance.

These are the key phases around which a data science team should sync up in order to have a finished, replicable and understandable product based on data analysis.
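As a tiny illustration of the knowledge phase, assuming scikit-learn and a generic feature matrix X with labels y (synthetic here), comparing a few algorithms by cross-validation is often the core loop:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data; in practice X and y come out of the exploratory-analysis phase.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Test different algorithms and keep the one that explains the data best.
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=100)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))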

The post What is the best way to sync data science teams? appeared first on 3Blades.



Source: 3blades – What is the best way to sync data science teams?

How Can Businesses Adopt a Data-Driven Culture?


There are small steps that any business can take in order to start incorporating a data-driven philosophy. An Economist Intelligence Unit survey sponsored by Tableau Software highlights best practices.

A survey conducted by the Economist Intelligence Unit, an independent business within The Economist Group that provides forecasting and advisory services, and sponsored by Tableau Software, highlighted best practices for adopting a data-driven culture, among other findings relevant to the field of data science. To ensure a seamless and successful transition to a data-driven culture, here are some of the top approaches your business should apply:

Share data and prosper

Appreciating the power of data is only the first step on the road to a data-driven philosophy. Older companies can have a hard time transitioning to a data-driven culture, especially if they have achieved success with minimal use of data in the past. However, times are changing and any type of company can benefit from this kind of information. More than half of the survey respondents from top-performing companies said that the promotion of data sharing has helped create a data-driven culture in their organization.

Increased availability of training

Around one in three respondents said it was important to have partnerships or in-house courses to make employees more data-savvy.

Hire a chief data officer (CDO)

This position is key to converting data into insight with maximum impact. The task is not easy; on the contrary, it can be very specialized, and businesses shouldn’t expect their CIO or CMO to perform the job. What is needed is a corporate officer wholly dedicated to acquiring and using data to improve productivity. You may already have someone at your company who can be promoted to CDO: someone who understands the value of data and owns it.

Create policies and guidelines


After the CDO runs an internal data audit, company guidelines should be crafted around data analysis. This is how all employees will be equipped with replicable strategies for tackling business challenges.

Encourage employees to seek data


Once the new company policies are in place and running, the next step is to motivate employees to seek answers in data. One of the best ways to do this is to offer incentives (you pick what type). Employees will then feel encouraged to use (or even create) tools and to find solutions on their own without depending on the IT department.

The post How Can Businesses Adopt a Data-Driven Culture? appeared first on 3Blades.



Source: 3blades – How Can Businesses Adopt a Data-Driven Culture?

App and Workspace Discovery Demo


Whether you are selling toothpaste or software, reducing operating costs while improving efficiency is every business’s goal. In this post, we will go over some general concepts that we have encountered while setting up a modern microservices-based architecture built on the Python stack, Docker, Consul, Registrator and AWS.

This example application demonstrates the simplest service discovery setup for a load-balanced application and multiple named sub-applications. This is a typical setup for microservices-based architectures. For example, https://myapp.com/authentication could route to one microservice and https://myapp.com/forum could route to another. Additionally, ../authentication and ../forum could each be backed by any number of instances of their microservice, so service discovery becomes necessary to keep up with dynamic instance updates.

This setup uses Consul and Registrator to dynamically register new containers in a way that can be used by consul-template to reconfigure Nginx at runtime. This allows new instances of the app to be started and immediately added to the load balancer. It also allows new instances of workspaces to become accessible at their own path based on their name.

This guide is in large part the result of a consulting engagement with Glider Labs. We also used this post from the good people at Real Python as a guide. The source code for this post can be found here.

Requirements

This demo assumes these versions of Docker tools, which are included in the Docker Toolbox. Older versions may work as well.

  • docker 1.9.1
  • docker-machine 0.5.1
  • docker-compose 1.5.2

You can install Docker Machine from here and Docker Compose from here. Then verify versions like so:

$ docker-machine --version
docker-machine version 0.5.1 (04cfa58)
$ docker-compose --version
docker-compose version: 1.5.2

Configure Docker

Configure Docker Machine

Change directory (cd) into the repository root and then run:

$ cd myrepo
$ docker-machine create -d virtualbox dev;
Running pre-create checks...
Creating machine...
Waiting for machine to be running, this may take a few minutes...
Machine is running, waiting for SSH to be available...
Detecting operating system of created instance...
Provisioning created instance...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
To see how to connect Docker to this machine, run: docker-machine env dev

The above step creates a VirtualBox machine named dev. Then configure Docker to point to the local dev environment:

$ eval "$(docker-machine env dev)"

To view running machines, type:

$ docker-machine ls
NAME      ACTIVE   DRIVER       STATE     URL                         SWARM
dev       *        virtualbox   Running   tcp://192.168.99.100:2376

 

Launch locally with Docker Compose

This demo setup is for a single host such as your laptop, but it can work across hosts with minimal changes. There are notes later for running with AWS ECS.

First, we’re going to build the containers upfront. You can view the Dockerfiles in each directory to see what they’re building: an Nginx container and two simple Flask apps, our primary app and our workspace app. Consul is pulled in from Docker Hub.

Let’s take a look at docker-compose.yml. Some images are pulled from DockerHub, while others are built from the included Dockerfiles in the repo:

consul:
  image: gliderlabs/consul-server
  container_name: consul-server
  net: "host"
  command: -advertise 10.0.2.15 -bootstrap
registrator:
  image: gliderlabs/registrator
  container_name: registrator
  net: "host"
  volumes:
    - /var/run/docker.sock:/tmp/docker.sock
  command: -ip=10.0.2.15 consul://10.0.2.15:8500

app:
  build: app
  environment:
    SERVICE_NAME: app
  ports:
   - "8080"
workspace:
  build: workspace
  environment:
    SERVICE_NAME: workspace
    SERVICE_TAGS: workspace1
    WORKSPACE_ID: workspace1
  ports:
   - "8080"

nginx:
  build: nginx
  container_name: nginx
  ports:
    - "80:80"
  command: -consul 10.0.2.15:8500

To build all images, run:

$ docker-compose build

Now we can start all our services (-d is used to run as daemon):

$ docker-compose up -d

This gives us a single app instance and a workspace called workspace1. We can check them out in another terminal session, hitting the IP of the Docker VM created by docker-machine:

$ curl http://192.168.99.100
App from container: 6294fb10b701

This shows us the hostname of the container, which happens to be the container ID. We’ll use this later to see load balancing come into effect. Before that, let’s check our workspace app running as workspace1:

$ curl http://192.168.99.100/workspace1/
Workspace [workspace1] from container: 68021fff0419

Each workspace instance is given a name that is made available as a subpath. The app itself also spits out the hostname of the container, as well as its own name.
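For reference, the workspace service can be as small as a few lines of Flask. The following is only a sketch of what it might look like (the actual code lives in the repo’s workspace directory): it reads WORKSPACE_ID from the environment set in docker-compose.yml and reports the container hostname.

import os
import socket

from flask import Flask

app = Flask(__name__)
WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "unknown")

@app.route("/")
def index():
    # Nginx maps /<workspace name>/ to this app's root, so we just report
    # which workspace and which container answered the request.
    return "Workspace [{}] from container: {}\n".format(WORKSPACE_ID, socket.gethostname())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)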

Now let’s add another app instance by manually starting one with Docker:

$ docker run -d -P -e "SERVICE_NAME=app" 3blades_app

We’re using the image produced by docker-compose and providing a service name in its environment to be picked up by Registrator. We also publish all exposed ports with -P, which is important for Registrator as well.

Now we can run curl a few times to see Nginx balancing across multiple instances of our app:

$ curl http://192.168.99.100
App from container: 6294fb10b701
$ curl http://192.168.99.100
App from container: 044b1f584475

You should see the hostname change, indicating that a different container served the request. No reconfiguration was necessary; we just ran a container and it was picked up by service discovery.

Similarly, we can run a new workspace. Here we’ll start a workspace called workspace2. The service name is still workspace, but we provide workspace2 as an environment variable for the workspace app and as a service tag used by Nginx:

$ docker run -d -P -e "SERVICE_NAME=workspace" -e "SERVICE_TAGS=workspace2" -e "WORKSPACE_ID=workspace2" 3blades_workspace

Now we can access this workspace via curl at /workspace2:

$ curl http://192.168.99.100/workspace2/
Workspace [workspace2] from container: 8067ad9cfaf3

You can also try running that same docker command again to create a second workspace2 instance. Nginx will load balance across them just like the app instances.

You can also try stopping any of these instances and see that they’ll be taken out of the load balancer.

To view logs, type:

$ docker-compose logs

Launch to AWS with Docker Compose

Docker Machine has various drivers to seamlessly deploy your docker stack to several cloud providers. We essentially used the VirtualBox driver when deploying locally. Now we will use the AWS driver.

For AWS, you will need your access key, secret key and VPC ID:

$ docker-machine create --driver amazonec2 --amazonec2-access-key AKI******* --amazonec2-secret-key 8T93C********* --amazonec2-vpc-id vpc-****** production

This will set up a new Docker Machine on AWS called production:

Running pre-create checks...
Creating machine...
Waiting for machine to be running, this may take a few minutes...
Machine is running, waiting for SSH to be available...
Detecting operating system of created instance...
Provisioning created instance...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
To see how to connect Docker to this machine, run: docker-machine env production

Now we have two Machines running, one locally and one on AWS:

$ docker-machine ls
NAME         ACTIVE   DRIVER         STATE     URL                         SWARM
dev          *        virtualbox     Running   tcp://192.168.99.100:2376
production   -        amazonec2      Running   tcp://<awspublicip>:2376

Then switch to production on AWS as the active machine:

$ eval "$(docker-machine env production)"

Finally, let’s build the Flask app again in the cloud:

$ docker-compose build
$ docker-compose up -d

How it works

Consul

The heart of our service discovery setup is Consul. It provides DNS and HTTP interfaces to register and look up services. In this example, the -bootstrap option was used, which effectively combines the Consul agent and server. The only problem is getting services into Consul. That’s where Registrator comes in.
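Once services are registered, any process can ask Consul for the live instances of a service over its HTTP API. A small Python sketch against Consul’s catalog endpoint (using this demo’s Consul address and the app service) looks like this:

import requests

# List the instances currently backing the "app" service.
# 8500 is Consul's default HTTP port; adjust the host for your setup.
for inst in requests.get("http://10.0.2.15:8500/v1/catalog/service/app").json():
    print(inst["ServiceAddress"] or inst["Address"], inst["ServicePort"])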

Registrator

Registrator is designed to run on all hosts and listens to each host’s Docker event stream. As new containers are started, it inspects them for metadata to help define one or more service definitions. A service is created for each published port. By default the service name is based on the image name, but in our demo we override that with environment variables. We also use environment variables to tell Registrator to tag our workspace services with specific names.
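Under the hood, each registration amounts to a call against the local Consul agent’s HTTP API. The following Python sketch shows the effect (it is not Registrator’s actual code, which is written in Go, and the ID and port values are made up):

import requests

# Roughly what Registrator does for each published container port.
service = {
    "ID": "workspace-8067ad9cfaf3",  # hypothetical container-derived ID
    "Name": "workspace",             # from the SERVICE_NAME environment variable
    "Tags": ["workspace2"],          # from SERVICE_TAGS
    "Address": "10.0.2.15",
    "Port": 32769,                   # the host port published by docker with -P
}
requests.put("http://10.0.2.15:8500/v1/agent/service/register", json=service)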

Nginx + Consul Template

In this demo we have a very simple Nginx container with a very simple configuration managed by consul-template. Although there are a number of ways to achieve this with various kinds of limitations and drawbacks, this is the simplest mechanism for this case.

In short, our configuration template creates upstream backends for every app and every tag of the workspace app. Each instance is used as a backend server for the upstream. It then maps the / location to our app upstream, and creates a location for each tag of our workspace apps mapping to their upstream.

Consul-template ensures that, as services change, this configuration is re-rendered and reloaded without downtime in Nginx.

Source code: https://github.com/3Blades/deployment

Running with ECS

There is an Amazon article on using Registrator and Consul with ECS. In short, it means you have to manage your own ECS instances to make sure Consul and Registrator are started on them. This could be done via infrastructure configuration management like CloudFormation or Terraform, or many other means. Or they could be baked into an AMI, requiring much less configuration and faster boot times. Tools like Packer make this easy.

In a production deployment, you’ll run Registrator on every host, along with Consul in client mode. In our example, we’re running a single Consul instance in server bootstrap mode; a production Consul cluster should not be run this way. In production, Consul should be run with enough servers to maintain a quorum of N/2 + 1 (usually 3 or 5 servers in total), behind an ELB or joined to HashiCorp’s Atlas service. In other words, Consul server instances should probably not be run on ECS, but on dedicated instances.

A production deployment will also require more thought about IPs. In our demo, we use a single local IP. In production, we’d want to bind the Consul client to the private IP of each host. In fact, everything except port 80 of Nginx should be on private IPs unless an ELB is used. That leaves the IP used to connect to the Consul server cluster, which is most easily provided by attaching an Elastic IP to one of the servers or by using an internal ELB.

Updates

Star and watch our GitHub repo for future example updates, such as using the AWS CLI tools with Docker Compose and example configurations for other cloud providers, among others.

 

The post App and Workspace Discovery Demo appeared first on 3Blades.



Source: 3blades – App and Workspace Discovery Demo

Which College Should I Choose? (R Script)


Like many of you, I like to learn by trial and error, so I decided to run my first R script, starting from Michael L. Thompson’s script and adapting it to this new experiment.

This script will tell you which college is best for you based on your ethnicity, age, region and SAT score. Although this is a generic example, it can be easily replicated using 3Blades’s Jupyter Notebook (R) version. Simply select this option when launching a new workspace from your project.

Who Are You?

Let’s pretend you are a foreign citizen who came here from Latin America on an H1-B visa and is looking for a B.S. in Engineering. You are about to turn 30, you got married not that long ago, and a kid is on the way. You are still earning under $75K, but you are certain that with this career change you will jump into the $100K club. So what are your best choices to achieve that on the West Coast?

studentProfile = list(
    dependent = TRUE,           
    ethnicity = 'hispanic',     
    gender    = 'male',         
    age       = 'gt24',         
    income    = 'gt30Kle110K',  
    earnings  = 'gt48Kle75K',   
    sat       = 'gt1400', 
    fasfa     = 'fsend_5',     
    discipline= 'Engineering', 
    region    = 'FarWest',      
    locale    = 'CityLarge',    
    traits    = list(          
      Risk      = 'M',
      Vision    = 'M',
      Breadth   = 'M',
      Challenge = 'M') 
)

Setup the Data & Model

This code loads the college database and defines necessary data structures and functions to implement the model.

## Loading college database ...Done.

Now, here are the top colleges so you can make a wise decision:

# ENTER N, for top-N colleges:
ntop <- 10
studentProfile$beta <- getParameters(studentProfile,propertyMap)
caseResult          <- studentCaseStudy(studentBF,studentProfile,propertyMap,verbose=FALSE,ntop=ntop)
# This code will display the results
gplt <- plotTopN(caseResult,plot.it = TRUE)

[Plot: top-10 colleges for this student profile]

Now let’s tweak things a bit and see what your best options are based on your SAT scores:

[Plots: top-college rankings re-run for different SAT scores]

Full credit goes to Michael for this amazing job; I just tweaked it a bit to use as a brief example of the great things you can do with R.

The post Which College Should I Choose? (R Script) appeared first on 3Blades.



Source: 3blades – Which College Should I Choose? (R Script)

OpenAI: A new non-profit AI company


A new non-profit artificial intelligence research company has just been founded. According to the announcement on the company’s website, the goal of the company is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return. The announcement was made on the last day of the NIPS 2015 conference, and OpenAI held a small event on 12 December 2015 near the conference venue.

OpenAI’s research director is Ilya Sutskever, one of the world’s leading experts in machine learning. The CTO is Greg Brockman, formerly the CTO of Stripe. The group’s other founding members are world-class research engineers and scientists: Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, John Schulman, Pamela Vagata, and Wojciech Zaremba. Pieter Abbeel, Yoshua Bengio, Alan Kay, Sergey Levine, and Vishal Sikka are advisors to the group. OpenAI’s co-chairs are Sam Altman and Elon Musk.
Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although they expect to only spend a tiny fraction of this in the next few years.
Medium published an interview about OpenAI with Altman, Musk, and Brockman [2], where the founders answered various questions about their new AI initiative.
In a Guardian article about OpenAI written by Neil Lawrence [3], a professor of machine learning at the University of Sheffield, the importance of open data for the AI community is emphasized alongside that of open algorithms.

[1] OpenAI, “Introducing OpenAI”, https://openai.com/blog/introducing-openai/, Greg Brockman, Ilya Sutskever, and the OpenAI team, December 11, 2015.
[2] Medium, “How Elon Musk and Y Combinator Plan to Stop Computers From Taking Over”, https://medium.com/backchannel/how-elon-musk-and-y-combinator-plan-to-stop-computers-from-taking-over-17e0e27dd02a#.x79zvtwsl, Steven Levy, December 11, 2015.
[3] The Guardian, “OpenAI won’t benefit humanity without data-sharing”, http://www.theguardian.com/media-network/2015/dec/14/openai-benefit-humanity-data-sharing-elon-musk-peter-thiel, Neil Lawrence, December 14, 2015.



Source: DeepLearning – OpenAI: A new non-profit AI company