Distributing Data in a Parameter Server


One of the key features of a parameter server is that it, well, serves parameters. In particular, it serves more parameters than a single machine can typically hold and provides more bandwidth than what a single machine offers.


A sensible strategy to increase both aspects is to arrange data in the form of a bipartite graph with clients on one side and the server machines on the other. This way bandwidth and storage increase linearly with the number of machines involved. This is well understood. For instance, distributed (key,value) stores such as memcached or Basho Riak use it. It dates back to the ideas put forward e.g. in the STOC 1997 paper by David Karger et al. on Consistent Hashing and Random Trees.

A key problem is that we can obviously not store a mapping table from the keys to the machines. This would require a database that is of the same size as the set of keys and that would need to be maintained and updated on each client. One way around this is to use the argmin hash mapping. That is, given a machine pool $M$, we assign a given (key,value) pair to the machine that has the smallest hash, i.e.

$$m(k, M) = \mathrm{argmin}_{m \in M} h(m, k)$$

The advantage of this scheme is that it allows for really good load balancing and repair. First off, the load is almost uniformly distributed, short of a small number of heavy hitters. Secondly, if a machine is removed or added to the machine pool, rebalancing affects all other machines uniformly. To see this, notice that the choice of machine with the smallest and second-smallest hash value is uniform.
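The argmin assignment above can be sketched in a few lines. This is a minimal illustration, not the implementation from the papers discussed; the hash function and machine names are placeholders.

```python
# Minimal sketch of argmin (rendezvous) hashing: each key goes to the
# machine with the smallest hash of the (machine, key) pair.
import hashlib

def h(machine: str, key: str) -> int:
    """Deterministic hash of a (machine, key) pair."""
    return int(hashlib.md5(f"{machine}:{key}".encode()).hexdigest(), 16)

def assign(key: str, machines: list[str]) -> str:
    """Route the key to the machine with the smallest hash value."""
    return min(machines, key=lambda m: h(m, key))

machines = ["server-a", "server-b", "server-c"]
owner = assign("weight_42", machines)
```

Note the repair property: removing any machine other than `owner` leaves the assignment unchanged, and removing `owner` re-routes the key to the machine with the second-smallest hash, which is uniformly distributed across the remaining pool.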

Unfortunately, this is a stupid way of distributing (key,value) pairs for machine learning. And it is what we did in our 2010 VLDB and 2012 WSDM papers. In our defense, we didn’t know any better. And others copied that approach … after all, how can you improve on such nice rebalancing properties?

This raises the question of why it is a bad idea. It all comes down to the issue of synchronization. Basically, whenever a client attempts to synchronize its keys, it needs to traverse the list of the keys it owns and communicate with the appropriate servers. In the above scheme, this means that we need to communicate with a new random server for each key. This is amazingly costly. Probably the best comparison would be a P2P network where each byte is owned by a different machine. Downloads would take forever.

We ‘fixed’ this problem by cleverly reordering the access and then performing a few other steps of randomization. There’s even a nice load balancing lemma in the 2012 WSDM paper. However, a much better solution is to prevent the problem from happening and to borrow from key distribution algorithms such as Chord. In it, servers are inserted into a ring via a hash function. So are keys. This means that each server now owns a contiguous segment of keys. As a result, we can easily determine which keys go to which server, simply by knowing where in the ring the server sits.


In the picture above, keys are represented by little red stars. They are randomly assigned via a hash function $h(k)$ to the segments ‘owned’ by servers $s$, which are inserted into the ring in the same way, i.e. via $h(s)$; each server ‘owns’ the segment to its left. Also have a look at the Amazon Dynamo paper for a related description.

Obviously, such load balancing isn’t quite as ideal as the argmin hash. For instance, if a machine fails, the next machine inherits the entire segment. However, by inserting each server $\log n$ times we can ensure that a good load balance is achieved and also that when machines are removed, there are several other machines that pick up the work. Moreover, it is now also very easy to replicate things (more on this later). If you’re curious how to do this, have a look at Amar Phanishayee’s excellent thesis. In a nutshell, the machines to the left hold the replicas. More details in the next post.
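A Chord-style ring with virtual nodes can be sketched as follows. This is an illustrative toy, not production code; the hash function, server names, and the choice of 8 virtual nodes per server are assumptions standing in for the $\log n$ insertions described above.

```python
# Sketch of a consistent-hashing ring: each server is inserted several
# times under distinct virtual names; a key belongs to the next point
# clockwise on the ring, so each server owns contiguous segments.
import hashlib
from bisect import bisect_right

def ring_hash(x: str) -> int:
    return int(hashlib.md5(x.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, replicas=8):
        # Insert each server `replicas` times under distinct virtual names.
        self._points = sorted(
            (ring_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )

    def owner(self, key: str) -> str:
        """Return the server owning the segment the key falls into."""
        pos = ring_hash(key)
        # First ring point clockwise of the key, wrapping around the ring.
        idx = bisect_right(self._points, (pos, "")) % len(self._points)
        return self._points[idx][1]

ring = Ring(["s1", "s2", "s3"])
```

Because each server owns contiguous hash ranges, a client can sort its keys once and communicate with each server in one batch, instead of contacting a random server per key.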

Source: Adventures in Data Land


30 Must Read Books in Analytics / Data Science

So many pages have been dedicated to Data Science that it can be hard to pinpoint the best books among the sea of available content. However, we have compiled our own list and perhaps it would be a good source of reference for you too.

This is not a definitive list of all the books that you would probably have to read during your career as a Data Scientist, but it definitely includes classics, beginners books, specialist books (more related to the business of data science or team-building) and of course, some good ones that explain the complexities of certain programs, languages or processes.

So, bring it on! Find yourself a comfortable reclining chair or a desk, good reading glasses (if needed) and a peaceful mindset to cultivate your data-driven mind.

The post 30 Must Read Books in Analytics / Data Science appeared first on 3Blades.

Source: 3blades – 30 Must Read Books in Analytics / Data Science

What Industries Will Be Next to Adopt Data Science?

It’s no surprise that data science will spread to more industries in the next couple of years. So which industries will be next to hire more data scientists and benefit from big data?

We looked at five very different businesses that are starting to benefit or could benefit from data science, and at how exactly big data can help them achieve success in their fields.

Data Science in Sports

1) Sports

If you saw the movie Moneyball you might know why big data is important to baseball and sports in general. Nowadays, for instance, many NBA teams collect millions of data records per game using cameras installed in the courts. The ultimate goal for all these sports teams is to improve health and safety, and thus the performance of the team and individual athletes. In the same way that businesses seek to use data to customize their operations, it’s easy to see how these two worlds can cross over to benefit the sports world.

Data Science in On-Demand Services

2) On-demand services

Uber gets attention for growth and success that came mainly from how the company uses data. The Uber experience relies on data science and algorithms, so this is a clear example of how on-demand services can benefit from big data. Uber continues to succeed because of the convenience its data-driven product provides. Other on-demand services would do well to follow Uber’s example and rely more on data science.

Data Science in Entertainment Industry

3) Entertainment industry

In this era of connected consumers, media and entertainment businesses must do more than simply be digital to compete. Data science already allows some organizations to understand their audience.

A once content-centric model is turning into a consumer-centric one. The entertainment industry is prepared to capitalize on this trend by converting information into insight that boosts production and cross-channel distribution. From now on it can be expected that those who provide a unique audience experience will be the only ones to achieve growth.

Data Science in Real Estate

4) Real estate agents

We keep hearing that the housing market is unpredictable; however, some top real estate agents claim they saw the housing bubble burst coming well in advance (think again of movies, like The Big Short). This kind of foresight comes from following data and spotting trends, and it is a great way for this volatile industry to prepare for market shifts.

Data Science in Food Industry

5) Restaurant owners

This business field is the epitome of how important it is to be able to tell what customers want. According to the Washington, D.C.-based National Restaurant Association, restaurants face another big obstacle besides rent, licensing and personnel: not only professional critics but also amateurs who offer their opinions on social media. The importance of quality is the reason why restaurants are beginning to use big data to understand customer preferences and to improve their food and service.

The post What Industries Will Be Next to Adopting Data Science? appeared first on 3Blades.

Source: 3blades – What Industries Will Be Next to Adopting Data Science?

What is needed to build a data science team from the ground up?

What specific roles would a data science team need to have to be successful? Some will depend on the organization’s objectives, but there’s a consensus that the following positions are key.

  1. Data scientist. This role should be held by someone who can work on large datasets (on Hadoop/Spark) with machine learning algorithms, who can also create predictive models, and interpret and explain model behavior in layman’s terms. This position requires excellent knowledge of SQL and understanding of at least one programming language for predictive data analysis, such as R and/or Python.
  2. Data engineer / Data software developer. Requires great knowledge of distributed programming, including infrastructure and architecture. The person hired for this position should be very comfortable with the installation of distributed programming frameworks like Hadoop MapReduce/Spark clusters, should be able to code in more than one programming language like Scala/Python/Java, and should know Unix scripting and SQL. This role can also evolve into one of two specialized roles:
    1. Data solutions architect. Basically a data engineer with a broad range of experience across several technologies and a solid understanding of service-oriented architecture concepts and web applications.
    2. Data platform administrator. This position requires extensive experience managing clusters including production environments and good knowledge of cloud computing.
  3. Designer. This position should be occupied by an expert who has deep knowledge of user experience (UX) and interface design, primarily for web and mobile applications, as well as knowledge of data visualization and ideally some UI coding expertise.
  4. Product manager. This is an optional role required only for teams focused on building data products. This person will be defining the product vision, translating business problems into user stories, and focusing on helping the development team build data products based on the user stories.

The post What is needed to build a data science team from the ground up? appeared first on 3Blades.

Source: 3blades – What is needed to build a data science team from the ground up?

What is the best way to sync data science teams?

A well-defined workflow will help a data science team reach its goals. In order to sync a data science team and its members, it’s important to first know each of the phases needed to get data-based results.

When dealing with big data or any type of data-driven goals it helps to have a defined workflow. Whether we want to perform an analysis with the intent of telling a story (Data Visualization) or building a system that relies on data, like data mining, the process always matters. If a methodology is defined before starting any task, teams will be in sync and it will be easy to avoid losing time figuring out what’s next. This will allow a faster production rhythm of course and an overall understanding of what everyone is bringing into the team.

Here are the four main parts of the workflow that every team member should know in order to sync data science teams.

1) Preliminary analysis. When data is brand new, this step obviously has to be performed first. In order to produce results fast you need to get an overview of all data points. In this phase, the focus is to make the data usable as quickly as possible and get quick and interesting insights.

2) Exploratory analysis. This is the part of the workflow where questions will be asked over and over again, and where the data will be cleaned and ordered to help answer those same questions. Some teams end the process here, although it all depends on what we want to do with the data; ideally, two further phases should usually be considered.

3) Data visualization. This step is imperative if we want to show the results of the exploratory analysis. It’s the part where actual storytelling takes place and where we will be able to translate our technical results into something that can be understood by a wider audience. The focus is turned to how to best present the results. The main goal data science teams should aim for in this phase is to create data visualizations that mesmerize users while telling them all the valuable information discovered in the original data sets.

4) Knowledge. If we want to study the patterns in the data to build reliable models, we turn to this phase, in which the team focuses on producing the model that best explains the data, engineering features and then testing different algorithms to find the best performance possible.

These are the key phases around which a data science team should sync up in order to have a finished, replicable and understandable product based on data analysis.

The post What is the best way to sync data science teams? appeared first on 3Blades.

Source: 3blades – What is the best way to sync data science teams?

How Can Businesses Adopt a Data-Driven Culture?

There are small steps that any business can adopt in order to start incorporating a data-driven philosophy into their business. An Economist Intelligence Unit survey sponsored by Tableau Software highlights best practices.

The survey, conducted by the Economist Intelligence Unit, an independent business within The Economist Group providing forecasting and advisory services, highlighted best practices for adopting a data-driven culture, among other findings relevant to the field of data science. To ensure a seamless and successful transition to a data-driven culture, here are some of the top approaches your business should apply:

Share data and prosper

Appreciating the power of data is only the first step on the road to a data-driven philosophy. Older companies can have a hard time transitioning to a data-driven culture, especially if they have achieved success with minimum use of data in the past. However, times are changing and any type of company can benefit from this type of information. More than half of respondents from the survey (from top-performing companies) said that promotion of data-sharing has helped create a data-driven culture in their organization.

Increased availability of training

Around one in three respondents said it was important to have partnerships or in-house courses to make employees more data-savvy.

Hire a chief data officer (CDO)

This position is key to converting data into insight with maximum impact. The task is not easy; on the contrary, it can be highly specialized, and businesses shouldn’t expect their CIO or CMO to perform the job. What is needed is a corporate officer wholly dedicated to acquiring and using data to improve productivity. You may already have someone at your company who can be promoted to CDO: someone who understands the value of data and owns it.

Create policies and guidelines

After the CDO runs a data audit internally, it is relevant that company guidelines are crafted around data analysis. This is how all employees will be equipped with replicable strategies focused on improving business challenges.

Encourage employees to seek data

Once new company policies are in place and running, the next step is to motivate employees to seek answers in data. One of the best ways to do this is to offer incentives (you pick what type). Employees will then feel encouraged to use (or even create) tools and find solutions on their own without depending on the IT department.

The post How Can Businesses Adopt a Data-Driven Culture? appeared first on 3Blades.

Source: 3blades – How Can Businesses Adopt a Data-Driven Culture?

Risk vs. Loss

A risk is defined as the probability that an undesirable event takes place. Since most risks are not totally random but rather depend on a range of influences, we try to quantify a risk function that gives the probability for each set of influences. We then calculate the expected loss by multiplying the cost caused by the occurrence of the event with the risk, i.e. its probability.

Often, the influences can be changed by our actions. We might have a choice. So it makes sense to look for a course of actions that would minimize the loss function, i.e. lead to as little expected damages as possible.
Decisions are increasingly made by algorithms that run in many procedures and on many devices. Prominent examples are credit scoring and shop recommendation systems. In both cases it is clear that the algorithm should be designed to optimize the economic outcome of its decisions, and in both cases two risks emerge: the risk of a false negative (wrongly giving credit to someone who cannot pay it back, resp. making a recommendation that does not fit the customer’s preferences), and the risk of a false positive (not granting credit to a person who would have been creditworthy, resp. not offering something that would have been exactly what the customer was looking for).
There is, however, an asymmetry in the losses of these two risks. In the vast majority of cases it is far easier to calculate the loss for a false negative than for a false positive. The cost of a credit default is straightforward. The cost of someone not getting the money, however, is almost certainly bigger than just the missed interest; the potential borrower might very well go away and never come back, without us ever realizing it.
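The mechanics of the expected-loss calculation can be made concrete with a toy example. All costs and probabilities below are hypothetical, chosen only to illustrate how asymmetric losses shift the decision:

```python
# Hedged sketch: how asymmetric losses shift a credit decision.
# Cost figures and the default probability are invented for illustration.

def expected_loss(p_default: float, cost_default: float,
                  cost_rejection: float, grant: bool) -> float:
    """Expected loss of granting vs. rejecting a single credit request."""
    if grant:
        return p_default * cost_default          # loss if the borrower defaults
    return (1 - p_default) * cost_rejection      # loss of turning away a good customer

def decide(p_default, cost_default=10_000.0, cost_rejection=500.0):
    """Grant credit iff the expected loss of granting is smaller."""
    return expected_loss(p_default, cost_default, cost_rejection, True) \
         < expected_loss(p_default, cost_default, cost_rejection, False)
```

With a default costing 20x a rejection, credit is only granted at low default probabilities. The whole dispute described above lives in `cost_rejection`: whether it covers only missed interest or also the customer who never comes back.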
Even worse, while calculating risk is (more or less) just maths and statistics, different people might not even agree on the losses. In our credit scoring example: one might say, let’s just take what we know for sure, i.e. the opportunity cost of missed interest; another might insist on evaluating a broader range of damages. The line where to stop is obviously arbitrary. So while the risk function can be made somewhat objective, the loss function is much trickier and most of the time prone to doubt and discussion.

Collision decision

In the IoT, the world of connected devices and programmable objects, the problem of risks and losses becomes vital. Self-driving cars will cause accidents too, even if they are much safer than human drivers. If a collision is inevitable, how should the car react? This was the key question asked by Majken Sander in our talk on algorithm ethics at Strata+Hadoop World. If it is just me in the car, a possible manoeuvre would be to turn the car sideways. If, however, my children sit next to me, I might very well prefer a frontal crash and rather have myself injured than my passengers. Whatever I would see as the right way to act, it is clear that I want to make the decision myself. I would not want it decided remotely without my even knowing on what grounds.
Sometimes people mention that even for human casualties a monetary calculation could be done, no matter how cruel that might sound. We could, e.g., take the valuation of humans according to their life expectancy, insurance costs, or any other financial indicator. However, this is clearly not how we usually deal with lethal risks. “No man left behind”: how could we explain Saving-Private-Ryan-ish campaigns on economic grounds? Since a human casualty is regarded in the values of our society as total, not commensurable (even if a compensation can be defined), we get a singularity in our loss function. Our metric just doesn’t work here. Hence there will be no just algorithm to deal with a decision of that dimension.

Calculate risks, let losses be open

We will nevertheless have to find a solution. One suggestion for the car example is that in risky situations the car re-delegates the driving back to a human to let them decide.
This can be generalized: since the losses might be valued differently by different people, it should always be well documented and fully transparent to the users how the losses are calculated. In many cases the loss function could be kept open. The algorithm could offer different sets of parameters to let users decide on the behavior of the product.
As a society we have to demand to be in charge of defining the ethics behind the algorithms. It is a strong cause for regulation; I am convinced of that. It is not an economic but a political task.

Source: Beautiful Data



What to expect from Strata Conference 2015? An empirical outlook.

In one week, the 2015 edition of Strata Conference (or rather: Strata + Hadoop World) will open its doors to data scientists and big data practitioners from all over the world. What will be the most important big data technology trends for this year? As last year, I ran an analysis on the Strata abstracts for 2015 and compared them to the previous years.

One thing immediately stands out: 2015 will probably be known as the “Spark Strata”:

If you compare mentions of the major programming languages in data science, there’s another interesting finding: R seems to be making a comeback and Python may be losing some of its momentum:

R is also among the rising topics if you look at the word frequencies for 2015 and 2014:

Now, let’s take a look at bigrams that have been gaining a lot of traction since the last Strata conference. From the following table, we could expect a lot more case studies than in the previous years:

This analysis has been done with IPython and Pandas. See the approach in this notebook.
Looking forward to meeting you all at Strata Conference next week! I’ll be around all three days and always in for a chat on data science.
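The kind of term-frequency comparison described above can be sketched roughly as follows. This is a guess at the approach, assuming the abstracts are available as plain-text strings per year; the smoothing and tokenization choices are mine, not necessarily those of the linked notebook:

```python
# Rough sketch: which terms gained the most traction between two years
# of conference abstracts? Add-one smoothing avoids division by zero
# for terms absent in the earlier year.
from collections import Counter
import re

def term_freqs(abstracts: list[str]) -> Counter:
    tokens = re.findall(r"[a-z][a-z0-9+#]*", " ".join(abstracts).lower())
    return Counter(tokens)

def rising_terms(freq_new: Counter, freq_old: Counter, top=10):
    """Terms whose relative (smoothed) frequency grew most year-on-year."""
    total_new, total_old = sum(freq_new.values()), sum(freq_old.values())
    growth = {
        t: (freq_new[t] / total_new) / ((freq_old[t] + 1) / total_old)
        for t in freq_new
    }
    return sorted(growth, key=growth.get, reverse=True)[:top]
```

The same counting extends to bigrams by sliding a window of two tokens, which is how the case-study trend mentioned above would surface.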

Source: Beautiful Data

Slow Data


Abstract: Data is the new media. Thus the postulates of our Slow Media Manifesto should be applicable to Big Data, too. Slow Data in this sense is meaningful data, relevant for society, driving creativity and scientific thinking. Slow Data is beautiful data.
From Slow Media to Slow Data
Five years ago, we wrote the Slow Media Manifesto. We were concerned about the strange dichotomy by which people separated old media from new media to make their point about quality, ethics, and aesthetics. With Big Data, I now encounter a similar mindset. Just like people were dismissing social media as just doodles and scribbling, or worse, I now see people scornfully raising their eyebrows about the lack of structure, missing consistency, and other alleged flaws they imagine Big Data to carry. As if “good old data” with a small sample size, representativeness, and other formalistic criteria would be a better thing as such. Again, what these people see is just an evil new vice swept over their mature businesses by unseasoned, however insanely well funded, startups. I have gone through this argument twice already. It was wrong in the 90s when the web started; it was wrong again in the 2000s regarding social media; and it will not become right this time. Because it is not the technology paradigm that makes quality.
A mathematician, like a painter or a poet, is a maker of patterns. If his patterns are more permanent than theirs, it is because they are made with ideas. Beauty is the first test: there is no permanent place in the world for ugly mathematics.
Godfrey Harold Hardy
Data is the new media. I have written about this too. The traditional concept of media becomes more and more directly intertwined with data, with data storytelling, data journalism, and their likes, indirectly because search, targeted advertising, content filtering, and other predictive technologies increasingly influence what we will find presented as media content.
Thus I think it makes sense to take Slow Media and ask about Slow Data, too.
Highly curated small data
For what is useful above all is technique.
Godfrey Harold Hardy
Direct marketing data sets tend not to be of very high quality (sorry, CRM folks, but I know what I am talking about). Many records are only partly qualified, if at all. Moreover, the information on which the targeting is based is often outdated.
Small samples can enhance large heaps of data
In 2006 I oversaw a major market survey, the Typologie der Wünsche. This very expensive market research was conducted diligently according to the rules of the trade of social science. The questionnaire went through the toughest lectorship before it would be considered ready to be sent out to the interviewers. The survey was done face to face, based on a cautiously drawn sample of 10,000 people per year. The results underwent permanent quality assurance. To be sure about the quality of the survey, it was conducted by three independent research agencies. Thus we could cross-check plausibility.
Since my employer was also involved in direct marketing with a huge database of addresses, call centers, and logistics, we developed a method to use the highly curated market survey with its rather small sample to calibrate and enhance the “dirty” records of the CRM business. This was working so well that we started a cooperation with Deutsche Post to do the same, but on a much larger scale. Our small but precious data was matched with all 40 million addresses in Germany.
When working for MediaCom I was involved in a similar project. Television ratings are measured by expensive panels in most markets, usually run and funded by joint industry committees like BARB in the UK or AGF in Germany. Of course such a panel is restricted to just a few thousand households. Since there are only some ten relevant TV channels, this panel size is sufficient to support media planning. But internet usage is so much more fragmented that a panel of that sort would hardly make sense. So we took the data that we had collected via web tracking – again some 40 million records. We again found a way to infuse the TV panel data into the online data and could thereby calculate the probability that the owner of a certain cookie had contact with a certain advertising campaign on TV or not. So again, a small but highly curated and very specialized data set was used to greatly increase the value of the larger Big Data set.
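The fusion idea in both stories above can be sketched in miniature. This is a deliberately simplified toy, assuming a shared segmentation between the panel and the big data set; all field names and segments are hypothetical, and the real projects certainly used richer matching than a single segment key:

```python
# Toy sketch of small-panel data fusion: estimate contact probabilities
# from a curated panel, then project them onto a large cookie data set.
from collections import defaultdict

def contact_rates(panel):
    """P(TV contact | segment), estimated from the small curated panel."""
    seen, total = defaultdict(int), defaultdict(int)
    for person in panel:
        total[person["segment"]] += 1
        seen[person["segment"]] += person["tv_contact"]
    return {s: seen[s] / total[s] for s in total}

def enrich(cookies, rates):
    """Attach the panel-derived probability to every big-data record."""
    return [dict(c, p_tv_contact=rates.get(c["segment"], 0.0)) for c in cookies]

panel = [
    {"segment": "young_urban", "tv_contact": 1},
    {"segment": "young_urban", "tv_contact": 0},
    {"segment": "family", "tv_contact": 1},
]
rates = contact_rates(panel)
```

The small data set does all the epistemic work here; the big data set merely gets the probabilities stamped onto it, segment by segment.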
Bringing scientific knowledge into Big Data
Archimedes will be remembered when Aeschylus is forgotten, because languages die and mathematical ideas do not.
Godfrey Harold Hardy
Another example where small but highly curated data is crucial for data science are data sets that contain scientific information which otherwise is not inherent in the data. Text mining works best when you can use quantitative methods without thinking about those difficult cultural concepts like ‘meaning’ or ‘semantics’. Detection of relevant content with ngram ranking, or text comparison based on cosine vector distance, are the most powerful tools to analyze texts even in unfamiliar languages or alphabets. However, all the quantitative text mining procedures require the text to be preprocessed: all vocabulary with only grammatical function that would not add to the meaning has to be stripped off first. It is also useful to bring the words to their root form (verbs into the infinitive, nouns into the nominative singular). This indispensable work is done with special corpora, dictionaries, or better call them libraries, that contain all the required information. These corpora are handmade by linguists. Packages like Python’s NLTK incorporate them in a handy way.
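The pipeline just described can be sketched end to end in a few lines. The tiny stopword set below stands in for the curated, linguist-made corpora the text refers to, and the suffix-stripping is a deliberately naive stand-in for proper lemmatization:

```python
# Toy text-mining pipeline: strip function words, crudely root the rest,
# then compare two texts by cosine similarity of their term vectors.
from collections import Counter
import math
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it"}  # tiny stand-in

def preprocess(text: str) -> list[str]:
    words = re.findall(r"[a-z]+", text.lower())
    # Naive rooting: strip a plural 's' (a real pipeline uses a lemmatizer).
    return [w.rstrip("s") for w in words if w not in STOPWORDS]

def cosine(a: str, b: str) -> float:
    va, vb = Counter(preprocess(a)), Counter(preprocess(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

The quantitative machinery is trivial; the value sits entirely in the hand-curated stopword lists and morphological dictionaries, which is exactly the point about small data.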
In his talk “The Sidekick Pattern: Using Small Data to Increase the Value of Big Data”, Abe Gong from Jawbone gives more examples of small data that transmutes the leaden Big Data heaps into gold. His alchemist data science presentation is a highly recommended read.
Data as art
I am interested in mathematics only as a creative art.
Godfrey Harold Hardy
“Beautiful evidence” is what Edward Tufte calls good visualization. Information can truly be brought to us in a beautiful way. Data visualization as an art form also entered the sanctum of high art when the group Asymptote was presented at Documenta 11 in 2002. Visual storytelling has since transformed: what used to be cartoons or engravings illustrating the text is now infographics that are the story.

Generative art is another data-driven art format. When I was an undergraduate, The Fractal Geometry of Nature had finally trickled down to the math classes. With my Atari Mega ST I devoured every fractal code snippet I could get my hands on. What fascinated me most were not the (usually rather kitschy) colorful fractal images. I wanted to have fractal music, generative music that would evolve algorithmically from my code.
Although fractals as an art thing were certainly more of a fad, not well suited to becoming real art, generative art as such has since become a strong branch of the arts. Much of today’s music relies heavily on algorithmic patterns in many of its dimensions, from rhythm to tune to overtone spectra. In video art, too, algorithmically rendered images are ubiquitous.
Art from data will further evolve. I trust we will see data fiction become a genre of its own.
Data as critique
… there is no scorn more profound, or on the whole more justifiable, than that of the men who make for the men who explain. Exposition, criticism, appreciation, is work for second-rate minds.
Godfrey Harold Hardy
Critique is the way to think in the alternative. Critique means not to trust what is sold to you as truth. Data is always ambiguous. Meaning is imposed upon data by interpretation. Critique is to deconstruct interpretation, to give room for other ways to interpret. The other stories we may draw from our data do not have to be more plausible, at all. Often the absurd is what unveils hidden aspects of our models. As long as our alternative interpretations are at least possible, we should follow these routes to see where they end. Data fiction is the means to turn data into a tool of critique.
Data science has changed our perception of how lasting we take our results to be. In data science we usually do not see a conclusion as true or permanent. Rather we hope that a correlation or pattern that we observe will remain stable, at least for a while. There is no hypothesis that we would accept and then tick off just because our test statistics turned significant. We would always continue to a/b-test alternative models, that would substitute an earlier winner of the test-game. In data science, we maximize critical thinking by not even seeing what we do as falsification because we would not have thought of the previous state as true in the first place. Truth in data science means just the most plausible interpretation at a time; ephemeral.
Slow Data thus means to use data to deconstruct the obvious, as well as to build alternatives.
Ethical data
A science is said to be useful if its development tends to accentuate the existing inequalities in the distribution of wealth, or more directly promotes the destruction of human life.
Godfrey Harold Hardy
The two use cases that dominate the discussion about Big Data are the exact opposite of ethical: targeted advertising and mass surveillance. As Bruce Sterling points out, both are in essence just two aspects of the same thing, which he calls ‘surveillance marketing’. I feel sad that this is what seems to be the prominent use of our work: to sell things to people who do not want them, and to keep people down.
However, I am confident that the benign uses of Big Data will soon offer such high incentives that we will awake from our military-marketing nightmares. With open data we build a public space. The most useful Big Data tools are all in the public domain anyway: Hadoop, Mesos, R, Python, Gephi, etc.
Ethical data is data that makes a difference for society. Ethical data is relevant for people’s lives: To control traffic, to make agriculture more sustainable, to supply energy, to help plan cities and administer the states. This data will be crucial to facilitate our living together with ten billion people.
Slow Data is data that makes a difference for people’s lives.
Political data
It is never worth a first class man’s time to express a majority opinion. By definition, there are plenty of others to do that.
Godfrey Harold Hardy
“Code is Law” is the catchphrase of Lawrence Lessig’s famous bestseller on the future of democracy. From the beginning of the Internet revolution, there has been a discussion about whether our new forms of media and communication would lead to another revolution as well: a political one. Many of the media and platforms that rose over the last decade show aspects of communal or even social systems, and hence might be called Social Media with good cause. Thus, it does not come as a surprise that we start to see communication platforms that are genuinely meant to support, and at the same time to experiment with, new forms of political participation, like Proxy Voting or Liquid Democracy, which would have been hardly conceivable without the infrastructure of the Web. Since these new forms of presenting, debating, and voting for policies have emerged only recently, we can expect many other varieties to appear, new concepts for translating the internet paradigm into social decision making. Nevertheless, how do these new forms of voting work? Do they really map the volonté générale into decisions? If so, will they work in a sustainable, stable, continuous way? And how do we evaluate the systems, one compared to another? I currently work on a scientific research project on how to deal with these questions. Today I am not yet ready to present conclusions. Nonetheless, I already see that using data for quantitative simulation is a good approach to approximate the complex dynamics of future data-driven political decision-making.
Politics as defined by Aristotle means having the freedom to make decisions based on ethics and beliefs, not driven by necessities; acting out of necessity is what he calls economics. To deal with law in this sense is similar to my text mining example above. If law is codified, it can be executed syntactically, indeed quite similar to a computer program. But to define what is just, what should be put into the laws, is not syntactical at all. Ideally this would be exclusively political. I do not think algorithmic legislation would be desirable; I doubt that it would even be feasible.
Slow Data means to use data to explore new forms of political participation without rush.
Machine thinking
Chess problems are the hymn-tunes of mathematics.
Godfrey Harold Hardy
‘Could a machine think?’ is the core question of AI. The way we think about answering this question immediately leads us beyond computer science: What does it mean to think? What is consciousness? Since the 1980s there has been a fascinating exchange of arguments about the possibility of artificial intelligence, culminating in the Chinese Room debate between John Searle and the Churchlands. Searle, and in an even more abstract way David Chalmers, made good points as to why a simulation of consciousness, even one that would pass the Turing test, would never become really conscious. Their counterparts, most prominently Douglas Hofstadter, would reject Chalmers’s neo-Kantianism as metaphysics.
Google has recently published an interesting paper on artificial visual intelligence. They trained mathematical models with random pictures from social media sites. And – surprise! – their algorithm came up with a concept of “What is a cat?”. The point is, nobody had told the algorithm to look for cat-like patterns. Are we witnessing the birth of artificial intelligence here? On the one hand, Google’s algorithm seems to do exactly what Hofstadter predicted: it is adaptive to environmental influences and translates sensory inputs into something that we interpret as meaning. On the other hand, the training sample was far from random. The pictures were what people had pictured. It was a collaboratively curated set of rather small variety. The pattern the algorithm found was in fact imposed by “classic” consciousnesses, by the minds of “real” people.
Slow Data is the essence that makes our algorithm intelligent.
The beauty of scientific data
Beauty is the first test: there is no permanent place in this world for ugly mathematics.
Godfrey Harold Hardy
Now returning to Hardy’s quote from the beginning: when I was studying mathematics, I was puzzled by the strange aestheticism that many mathematicians would force upon their trains of thought. Times have changed since then. Today many theorems that were considered hard problems have been solved. Computer-assisted proof has taken its place in mathematical epistemology. Proofs filling thousands of pages are not uncommon.
Science, physics in particular, is driven by accurate data. Kepler could dismiss the simple heliocentric model of circular orbits because Tycho Brahe had measured the movements of the planets to such accuracy that circular orbits could no longer be maintained. Edwin Hubble discovered the structure of our expanding universe because Milton Humason and other astronomers at Mt. Wilson had provided spectroscopic images of thousands of galaxies, exact enough to derive Hubble’s constant from the redshift of the prominent Fraunhofer lines. Einstein’s Special Theory of Relativity relies on the data of Michelson and Morley, who had shown that light travels at constant speed, no matter at what angle to the direction of the Earth’s travel around the Sun it is measured. Such uncompromisingly accurate data, collected in a painstaking struggle without any guarantee that it would pay off – this is what really brought the great breakthroughs in science.
Finally, while mathematics is partially turning into syntax, the core of physics at the same time unfolds in the strange blossoms of the most beautiful mathematics imaginable. At the intersection of cosmology, which deals with the very largest object imaginable – the entirety of the cosmos – and quantum physics on the smallest scale lies the alien world of black holes, string theory, and quantum gravity. The scale of these phenomena, the fabric of space-time, is likely defined by relating Planck’s constant to Newton’s constant and the speed of light. It is so unimaginably small – some 20 orders of magnitude smaller than the size of an electron – that we cannot expect to measure any data anywhere near it any time soon. We thus can only rely on our logic, our sense of mathematical harmony, and the creative mind.
Slow Data
Slow Data – for me the space of beautiful data is spanned by these aspects. I am confident that we do not need an update to our manifesto. However, I hope that we will see many examples of valuable data, of data that helps people, that creates experiences unseen, and that opens the doors to new worlds of our knowledge and imagination.

Appendix: Slow Media
The Slow Media movement was kicked off with the Slow Media Manifesto that Sabria David, Benedikt Koehler and I wrote on New Year’s Day 2010. Immediately after we had published the manifesto, it was translated into Russian, French, and some 20 other languages.
On our Slow Media blog you may find more on slowness:
In German: slow-media.net
In English: en.slow-media.net
also: “Slow – the open alternative to platform capitalism”

Source: Beautiful Data



How to analyze smartphone sensor data with R and the BreakoutDetection package

Yesterday, Jörg wrote a blog post on Data Storytelling with smartphone sensor data. Here’s a practical approach to analyzing smartphone sensor data with R. In this example I will be using the accelerometer data that Datarella provided in its Data Fiction competition. The dataset shows the acceleration along the three axes of the smartphone:

x – sideways acceleration of the device
y – forward and backward acceleration of the device
z – acceleration up and down

The interpretation of these values can be quite tricky, because on the one hand there are manufacturer-, device- and sensor-specific variations and artifacts. On the other hand, all acceleration is measured relative to the sensor orientation of the device. So, for example, the activity of taking the smartphone out of your pocket and reading a tweet can look like this:

y acceleration – the smartphone had been in the pocket top down and is now taken out of the pocket
z and y acceleration – turning the smartphone so that it is horizontal
x acceleration – moving the smartphone from the left to the middle of your body
z acceleration – lifting the smartphone so you can read the fine print of the tweet

And third, there is gravity influencing all the measurements.
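One simple way around the orientation problem is to look at the overall magnitude of the acceleration vector, which does not depend on how the phone is held: at rest it stays close to standard gravity (about 9.81 m/s²), whichever axis carries it. A minimal sketch, in Python purely for illustration; the first set of sample values is taken from the first row of the dataset shown below, the second is a hypothetical upright reading:

```python
import math

def magnitude(x, y, z):
    # Euclidean norm of the acceleration vector; invariant under
    # rotations of the device, unlike the individual axis readings.
    return math.sqrt(x * x + y * y + z * z)

# First row of the accelerometer dataset: phone at rest, z axis up.
flat_on_table = magnitude(-0.067, 0.057, 9.615)
# Hypothetical reading with the phone held upright (gravity on y).
upright_in_hand = magnitude(0.3, 9.55, 1.2)

print(round(flat_on_table, 2))    # 9.62 - close to standard gravity
print(round(upright_in_hand, 2))  # 9.63 - same magnitude, different axes
```

Deviations of this magnitude from 9.81 then indicate actual movement rather than a mere change of orientation.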
So, finding out what you are really doing with your smartphone can be quite challenging. In this blog post, I will show how to do one small task: identifying breakpoints in the dataset. As a nice side effect, I use this opportunity to introduce an application of Twitter’s BreakoutDetection open source library (see GitHub) that can be used for Behavioral Change Point analysis.
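To make clear what a “breakout” is before running the library, here is a toy version of the idea, sketched in Python purely for illustration: find the split point that maximizes the shift in mean between the two segments. This is not the library’s algorithm – BreakoutDetection implements the far more robust E-Divisive with Medians and finds multiple change points – it only conveys the intuition:

```python
def simple_breakout(series, min_size=10):
    # Toy single-changepoint finder: choose the index where the
    # difference between the left and right segment means is largest.
    best_split, best_shift = None, 0.0
    for i in range(min_size, len(series) - min_size + 1):
        left, right = series[:i], series[i:]
        shift = abs(sum(left) / len(left) - sum(right) / len(right))
        if shift > best_shift:
            best_split, best_shift = i, shift
    return best_split

# A signal whose mean jumps from 0 to 5 halfway through:
signal = [0.0] * 50 + [5.0] * 50
print(simple_breakout(signal))  # 50
```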
First, I load the dataset and take a look at it:

accel <- read.csv("SensorAccelerometer.csv", stringsAsFactors=F)
head(accel)

user_id x y z updated_at type
1 88 -0.06703765 0.05746084 9.615114 2014-05-09 17:56:21.552521 Probe::Accelerometer
2 88 -0.05746084 0.10534488 9.576807 2014-05-09 17:56:22.139066 Probe::Accelerometer
3 88 -0.04788403 0.03830723 9.605537 2014-05-09 17:56:22.754616 Probe::Accelerometer
4 88 -0.01915361 0.04788403 9.567230 2014-05-09 17:56:23.372244 Probe::Accelerometer
5 88 -0.06703765 0.08619126 9.615114 2014-05-09 17:56:23.977817 Probe::Accelerometer
6 88 -0.04788403 0.07661445 9.595961 2014-05-09 17:56:24.53004 Probe::Accelerometer

This is the sensor data for one user on one day:

library(ggplot2)

accel$day <- substr(accel$updated_at, 1, 10)
df <- accel[accel$day == '2014-05-12' & accel$user_id == 88,]
df$timestamp <- as.POSIXlt(df$updated_at) # Transform to POSIX datetime
ggplot(df) + geom_line(aes(timestamp, x, color="x")) +
  geom_line(aes(timestamp, y, color="y")) +
  geom_line(aes(timestamp, z, color="z")) +
  scale_x_datetime() + xlab("Time") + ylab("acceleration")

Let’s zoom in to the period between 12:32 and 13:00:

ggplot(df[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 13:00:00',]) +
  geom_line(aes(timestamp, x, color="x")) +
  geom_line(aes(timestamp, y, color="y")) +
  geom_line(aes(timestamp, z, color="z")) +
  scale_x_datetime() + xlab("Time") + ylab("acceleration")

Then, I load the BreakoutDetection library and run the analysis:

library(BreakoutDetection)

bo <- breakout(df$x[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 12:35:00'],
               min.size=10, method='multi', beta=.001, degree=1, plot=TRUE)
bo$plot # display the annotated plot

This quick analysis of the acceleration in the x direction gives us 4 change points where the acceleration suddenly changes. In the beginning, the smartphone seems to lie flat on a horizontal surface: the sensor reads a value of around 9.8 on the z axis in the positive direction, which means the gravitational force acts on this axis alone and not on the x and y axes. Ergo: the smartphone is lying flat. But then things change, and after a few movements (our change points) the last observation has the smartphone in a position where the x axis reads around -9.6 acceleration, i.e. the smartphone is being held in landscape orientation, pointing to the right.
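The reasoning above – reading the device orientation off the axis that carries gravity – can be condensed into a small helper. Again in Python and purely illustrative; the function name and the 8.0 threshold are my own assumptions, not part of the original analysis:

```python
def dominant_axis(x, y, z, threshold=8.0):
    # When the phone is (nearly) at rest, gravity (~9.81) shows up almost
    # entirely on one axis; report which one and in which direction.
    for name, value in (("x", x), ("y", y), ("z", z)):
        if abs(value) > threshold:
            return name, "positive" if value > 0 else "negative"
    return None  # moving, or gravity split across several axes

print(dominant_axis(-0.067, 0.057, 9.615))  # ('z', 'positive'): lying flat
print(dominant_axis(-9.6, 0.3, 0.5))        # ('x', 'negative'): landscape
```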

Source: Beautiful Data