How to analyze smartphone sensor data with R and the BreakoutDetection package

Yesterday, Jörg wrote a blog post on Data Storytelling with smartphone sensor data. Here’s a practical approach to analyzing smartphone sensor data with R. In this example I will be using the accelerometer smartphone data that Datarella provided in its Data Fiction competition. The dataset shows the acceleration along the three axes of the smartphone:

x – sideways acceleration of the device
y – forward and backward acceleration of the device
z – acceleration up and down

The interpretation of these values can be quite tricky. On the one hand, there are manufacturer-, device- and sensor-specific variations and artifacts. On the other hand, all acceleration is measured relative to the sensor orientation of the device. So, for example, the activity of taking the smartphone out of your pocket and reading a tweet can look like this:

y acceleration – the smartphone had been in the pocket top down and is now taken out of the pocket
z and y acceleration – turning the smartphone so that it is horizontal
x acceleration – moving the smartphone from the left to the middle of your body
z acceleration – lifting the smartphone so you can read the fine print of the tweet

And third, there is gravity influencing all the movements.
So, finding out what you are really doing with your smartphone can be quite challenging. In this blog post, I will show how to do one small task – identifying breakpoints in the dataset. As a nice side effect, I use this opportunity to introduce an application of Twitter’s BreakoutDetection open-source library (see GitHub) that can be used for Behavioral Change Point analysis.
First, I load the dataset and take a look at it:

setwd("~/Documents/Datarella")
accel <- read.csv("SensorAccelerometer.csv", stringsAsFactors=F)
head(accel)

user_id x y z updated_at type
1 88 -0.06703765 0.05746084 9.615114 2014-05-09 17:56:21.552521 Probe::Accelerometer
2 88 -0.05746084 0.10534488 9.576807 2014-05-09 17:56:22.139066 Probe::Accelerometer
3 88 -0.04788403 0.03830723 9.605537 2014-05-09 17:56:22.754616 Probe::Accelerometer
4 88 -0.01915361 0.04788403 9.567230 2014-05-09 17:56:23.372244 Probe::Accelerometer
5 88 -0.06703765 0.08619126 9.615114 2014-05-09 17:56:23.977817 Probe::Accelerometer
6 88 -0.04788403 0.07661445 9.595961 2014-05-09 17:56:24.53004 Probe::Accelerometer

This is the sensor data for one user on one day:

accel$day <- substr(accel$updated_at, 1, 10)
df <- accel[accel$day == '2014-05-12' & accel$user_id == 88,]
df$timestamp <- as.POSIXlt(df$updated_at) # Transform to POSIX datetime
library(ggplot2)
ggplot(df) + geom_line(aes(timestamp, x, color="x")) +
  geom_line(aes(timestamp, y, color="y")) +
  geom_line(aes(timestamp, z, color="z")) +
  scale_x_datetime() + xlab("Time") + ylab("acceleration")

Let’s zoom in to the period between 12:32 and 13:00:

ggplot(df[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 13:00:00',]) +
  geom_line(aes(timestamp, x, color="x")) +
  geom_line(aes(timestamp, y, color="y")) +
  geom_line(aes(timestamp, z, color="z")) +
  scale_x_datetime() + xlab("Time") + ylab("acceleration")

Then, I load the BreakoutDetection library:

install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)
bo <- breakout(df$x[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 12:35:00'],
               min.size=10, method='multi', beta=.001, degree=1, plot=TRUE)
bo$plot

This quick analysis of the acceleration in the x direction gives us four change points, where the acceleration suddenly changes. In the beginning, the smartphone seems to lie flat on a horizontal surface – the sensor is reading a value of around 9.8 in the positive direction – meaning the gravitational force affects only this axis and not the other two. Ergo: the smartphone is lying flat. But then things change, and after a few movements (our change points) the last observation has the smartphone in a position where the x axis reads around -9.6 acceleration, i.e. the smartphone is being held in landscape orientation pointing to the right.
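
To see when exactly these changes happened, we can map the change points back to timestamps. A minimal follow-up sketch, assuming the breakout() call above: the returned list stores the indices of the detected change points in its loc element.

window <- df[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 12:35:00',]
bo$loc                    # row indices of the detected change points
window$timestamp[bo$loc]  # the corresponding points in time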

Source: Beautiful Data

Data Storytelling: Stepwise Abstraction from Raw Data

Data storytelling has become a regular topic at data science conferences, and with good cause. First, the story is what gives meaning to the data, leads people to understand our analysis, and supports the discussion of our findings. Second, our interpretation of the data is at least to some extent arbitrary and subjective, and no harm is done in admitting that. Compared to stories without any data support, however, data-driven narratives have a far better chance of sustaining their claims. No wonder data-driven journalism is on the rise.

In social sciences, we are used to data that are already highly abstract. We ask people, “Can you remember this ad?”, without much questioning the concepts behind what we presume to be words of everyday language. Hence the interpretation is straightforward.

When we use measurements instead of verbal surveys, the situation is much more complicated (but also much more interesting). The data we collect, e.g. by tracking mobile phones, doesn’t as such tell us much at all.

A useful step-by-step way to get meaning into data by gradually abstracting it was proposed by Pei et al.: “Human Behavior Cognition Using Smartphone Sensors”, Sensors 2013, 13, 1402-1424; doi:10.3390/s130201402.
My approach is just a simplification of theirs.
In the first layer, we collect the raw data – which often is a demanding task in its own right. Raw data is just tables with numbers. Of course we know how to interpret latitude and longitude, but even location data is much richer than just coordinates. To interpret the other readings, we need metadata.
With the data just collected, we still do not see much. We have absolute numbers encoded on an arbitrary scale. If, e.g., we have distance or speed measurements, the numbers won’t tell us whether a metric or imperial scale applies. We don’t know the tolerances either, don’t see the bias in missing values, and so on. So we usually have to enrich the raw readings with metadata. This step is called data munging.
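
A minimal munging sketch in R (the file and column names here are illustrative assumptions, not part of the original data): parse timestamps into a proper type, document the unit conversion, and make impossible readings explicit as missing values.

raw <- read.csv("sensor_raw.csv", stringsAsFactors = FALSE)
raw$timestamp <- as.POSIXct(raw$updated_at)   # encode time as a proper datetime type
raw$speed_ms  <- raw$speed_kmh / 3.6          # document the scale: km/h -> m/s
raw$speed_ms[raw$speed_ms < 0] <- NA          # impossible readings become explicit NAs
summary(raw$speed_ms)                         # check ranges and missingness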
Now we start abstracting from the raw data. In this example of gyroscope data collected on a smartphone, we see sharp spikes shooting out regularly. This is a typical hardware artifact found everywhere in sensor data. These artifacts are quite unique to a specific device and can be used to re-identify it, like a fingerprint. Thus, in the second layer, we derive events from the data. For the motion data collected, e.g., with some fitness-tracker wristband, that would mean calculating the number of steps walked. What an event is might be highly arbitrary: most tracking gadgets count the number of steps significantly differently, depending on the model chosen.
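
As an illustration, step counting can be sketched as simple peak detection over the acceleration magnitude. This is a toy version under assumed column names and a made-up threshold; real trackers use far more elaborate filtering.

imu <- read.csv("tracker_accelerometer.csv")          # assumed columns: x, y, z
imu$mag <- sqrt(imu$x^2 + imu$y^2 + imu$z^2)          # overall acceleration magnitude
peaks <- which(diff(sign(diff(imu$mag))) == -2) + 1   # indices of local maxima
threshold <- 10.8                                     # just above the ~9.8 gravity baseline; tunable
sum(imu$mag[peaks] > threshold)                       # count pronounced peaks as steps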
What somebody understands as the occurrence of a certain event is also at least partly subjective. I might count some movement of mine as a step, while someone else might already call it a leap. What we need to understand the events is context.
In the third layer, we derive simple context, e.g. by adding location data or other environmental information like temperature. Most fitness trackers do this in their dashboards by showing our training efforts in the context of the situation they could easily match with it: Strava, e.g., shows grade and change in altitude. Did we run uphill or downhill?
The fourth layer, finally, is rich context. What did really happen? Rich context can hardly ever be drawn from our data alone. Historic, cultural, or medical conditions add to it. We won’t tell a plausible story if we don’t embed it in the panorama that our audience would expect to experience had they lived through the story in person. For rich context, we regularly need people’s opinions and personal situations. This is where data science finally gets married to classic social research: the questionnaire-based interview – just ask people what they experienced while we measured what happened.
Data science lays the grounding for our pyramid, with social science at its pinnacle.

Source: Beautiful Data

How Data Science Can Be Used For Social Good

08 Jan 2015 – Chicago

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


Introduction

Give Directly

Credit: Google Images

In 2013, Kush Varshney, a researcher at IBM, signed up through a nonprofit called DataKind to volunteer his technical skills on pro bono projects. DataKind’s flagship program, DataCorps, assembles teams of data scientists to partner with social organizations like governments, foundations or NGOs for three- to six-month collaborations to clean, analyze, visualize and otherwise use data to make the world a better place.

Kush, who holds a PhD in electrical engineering and computer science from MIT, was promptly contacted by DataKind to work on a project with GiveDirectly. He was joined by another team member, Brian Abelson, himself now a data scientist at an open data search company. The two were brought together to tackle a challenging problem for the nonprofit.

GiveDirectly conducts direct cash transfers to low-income families in Uganda and Kenya through mobile payments. These donations are given with no strings attached, trusting that the poor know best how to use the money effectively. One of the top-rated charities on GiveWell, GiveDirectly has had randomized controlled trials conducted to evaluate the effectiveness of its approach, with strong positive results.

GiveDirectly’s model is to conduct direct cash transfers to villages with large numbers of residents in poverty. However, to assess which villages these are, the organization relied on staff members to individually visit villages in Uganda and Kenya and assess the relative poverty of the inhabitants.

When I spoke with Kush he described some drawbacks of this method, saying, “This method could be costly in both time required to visit each site, and in using donations to help pay wages for inspections that could otherwise be going directly to the poor.”

Together with GiveDirectly, Kush and Brian sought a better way to accomplish this task.

Enter data science.

What Is Data Science?

Data Science Venn Diagram

Credit: Drew Conway – The Data Science Venn Diagram

Data science is an emerging discipline that combines techniques of computer science, statistics, mathematics, and other computational and quantitative disciplines to analyze large amounts of data for better decision making. The field arose in response to the fast growing amount of information and the need for computational tools to augment humans in understanding and using that data.

Rayid Ghani, Director of the Data Science for Social Good Fellowship and former Chief Scientist for the Obama 2012 campaign, noted that “the power of data science is typically harnessed in a spectrum with the following two extremes: helping humans in discovering new knowledge that can be used to inform decision making, or through automated predictive models that are plugged into operational systems and operate autonomously.” Put plainly, these two ways of using data can be summarized as turning data into knowledge, or converting data into action.

Chiefly responsible for wrangling findings and crafting models from the data is an emerging profession: the data scientist. The “scientist” portion of the title conjures a vision of academia, partially as a result of many data scientists holding advanced STEM degrees, but it also paints a false picture of a data scientist as someone holed up in the research lab of an organization, tinkering away on esoteric questions. This view characterizes the data scientist as peering into the depths of “Big Data” in pursuit of knowledge.

Rayid debunks this myth, saying that “frequently, however, the challenge in data science is not the science, but rather the understanding and formulation of the problem; the knowledge of how to acquire and use the right data; and once all that work is done, how to operationalize the results of the entire process.” Accordingly, the real role of a data scientist should be thought of as much more embedded in the core of a company or non profit, directly shaping the scope and direction of the organization’s products and services.

The handiwork of data scientists can be found in a plethora of products we interact with every day. Facebook uses data from each visit to tailor the posts you see in your News Feed. Amazon takes account of what you’ve purchased to recommend other items for purchase. PayPal roots out fraudulent behavior by analyzing the data from seller-buyer transactions.

So far, most of the uses of data science have been towards business objectives. The technology, financial services and advertising industries are rife with opportunities to convert data into profit. But now, more and more innovative social sector organizations like GiveDirectly are catching on to how technology and data science can be used to solve their problems.

Organizations like Rayid’s Data Science for Social Good Fellowship, Y Combinator-backed nonprofit Bayes Impact, and DataKind are popping up to fund, train and deploy excellent data scientists to tackle pressing social issues.

Data Science In Action

In the case of GiveDirectly, Kush and Brian were tasked to use their computational data science skills to help discover where the poorest villages were located, so that donations could be channeled to households with the highest needs.

To do this, Kush and Brian used GiveDirectly’s knowledge that an indication of the poverty of a household is the type of roofing of their home. Kush told me that in Kenya, “poorer families tended to live in homes with thatched roofs. On the other hand, a home with a metal roof typically meant the family was well-to-do enough to purchase a more sturdy shelter.”

Thatched vs. Metal Roofs

Credit: GiveDirectly

Using this knowledge, Kush and Brian used Google Maps to extract satellite images of the various villages in Kenya and deployed an algorithm that used the coloring of the roof to determine whether it was made of metal or straw. Doing this across all of the houses in a village could give an estimate of the level of poverty in that village.
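
To make the idea concrete, here is a hedged sketch of a color-based roof classifier in R (not GiveDirectly’s actual algorithm; the file name and thresholds are made-up assumptions): bright, low-saturation pixels are treated as metal, darker brown pixels as thatch.

library(jpeg)                                 # install.packages("jpeg") if needed
img <- readJPEG("village_tile.jpg")           # H x W x 3 array of RGB values in [0, 1]
r <- img[, , 1]; g <- img[, , 2]; b <- img[, , 3]
brightness <- (r + g + b) / 3
saturation <- pmax(r, g, b) - pmin(r, g, b)
metal_like  <- brightness > 0.6 & saturation < 0.15   # bright, near-grey pixels
thatch_like <- r > g & g > b & brightness < 0.5       # darker, brown-ish pixels
# crude village-level indicator: share of roof-like pixels that look metallic
mean(metal_like) / (mean(metal_like) + mean(thatch_like))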

In early 2014, GiveDirectly piloted this algorithm to detect poverty levels in 50 different villages in Kenya. It was doing so in one of its largest campaigns, moving $4 million to households all over western Kenya.

By employing Kush and Brian’s algorithm, GiveDirectly eliminated over 100 days of manual village inspection. In doing so, over $4,000 was saved, allowing GiveDirectly to fund four more households.

Excited by the potential of data science to play a role in more effectively helping families escape poverty, GiveDirectly is now in discussions with Kush, Brian and DataKind about how their algorithm can be used even more precisely and scaled to additional villages.

Potential To Build The Future

As an increasing volume of information is generated by the world, there will be more opportunities to apply data science to socially meaningful causes. What if we could help guidance counselors predict which students were the most likely to drop out, and then design successful interventions around them? What if we could improve parole decisions, reduce prison overcrowding and lower recidivism?

Examples of how data science can be applied to the social sector include:

  • Reduce crime and recidivism: Predictive modeling can be used to assess whether an inmate is likely to reoffend, informing the parole decision.
  • Give tailored feedback and content to students: Adaptive tutoring software can model how much students are learning and understanding, tailoring problems accordingly.
  • Spot nutrition deficiencies: Data tools can monitor vitamin and mineral intake, warning users of deficiencies in their dietary and health habits.
  • Prevent shootings early: Network-based analyses of gangs can be used to predict where and when future shootings will occur.
  • Diagnose diseases early: Genetic, imaging, and EMR data can be leveraged to provide early diagnosis of diseases such as Parkinson’s, M.S., and autism.

It’s clear that we can be optimistic about how data scientists can use the data at their fingertips for social good. As an emerging technological frontier, data science is in a position of immense potential. As a result, there is much to explore about how we can use it to push the human race forward.

References

Kush Varshney and Brian Abelson (2014), “Targeting direct cash transfers to the extremely poor”


I write about data science applied to social causes. If you want to be notified when my next post is published, subscribe by clicking here.



Source: Carl Shan – How Data Science Can Be Used For Social Good

Weeks 7-12: Summer Wrapup

13 October 2014 – Chicago

This is the final post in a series of posts chronicling my reflections on participating in the 2014 Data Science for Social Good Fellowship at the University of Chicago. While I had intended to post once a week, I ended up falling short of my goal. Work from DSSG piled up, making it tough to write thoughtful posts on a weekly schedule.

Nevertheless, I intend for this to be a wrap-up post that summarizes the work my team and I did. Reading it will give you a sense of the different experiences, learnings and findings I encountered over the summer.

You can read my last post here:

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


Health Leads

“It is health that is real wealth and not pieces of gold and silver.” – Mahatma Gandhi

Introduction

President Obama’s Affordable Care Act enacted broad reforms across the United States’ healthcare system. While the healthcare landscape has changed drastically, one important constant remains: a person’s health is affected significantly by non-medical factors.

For example, a patient with an asthmatic condition caused by a moldy apartment will not be cured simply with better medicine. She needs a better apartment, and yet our health care system is not traditionally set up to handle these non-medical issues.

During this summer’s DSSG Fellowship, our team — Chris Bopp, Cindy Chen, Isaac McCreery, myself and mentor Young-Jin Kim — worked with a nonprofit called Health Leads to apply data science to address these non-medical needs, to help patients get access to basic resources vital for a healthy life.

Health Leads

In 1996, Harvard sophomore Rebecca Onie was a volunteer at Greater Boston Legal Services, assisting low-income clients with housing problems. She found herself speaking with clients facing health issues brought on by their poverty. Some lived in dilapidated apartments, infested with rodents and insects. Others couldn’t afford basic necessities like food. Modern medicine was largely ineffective against these issues. Doctors were trained to treat medical ills, not social ones.

Inspired by her experiences, Rebecca launched a health services nonprofit called Health Leads, which recruits and trains college students to work closely with doctor-referred patients who need basic resources such as food, transportation, or housing. These college students, called “Advocates” in Health Leads lexicon, learn about each patient’s needs and meticulously dig up resource providers — food banks, employment opportunities, childcare services — that can fulfill them.

In the nearly two decades since Health Leads’ inception, its impact on the health landscape has been tremendous. In 2013 alone, Health Leads Advocates worked with over 11,000 patients to connect them with basic services and resources.

The Problem

Serving a predominantly low-income patient population can pose a challenge for Health Leads. Some patients will lack stable, permanent housing or employment. Others may not own a cell phone on which they can be consistently reached. Health Leads noticed that these circumstances affected their work with some patients: despite Advocates’ best efforts, a proportion of their clients would disconnect from the program. These clients would be unreachable, not returning phone calls, and ultimately Advocates would be forced to close their cases — never knowing if these clients received the basic resources they needed.

Below is an image displaying the phone calls made to a random group of 200 different patients and whether they responded or not. Half of the clients worked with Health Leads through the completion of their case and the other half ultimately disconnected from Health Leads’ program.

Patient Disconnection vs. Success

(The cases with negative days are ones where Health Leads took down a patient’s information but didn’t begin working with them until a few days later.)

Just at a glance, there appear to be pretty clear differences between the two groups. Most obviously, the disconnected patients seem to have many more failed communication attempts (red dots) than successful ones (green dots).

However, Health Leads wanted to know: exactly what are the factors that contribute to a patient disconnecting from Health Leads? How does the difficulty of a patient’s need play into the problem? What other factors might be important to consider?

Against the backdrop of these pressing questions, Health Leads came to our DSSG team to use data to help discover some answers.

The Challenges

When we began tackling the problem, we ran into a slew of challenges. Unlike in the internet world where companies can track every iota of data down to the click, nonprofits serve their clients in person – meaning data must be manually recorded, rather than passively accumulated.

Furthermore, it may be that the factors we end up discovering as influencing patient outcomes are outside of Health Leads’ control. What if we found that the most significant indicators of patients’ success were gender or age? It would be hard to translate a finding like that into operationalizable actions for Advocates.

Our Findings

Over the summer, our team worked through the data to distill insight, discovering findings that Health Leads can use to improve their practice.

For example, we developed a “Patient Complexity Index” that tries to capture the probability that a patient will disconnect from Health Leads. We incorporate information about the type of resources this patient requires and historic performance information about the Health Leads clinic where the patient is served. For instance, needs involving employment or housing are typically much harder to resolve than needs around childcare or transportation. The success rates of each of these resource connections also vary per desk. We found that different Health Leads sites specialize in different types of resource connections.

By combining this information, Health Leads can more accurately quantify the complexity of each patient’s case so that more experienced Advocates can work with the patients whose needs are most complex. By doing so, Health Leads can better address each patient’s different circumstances, lowering the chance that they’ll disconnect.
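
As a hedged illustration of how such an index could be computed (with made-up column names, not the team’s actual model), a logistic regression on need type and clinic history yields a fitted disconnection probability that can serve as the complexity score:

cases <- read.csv("health_leads_cases.csv", stringsAsFactors = FALSE)
fit <- glm(disconnected ~ need_type + clinic_success_rate,
           data = cases, family = binomial)
cases$complexity <- predict(fit, type = "response")   # P(disconnect) per patient
head(cases[order(-cases$complexity), c("need_type", "complexity")])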

Patient Needs

A Need Complexity Index can help quantify the difficulty of these patients’ needs

Furthermore, Health Leads currently standardizes the intervals at which Advocates call patients: a minimum of once every 10 days. The findings from the data confirmed previous Health Leads research that Advocates should try to get in touch with patients frequently in the beginning stages of building a relationship. When an Advocate successfully contacts a client in the first month, that one successful phone call significantly decreases the likelihood of disconnection:

Call Frequency

Health Leads should call new clients frequently in the first month
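
A hedged sketch of the underlying comparison (with an illustrative per-patient schema, not Health Leads’ actual data): compare disconnection rates between patients who were successfully reached in their first month and those who were not.

clients <- read.csv("clients.csv", stringsAsFactors = FALSE)  # one row per patient: reached_in_first_month (0/1), disconnected (0/1)
aggregate(disconnected ~ reached_in_first_month, data = clients, FUN = mean)
# one disconnection rate per group; the reached group should show a lower rate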

Conclusion

We presented our findings and models to Health Leads at the end of this summer, and our results validate Health Leads’ emphasis on regular follow up. We believe that the information we provided reinforces organizational strategies that can increase client engagement: calling clients regularly and leveraging communication tools such as text messaging. By investigating the different factors influencing a patient’s likelihood to disconnect, our team’s findings have pointed to important steps that Health Leads can continue to take to ensure that more people get the resources they need for a healthy life.


I write about data science applied to social causes. If you want to be notified when my next post is published, subscribe by clicking here.



Source: Carl Shan – Weeks 7-12: Summer Wrapup

Week 6: Progress Thus Far

12 July 2014 – Chicago

This is the sixth in a series of posts chronicling my reflections on participating in the 2014 Data Science for Social Good Fellowship at the University of Chicago.
You can read my last post here:

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


Throughout the DSSG Fellowship, it’s been clear that my team is quite unique — unlike other groups, we were tasked with two separate projects with two different partners: Health Leads and The Chicago Alliance to End Homelessness.

However, after spending a few weeks wrestling with the challenges of context-switching between different projects, tangoing with multiple parties through different communication channels, and wading through raw and smelly data, our team decided to break up into two sub-groups that would each tackle a different project.

I ended up gravitating toward the problems presented by Health Leads.

Now that it’s been just over six weeks since the Fellowship began, it would be a worthwhile reflection to assess what we’ve been able to accomplish up to this point, and what is still left to be done.

Health Leads

Health Leads

The Goal

I’ve written briefly about Health Leads before, recounting the story of how Rebecca Onie founded the organization upon discovering the hidden link between social services and debilitating health conditions. To briefly summarize Health Leads’ mission: many health clinic patients experience health concerns brought on more by social ills than medical ones. Asthma can be treated with medication, but not if the root cause is a mold-infested apartment.

Health Leads trains college students to work with patients referred by health service providers, identifying each patient’s needs and working with them to acquire the resources that satisfy those needs.

Unfortunately, Health Leads is seeing a large number of their patients drop off. After one or two successful contacts, many patients stop returning phone calls. They don’t reply to emails and may live transitory lives, rendering direct mail a difficult channel for reaching them.

Our team is sifting through the collection of interaction data Health Leads has provided us and bringing to light the possible reasons why a patient may disengage. In the end, we also hope to provide insights as to how Health Leads could direct their energy and activities to boost patient responsiveness in ways that increase the chances patients will receive the resources they need.

The Challenges

What we quickly realized upon tackling this project was that Health Leads had yet to seriously determine exactly what it meant for a patient to be “engaged.” To be fair, even in the world of technology product management this definition can be difficult to pin down. Groupon and Zynga, both struggling companies, certainly saw high usage and engagement numbers in their heyday. However, unlike in the web world where companies can track every iota of data down to the click, nonprofits oftentimes have to make do with infrequently collected data that must be actively (and sometimes painfully) recorded rather than passively accumulated.

Translating this into practice packs a painful twofold punch. Not only do we not have a great deal of data (our entire dataset totals less than 250 MB), but a large portion of it is afflicted with data quality issues. We see fields with low coverage, data clearly generated by user error, and other cleanliness issues that raise our eyebrows.

All this presents a rather challenging scenario. After all, it’s hard to do data science without good data.

In addition to data concerns, I also mentioned earlier that nailing down the exact definition of engagement is proving to be a challenge. The difficulty lies in translating a nebulous human intuition into some rigorous formulation. If we were to proceed on the wrong calculation of engagement, any statistical machine learning methods we build to model it would become suspect.

Our team had initially run a logistic regression attempting to predict outcome as a function of responsiveness, only to discover that my calculation of responsiveness was off. However, upon recalculating it, I learned that the accuracy of my predictions was actually higher with the erroneous calculations, presenting quite a conundrum.
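
For illustration, a minimal sketch of such a first model (column names are assumptions, not the team’s actual code): a single-variable logistic regression and a naive accuracy check.

patients <- read.csv("patients.csv", stringsAsFactors = FALSE)
fit <- glm(successful_outcome ~ responsiveness, data = patients, family = binomial)
pred <- predict(fit, type = "response") > 0.5   # classify at the 0.5 cutoff
mean(pred == patients$successful_outcome)       # naive accuracy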

Furthermore, even beyond the practical implementation concerns our team has, there are higher-level questions we’re asking ourselves. Namely, we’re questioning the underlying assumption of the entire problem: does higher engagement actually increase the chance of a successful patient outcome?

After all, if the answer is a resounding ‘No’, then the entire foundation upon which we’ve been working crumbles into sand. Unfortunately, we are finding some small inklings possibly pointing in this direction. Tentatively, we believe this surprising finding is due more to low-quality data and an iffy definition of engagement than to actual causal processes in the real world. Nevertheless, it raises a red flag in our minds.

Finally, one last challenge may be that the factors we end up discovering as influencing patient outcomes are outside of Health Leads’ control. Perhaps the most significant indicators of patients successfully acquiring necessary resources are variables such as gender or age, which would be quite difficult for Health Leads to translate into operationalizable steps their Advocates can take.

The Adventure Continues

The previous section might have come off as pessimistic, but I didn’t mean it to be. Reviewing them, none of the challenges on my list is insurmountable or a dead end. In fact, there are a number of reasons to be quite positive when thinking about what I, as a data scientist, can do to help Health Leads achieve their vision of creating a healthcare system in which all patients’ basic resource needs are adequately addressed.

For starters, our team has already started to think about ways to more carefully redefine and incorporate engagement as a measurement of patient outcome. We think that our initial finding of a disconnect between patient outcome and engagement is due more to faulty wiring at the definition level than to an actual lack of relationship between the two factors.

With Health Leads’ help, we’re also thinking of ways of engineering more substantive and accurate features from the data we have available, features that can paint a more nuanced and informative story about how a patient traverses the process of getting the resources they need. As an example, one road we’re exploring is, rather than summarizing engagement in one single number averaged across multiple interactions, to vectorize engagement by calculating it at various points throughout a patient’s relationship with Health Leads.[1]
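
A hedged sketch of that vectorized engagement (illustrative schema: one row per contact attempt with patient_id, date, and a reached flag; not our actual pipeline): compute one engagement rate per patient per 30-day window instead of a single overall average.

contacts <- read.csv("call_log.csv", stringsAsFactors = FALSE)
contacts$date <- as.Date(contacts$date)
first_contact <- ave(as.numeric(contacts$date), contacts$patient_id, FUN = min)
contacts$window <- (as.numeric(contacts$date) - first_contact) %/% 30   # 30-day windows since first contact
engagement <- aggregate(reached ~ patient_id + window, data = contacts, FUN = mean)
head(engagement)   # one engagement value per patient per window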

Even if engagement ends up proving a dud — bearing little to no predictive significance on a patient’s outcome — this itself would be a landmark discovery for Health Leads. And based upon their stellar team and impressive organizational quality that I’ve observed up to this point, I have no doubt that they’ll thoughtfully incorporate this finding into their model so as to better continue serving the health and social service needs of individuals all over America.

Footnotes

[1] Another opportunity we’ll be exploring as we continue working with Health Leads will be to directly predict the outcome of a patient, rather than using engagement as a proxy.


I write posts about data science applied to social causes. If you want to be notified when my next reflection is published, subscribe by clicking here.



Source: Carl Shan – Week 6: Progress Thus Far

Week 5: Learning and Doing

5 July 2014 – Chicago

This is the fifth in a series of posts chronicling my reflections on participating in the 2014 Data Science for Social Good Fellowship at the University of Chicago.
You can read my last post here:

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


My number one goal in participating in the 2014 DSSG Fellowship was to create something of value. By something valuable, I had in mind a piece of software that helped satisfy the needs of the nonprofit partners I would work with.

Now halfway through the Fellowship, I’ve increasingly noticed a tension between my goal and that of the Fellowship. As mentioned in my initial reflections about DSSG, the goals of the Fellowship were focused around:

“(a) helping Fellows learn how to use various techniques and tools and (b) developing each Fellow’s interests in working towards social good, open government and open science.”

Both of these are also priorities of mine, but I believe they could both be attained primarily through creating a valuable piece of software. Rather than through listening to lectures or reading, I find myself learning most productively (as measured by the amount of content retained per unit of time) when I have the chance to apply new techniques in practice.[1] Being exposed to and navigating this tension has made me more aware of how the different goals of learning versus doing translate into mindsets and behaviors.

As someone who came into the Fellowship leaning more towards the doing camp, my mindset towards problem-solving is to focus on effectiveness, and not necessarily on efficiency. I work with the implicit assumption that my code won’t be as clean or properly abstracted as I would like it to be. I take an iterative approach where I take small stabs at the problem and refine my code as I build up my understanding of the problem.

When I come across edge cases in the data (e.g., a client who has a negative age, or a field with multiple values that really mean the same thing), I put aside my curiosity to dig further, make a mental note to explore it later and exclude this data from my analyses. With less hand-wringing about how to deal with strange outliers or edge cases, I default towards simplicity and building the least complex model. In fact, the first model my team and I looked at for Health Leads was one of the simplest possible: a single-variable logistic regression.

As a result of prioritizing doing over learning, I work primarily in IPython Notebook, a web-based interactive Python environment. Only after properly mapping out and charting the territory of the problem do I then try to translate the code I’ve hacked together into more cleanly abstracted modules and scripts.[2]

In contrast to my attitude when my aim is doing, when I am optimizing for learning I focus on efficiency of process rather than effectiveness. Ira Glass, the host of the spectacular radio show This American Life, once said that it’s your taste, combined with the relentless effort to materialize it in your work, that will propel you to greatness.

When I’m optimizing for learning, I’m driven more by the curiosity to understand than by the goal of achieving. I work more slowly, pausing to try to understand the edge cases. By poking around the shells of these outliers, searching for cracks, I end up discovering potholes to fill up in my knowledge. In this state of mind, progress on my work feels slower, but the density of learning is much higher.

Putting these thoughts into the framework of cognitive theories of behavior, I suspect that prioritizing learning over doing aligns your mindset with an attitude of deliberate practice, key to becoming great at your craft.

In this way, learning is complementary to doing. A strong burst of effort towards learning the fundamentals, the shortcuts and the heuristics are the precursors to getting a ton done.

The Learning/Doing Curve

As the graph I made above shows, I suspect the beginning of a new project will require a heavy commitment toward learning. However, each dip in the curve represents a cycle of work aimed at meeting a project deadline, which falls at the trough of the curve. The oscillations that occur afterwards represent the various stumbling blocks encountered during the course of a project.

As I progress I’ll have to keep in mind just how much I want to optimize towards learning versus doing, and try to feel out where the various inflection points are.

Footnotes

[1] I’ve noticed that I’ve been in the state of ‘flow’ more often when I’m in the act of creation, such as through writing or coding, than when I’m passively absorbing information, such as through watching a talk. One intermediate point between these two different ends is that I’ve also noticed I can easily go into a state of ‘flow’ when I’m absorbing information through reading. However even when I read I find myself underlining key passages, jotting notes and counterpoints and actively thinking about the content.

[2] No matter how much I focus on doing, I can’t escape my aesthetic preference for clean and well-factored code. I spent last night obsessing over how to speed up a function. I left the office happily at 1:30am with the code running faster by a factor of about 8.


I write weekly posts about data science applied to social causes. If you want to be notified when my next reflection is published, subscribe by clicking here.



Source: Carl Shan – Week 5: Learning and Doing