Deep Learning Summer School 2016 Videos


Deep Learning Summer School, Montreal 2016 is aimed at graduate students and industrial engineers and researchers who already have some basic knowledge of machine learning (and possibly but not necessarily of deep learning) and wish to learn more about this rapidly growing field of research. If that is you, there are plenty of videos to help you learn more.



Source: 101 DS – Deep Learning Summer School 2016 Videos

Predicting Wheat Rust in Ethiopia with the Bill & Melinda Gates Foundation


Cultivated by about five million households, wheat is an important crop in Ethiopia as both a source of income for small farmers and a source of food and nutrition for millions of Ethiopians. Despite the country’s huge potential to grow wheat, the average wheat productivity of 2.5 tonnes per hectare is lower than the global average of 3 tonnes per hectare. This is due in part to recurrent outbreaks of a fungal disease called wheat rust that causes devastating pre-harvest losses.

Several international development agencies have been supporting scientists to study the spread of wheat rust as part of their efforts to increase agricultural productivity and reduce hunger and poverty for millions of farming families in Sub-Saharan Africa. However, it can be challenging to even know where wheat rust croplands are located in Ethiopia, as the field survey data that exists is incomplete and costly to collect.

Given advances in satellite imagery, we wondered – is it possible to detect wheat rust from space so that an early warning system could be developed to predict and prevent future outbreaks?

Last August, we held a DataDive with the Bill & Melinda Gates Foundation to tackle this question and more. Using a combination of survey data, remote sensing data and satellite imagery, a DataDive volunteer team was able to develop a proof of concept statistical model using survey data to distinguish severe yellow rust from no rust (of any type) with about 82% accuracy. A model like this could enable governments, funding agencies and researchers to better detect the spread of the disease and evolution of new strains of pathogens, and more quickly deploy protective measures to help farmers and their communities.

We’re pleased to announce we’re continuing our work with the Bill & Melinda Gates Foundation and will be kicking off a long-term multi-phase project to develop a more accurate predictive model using a combination of satellite imagery, multispectral imaging and computer vision techniques. The goal of the first phase of the project is to find a way to automatically detect wheat cropland.

Satellite Imagery Experts, Join Us!

We’re looking for a team of volunteers, including satellite imagery and machine learning experts, to help work on this project over the next several months. If you have significant experience in these areas and would like to contribute, email Sina Kashuk, DataKind’s Data Scientist managing the project, at sina@datakind.org with details on your background. 



Source: DataKind – Predicting Wheat Rust in Ethiopia with the Bill & Melinda Gates Foundation

Get Involved – Monthly Roundup!


Eager to flex your data skills for good? Each month, we do a roundup of volunteer opportunities through DataKind and other organizations around the world!

Don’t see anything in your area? Check out DataLook’s definitive guide to doing data science for good and our Data4Good Kit for help getting started.

DataKind Opportunities 

Satellite Imagery Volunteers – We need your help on our newest project launching in December with the Bill & Melinda Gates Foundation! Email sina@datakind.org to get involved.

Web Developer Volunteers – We need a front-end web developer to help on one of our DataCorps projects. Email contact@datakind.org to get involved.

SAVE THE DATE! DataDive – March 3-5, New York City
New York – it’s high time for a DataDive! RSVPs to open in January – stay tuned.

SAVE THE DATE! DataDive – April 28-30, North Carolina
We’re co-hosting our first ever North Carolina DataDive. RSVPs to open in February – more details soon!

Upcoming Events and Conferences 

DataKind’s Jake Porway at Stanford Social Innovation Review (SSIR) Data on Purpose/Do Good Data conference – Feb 7-8, Stanford, CA
Join Jake and other data experts, academics, practitioners, and social sector leaders for two days of skillfully led sessions on topics ranging from aligning practice with policy and creating a culture of data to how Silicon Valley is facilitating data practices in civil society.
Learn more >

Beyond DataKind – Our Top Picks To Get Involved 

Data Science for Good: Support America’s Warrior Partnership – Dec 9, College Park, MD
Join Immuta for a hackathon to support America’s Warrior Partnership (AWP). AWP helps communities empower veterans through a community-based program that takes a proactive approach to serving them. Bring your data skills and get ready to dive into datasets to assist AWP in forwarding its vision and goals. Help AWP effectively find veterans in an area, identify factors that lead to more successful outcomes for veterans, better predict needs for follow-up actions, determine the probability of success of various services, and help prevent homelessness.
Sign up >

beyond.uptake Data Fellows Program – Dec 9 Deadline
Social enterprises are attacking some of the biggest problems in the world, but there is a lack of professional development and mentoring for their data professionals. To help, beyond.uptake has introduced a four-month Data Fellows Program designed to connect data leaders at nonprofits with experts in data science, giving them the opportunity to hone their data skills and network with like-minded data-for-good professionals. Apply now!
Apply >

Become a Data & Society Fellow! – Dec 19 Deadline
Our friends at Data & Society are assembling its fourth class of fellows to further its mission of producing rigorous research that can have impact, and supporting and connecting the young but growing field of actors working on the social, cultural, and political effects of data.
Apply >

IBM Watson AI XPRIZE – Jan 19 Deadline
How can artificial intelligence solve the world’s grandest challenges in health, learning, energy, exploration and global development? The IBM Watson AI XPRIZE, a $5 million global competition launched by IBM Watson and XPRIZE to develop life-changing human + AI collaborations, aims to answer this question. Take the challenge!
Register >

The Measured Summit: Measuring the Impact of Social Design on Human Health – Jan 24, New York, NY
Does human centered design lead to better health outcomes? Does it make patients smarter and more informed? Can it make health care companies more innovative and successful? Can it improve delivery of products and services? Find out at The Measured Summit. Join leaders in philanthropy, business, healthcare, research and design as they create a shared approach to understanding how design can become a more powerful tool for systems-level transformation.
Get Tickets >

Become a Data Science for Social Good Fellow! – Jan 31 Deadline
Another friend, University of Chicago’s Data Science for Social Good program, is now recruiting its next class of fellows. Join as a fellow, a mentor, a project manager, or partner!
Apply >

DrivenData Machine Learning Competitions (virtual) – Ongoing
Check out DrivenData’s online challenges, usually lasting 2-3 months, where a global community of data scientists competes to come up with the best statistical model for difficult predictive problems that make a difference.
Sign up >



Source: DataKind – Get Involved – Monthly Roundup!

How We Priced Our Book With An Experiment


How We Priced Our Book With An Experiment

27 May 2015 – Chicago

Summary: We conducted a large experiment to test pricing strategies for our book and came to some very surprising findings about allowing customers to pay what they wanted.

Specifically, we found strong evidence that we should let customers pay what they want, which would help us earn more money and more readers when compared with traditional pricing models. We hope our findings can inspire other authors, musicians and creators to look into pay-what-you-want pricing and run experiments of their own.

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


Introduction: Pay What You Want?

My co-authors (Henry Wang, William Chen, Max Song) and I have been working on our book, The Data Science Handbook, for over a year now. Shortly before launch, we asked ourselves an important question that many authors face: how much should we charge for our book?

We had heard of Pay-What-You-Want (PWYW) models, where readers can purchase the book for any amount they want (or at least above a threshold you set). However, many authors and creators worry that only a small percentage of people will contribute in a PWYW pricing model, and that these contributors will opt for meager amounts in the $1-$5 range.

On the other hand, we also felt that PWYW was an exciting model to try. A PWYW model would allow us to get the book out to as many people as possible without putting the book behind a paywall. We also had an inkling that this experimental pricing model would increase exposure for our book.

So we set out to answer this simple question: how should we price our book?

As practicing statisticians and data scientists, we thought of no better way to decide this than to run a large-scale experiment. The following section details exactly what we tested and discovered.

TL;DR – Letting Customers Pay What They Want Wins the Day

We experimented with 7 different pricing models pre-launch, with our subscriber base of 5,700 people. In these 7 different models, we compared different pricing schemes, including fixed prices at $19 and $29, along with several Pay What You Want (PWYW) models with varying minimum amounts and suggested price points.

Before the experiment began, we had agreed to choose whichever variant maximized the two things we cared about: the total number of readers and net revenue (later on, we’ll explain how we prioritized the two).

Before conducting the experiment, we thought that setting a fixed price at $29, like a traditional book, would lead to the maximum revenue.

After we analyzed our results, to our surprise, we discovered strong statistical evidence that with a PWYW model for our book, we could significantly expand our readership (by 4x!) while earning at least as much revenue (and potentially even more) as either of the fixed-priced variants.

The Prices We Tested: Setting Up Our Experiment

On notation: throughout this post, PWYW models will be described as (Minimum Price/Suggested Price). For example, ($0/$19) means ($0 Minimum Price, $19 Suggested Price).

Throughout the process of promoting The Data Science Handbook, we’ve been continuously gathering email addresses of interested readers through a sign-up page on our website.

We conducted this pricing experiment before the official launch of the book by letting our 5,700 subscribers pre-order a special early release of the book. The following diagram shows our experimental setup:

experiment setup

We started the early release pre-order process on Monday, April 20th. We stopped the pre-orders one week later, so that we could analyze our results.

Through Gumroad, we tracked data on the number of people who landed on each link, whether they purchased, and how much they chose to pay.

Note: To guard against people buying the book who were not originally assigned to that bucket (for example, those who inadvertently stumbled across our links online), we filtered out all email addresses that purchased a book through a variant that they were not explicitly assigned to. This gave us more confidence in the rigor of our statistical analyses.
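As a rough illustration (not the authors’ actual code), this filtering step could be done by joining the Gumroad sales export against the original assignment list; the file and column names below are assumptions:

```python
import pandas as pd

# Hypothetical inputs: the original random assignment of subscribers to variants,
# and the sales export from Gumroad.
assignments = pd.read_csv("assignments.csv")   # columns: email, assigned_variant
purchases = pd.read_csv("gumroad_sales.csv")   # columns: email, variant, price_paid

# Keep only purchases made through the variant the buyer was assigned to.
merged = purchases.merge(assignments, on="email", how="left")
clean = merged[merged["variant"] == merged["assigned_variant"]]
print(f"Kept {len(clean)} of {len(purchases)} purchases")
```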

What We Found: Experiment Results

The roughly 800 users in each of our experimental buckets went through a funnel, where they clicked through the email to visit the purchase page, and then decided whether or not to purchase. We collected data on user behavior in this funnel, as well as the price they paid.

conversion funnel

For each of the experimental variants, we collected data on 6 key metrics (a rough sketch of how metrics like these can be computed follows the list):

  • Email CTR – # of people who clicked through to the purchase page / # of people who received the email. The emails were identical, minus the link and a short section about the price.
  • Conversion Rate – # of purchases / # of people who clicked through to the purchase page
  • Total Sales – # of sales, regardless of whether a reader paid $0 or $100
  • Net Revenue – Total revenue generated, minus fees from Gumroad
  • Mean Sales Price – Average sales price that people paid
  • Max Sales Price – Largest sales price paid in that bucket
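To make the metric definitions concrete, here is a minimal sketch of how they could be computed from per-variant event tables. This is not the authors’ code; the table and column names are assumptions:

```python
import pandas as pd

emails = pd.read_csv("emails_sent.csv")   # columns: email, variant
visits = pd.read_csv("page_visits.csv")   # columns: email, variant
sales = pd.read_csv("sales.csv")          # columns: email, variant, price_paid, gumroad_fee

sent = emails.groupby("variant").size()
clicked = visits.groupby("variant").size()
sold = sales.groupby("variant")

metrics = pd.DataFrame({
    "email_ctr": clicked / sent,
    "conversion_rate": sold.size() / clicked,
    "total_sales": sold.size(),
    "net_revenue": sold["price_paid"].sum() - sold["gumroad_fee"].sum(),
    "mean_sales_price": sold["price_paid"].mean(),
    "max_sales_price": sold["price_paid"].max(),
})
print(metrics.round(2))
```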

Below, you’ll see some plots on how each pricing variant performed on each metric. Each of the seven circles represents a different pricing variant, with the area of the circle being proportional to the magnitude it represents. The larger the circle, the “better” that pricing variant did in terms of our metrics.

The blue circles are the variants that were fixed at $19 and $29. The orange circles are the PWYW variants.

The X-axis of the following plots describes the minimum prices we offered: free, $10, $19 (this was a fixed price), $20 and $29 (also fixed). The Y-axes are the prices we suggested when we were using a PWYW variant: $19 and $29.

pwyw vs fixed

plots

Looking above, it’s no surprise that the PWYW model of ($0/$19) had the highest conversion rate (upper right plot) and, as a result, the greatest number of people who downloaded the book. After all, you can get it for free!

Much to our surprise, many of our readers who got this variant paid much more than $0. In fact, as you can see above in the “Mean Sales Price” plot in the bottom left corner, our average purchase price was about $9. Some readers even paid $30.

To examine the distribution of payments we received for each variant, we also examined the histogram of payments for each of the 5 PWYW variants:

sales distribution

It’s again no surprise to see a large chunk of purchases at the minimum. However, you can also see fairly sizable clumps of readers who paid amounts around $5, $10, $15 and $20 (and even some who paid in the $30-$50 range).

In fact, readers seemed to like paying amounts that were multiples of $5, perhaps because it represented a nice round number.

Surprising Insights on Pay What You Want

You Can Earn As Much from a PWYW model (and possibly more) as from a Fixed Price model

Traditional advice told us that we should price our book at a high, fixed price point, since people interested in advancing their careers will typically pay a premium for a book that helps them do exactly that.

However, our ($0/$19) variant was ranked second in total revenue generated (tying with a fixed price of $29).

net revenue

In fact, if anything, the data lends credence to the belief that you can earn even more from PWYW than from setting a fixed price.

What do we mean by that?

Well, our ($0/$19) variant actually made nearly twice as much money as fixing the price at $19. The difference in earnings was large, and is strong statistical evidence that our book would make more money if we made it free, and simply had a suggested price of $19, than if we had fixed the price at $19.[1]

This was an incredible result, since it suggested that with a PWYW model, we could generate the same amount of revenue as a fixed price model, while attracting 3-4x more readership!
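The post doesn’t say which statistical test was used. As one hedged sketch, a permutation test on per-recipient revenue (counting non-buyers as $0) is a simple way to compare two variants:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(revenue_a, revenue_b, n_permutations=10_000):
    """Two-sided permutation test on the difference in mean revenue per recipient."""
    observed = abs(revenue_a.mean() - revenue_b.mean())
    pooled = np.concatenate([revenue_a, revenue_b])
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(revenue_a)].mean() - pooled[len(revenue_a):].mean())
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Usage: each array holds one revenue figure per email recipient in that bucket,
# with $0 for recipients who did not buy.
# p_value = permutation_test(pwyw_0_19_revenue, fixed_19_revenue)
```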

Higher Suggested Price Didn’t Translate to Higher Average Payments. But…

The “suggested” price didn’t seem to have a large impact on the price people paid. Compare the mean purchase prices between $19 suggested and $29 suggested in both the $0 minimum variants and the $10 minimum variants.

mean sales price

As you can see, moving the suggested price from $19 to $29 in both cases increased average purchase price by only $1.

However, we don’t mean to imply the suggested price had zero effect. In fact, the data lends support to actually having a lower suggested price.

Look at what happened to conversion rates when we changed the suggested price from $19 to $29. In both cases we tested ($0 minimum and $10 minimum), the lower suggested price had a higher conversion rate and ultimately drove more revenue.[2]

Therefore, it seems that even if the average sale price stays roughly the same across suggested prices, total sales increase when you have a lower suggested price. This is perhaps due to certain readers being turned off by a higher suggested price, even if they could get the book for $0.

Just imagine seeing a piece of chocolate being offered for free, but having a suggested price of $100. You might scoff at the absurdly high suggested price and refuse the candy, despite being able to take it for nothing.

On the other hand, if you were offered the same scenario, but this time the free candy had a suggested price of just $0.25, you may see this as fair and be much more inclined to part with your quarter.

Try It Out For Yourself

We think that all of these findings should spur authors and creators to conduct testing on their own product pricing. Gumroad, our sales platform, makes it remarkably easy to create product variants, which you can email out to randomized batches of your followers. Or, you can use the suite of A/B testing tools to ensure that different visitors to your website receive different product links.

By doing so, you may discover that you could reach a larger audience, while also earning higher revenue.

[1] This result just missed the cutoff for statistical significance. The actual p-value comparing $0/$19 with a fixed $19 was 0.057, missing our threshold of 0.05 necessary to qualify as statistically significant. Nevertheless, the very low p-value is a strongly suggestive result in favor of a PWYW model.

[2] Beyond being practically significant, this was also statistically significant with a p-value close to 0.


If you want to be notified when my next article is published, subscribe by clicking here.



Source: Carl Shan – How We Priced Our Book With An Experiment

How Data Science Can Be Used For Social Good


How Data Science Can Be Used For Social Good

08 Jan 2015 – Chicago

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


Introduction

Give Directly

Credit: Google Images

In 2013 Kush Varshney, a researcher at IBM, signed up through a non profit called DataKind to volunteer his technical skills on assisting pro bono projects. DataKind’s flagship program, DataCorps, assembles teams of data scientists to partner with social organizations like governments, foundations or NGOs for three- to six-month collaborations to clean, analyze, visualize and otherwise use data to make the world a better place.

Kush, who holds a PhD in electrical engineering and computer science from MIT, was promptly contacted by DataKind to work on a project with GiveDirectly. He was joined by another team member, Brian Abelson, now a data scientist at an open data search company. The two were brought together to tackle a challenging problem for the nonprofit.

GiveDirectly conducts direct cash transfers to low-income families in Uganda and Kenya through mobile payments. These donations are given with no strings attached, trusting that the poor know best how to use the money effectively. One of the top-rated charities on GiveWell, GiveDirectly has had randomized controlled trials conducted to evaluate the effectiveness of its approach, with strong positive results.

GiveDirectly’s model is to direct cash transfers to villages with a large number of residents living in poverty. However, to assess which villages these are, the organization relied upon staff members to individually visit villages in Uganda and Kenya and assess the relative poverty of the inhabitants.

When I spoke with Kush he described some drawbacks of this method, saying, “This method could be costly in both time required to visit each site, and in using donations to help pay wages for inspections that could otherwise be going directly to the poor.”

Together with GiveDirectly, Kush and Brian sought a better way to accomplish this task.

Enter data science.

What Is Data Science?

Data Science Venn Diagram

Credit: Drew Conway – The Data Science Venn Diagram

Data science is an emerging discipline that combines techniques of computer science, statistics, mathematics, and other computational and quantitative disciplines to analyze large amounts of data for better decision making. The field arose in response to the fast growing amount of information and the need for computational tools to augment humans in understanding and using that data.

Rayid Ghani, Director of the Data Science for Social Good Fellowship and former Chief Scientist for the 2012 Obama campaign, noted that “the power of data science is typically harnessed in a spectrum with the following two extremes: helping humans in discovering new knowledge that can be used to inform decision making, or through automated predictive models that are plugged into operational systems and operate autonomously.” Put plainly, these two ways of using data can be summarized as turning data into knowledge, or converting data into action.

Chiefly responsible for wrangling findings and crafting models using the data is an emerging profession: the data scientist. The “scientist” portion of the title conjures a vision of academia, partially as a result of many data scientists holding advanced STEM degrees, but it also paints a false picture of a data scientist as someone holed up in the research lab of an organization tinkering away on esoteric questions. This view characterizes the data scientist as peering into the depths of “Big Data” in pursuit of knowledge.

Rayid debunks this myth, saying that “frequently, however, the challenge in data science is not the science, but rather the understanding and formulation of the problem; the knowledge of how to acquire and use the right data; and once all that work is done, how to operationalize the results of the entire process.” Accordingly, the real role of a data scientist should be thought of as much more embedded in the core of a company or non profit, directly shaping the scope and direction of the organization’s products and services.

The handiwork of data scientists can be found in a plethora of products we interact with every day. Facebook uses data from each visit to tailor the posts you see in your News Feed. Amazon takes account of what you’ve purchased to recommend other items for purchase. PayPal roots out fraudulent behavior by analyzing the data from seller-buyer transactions.

So far, most of the uses of data science have been towards business objectives. The technology, financial services and advertising industries are rife with opportunities to convert data into profit. But now, more and more innovative social sector organizations like GiveDirectly are catching on to how technology and data science can be used to solve their problems.

Organizations like Rayid’s Data Science for Social Good Fellowship, Y Combinator-backed nonprofit Bayes Impact, and DataKind are popping up to fund, train and deploy excellent data scientists to tackle pressing social issues.

Data Science In Action

In the case of GiveDirectly, Kush and Brian were tasked to use their computational data science skills to help discover where the poorest villages were located, so that donations could be channeled to households with the highest needs.

To do this, Kush and Brian used GiveDirectly’s knowledge that an indication of the poverty of a household is the type of roofing of their home. Kush told me that in Kenya, “poorer families tended to live in homes with thatched roofs. On the other hand, a home with a metal roof typically meant the family was well-to-do enough to purchase a more sturdy shelter.”

Thatched vs. Metal Roofs

Credit: GiveDirectly

Armed with this knowledge, Kush and Brian used Google Maps to extract satellite images of the various villages in Kenya and deployed an algorithm that used the coloring of each roof to determine whether it was made of metal or straw. Doing this across all of the houses in a village could give an estimate of the level of poverty in that village.
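The post doesn’t detail the algorithm, but as an illustrative sketch, a simple color-based heuristic over satellite image crops might look like the following. The brightness rule and threshold here are assumptions, not the actual model:

```python
import numpy as np
from PIL import Image

def classify_roof(image_path, brightness_threshold=140):
    """Crude heuristic: metal roofs tend to appear as bright, reflective patches,
    while thatched roofs appear as darker browns and tans."""
    pixels = np.asarray(Image.open(image_path).convert("RGB"), dtype=float)
    mean_brightness = pixels.reshape(-1, 3).mean()
    return "metal" if mean_brightness > brightness_threshold else "thatched"

# A village-level poverty estimate could then be the fraction of sampled
# rooftops classified as thatched.
```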

In early 2014, GiveDirectly piloted this algorithm to detect poverty levels in 50 different villages in Kenya. It was doing so in one of its largest campaigns, moving $4 million to households all over western Kenya.

By employing Kush and Brian’s algorithm, GiveDirectly eliminated over 100 days of manual village inspections. In doing so, it saved over $4,000, allowing GiveDirectly to fund four more households.

Excited by the potential of data science to help families escape poverty more effectively, GiveDirectly is now in discussions with Kush, Brian and DataKind about how their algorithm can be made even more precise and scaled to additional villages.

Potential To Build The Future

As an increasing volume of information is generated by the world, there will be more opportunities to apply data science towards socially meaningful causes. What if we could help guidance counselors predict which students were the most likely to drop out, and then design successful interventions around them? What if we could improve parole decisions, reduce prison overcrowding and lower recidivism?

Examples of how data science can be applied to the social sector include:

  • Reduce crime and recidivism: Predictive modeling can be used to assess whether an inmate would be likely to reoffend, informing the parole decision.
  • Give tailored feedback and content to students: Adaptive tutoring software can be used to model how much students are learning and understanding, tailoring problems to each student’s level.
  • Spot nutrition deficiencies: Data tools can be built that monitor vitamin and mineral intake, warning users of deficiencies in their dietary and health habits.
  • Prevent shootings early: Network-based analyses of gangs can be used to predict where and when future shootings will occur.
  • Diagnose diseases early on: Genetic, imaging, and EMR data can be leveraged to provide early diagnoses of diseases such as Parkinson’s, M.S., and autism.

It’s clear that we can be optimistic about how data scientists can use the data at their fingertips for social good. As an emerging technological frontier, data science is in a position of immense potential. As a result, there is much to explore about how we can use it to push the human race forward.

References

Targeting direct cash transfers to the extremely poor (2014), Kush Varshney and Brian Abelson


I write about data science applied to social causes. If you want to be notified when my next post is published, subscribe by clicking here.



Source: Carl Shan – How Data Science Can Be Used For Social Good

Weeks 7-12: Summer Wrapup


Weeks 7-12: Summer Wrapup

13 October 2014 – Chicago

This is the final post in a series of posts chronicling my reflections on participating in the 2014 Data Science for Social Good Fellowship at the University of Chicago. While I had intended to post once a week, I ended up falling short of my goal. Work from DSSG piled up, making it tough to write thoughtful posts on a weekly schedule.

Nevertheless, I intend for this to be a wrap-up post that summarizes the work that my team and I did. Reading this will allow you to glean the different experiences, learnings and findings I encountered over the summer.

You can read my last post here:

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


Health Leads

“It is health that is real wealth and not pieces of gold and silver.” – Mahatma Gandhi

Introduction

President Obama’s Affordable Care Act enacted broad reforms across the United States’ healthcare system. While the healthcare landscape has changed drastically, one thing has remained constant: a person’s health is affected significantly by non-medical factors.

For example, a patient with an asthmatic condition caused by a moldy apartment will not be cured simply with better medicine. She needs a better apartment, and yet our health care system is not traditionally set up to handle these non-medical issues.

During this summer’s DSSG Fellowship, our team — Chris Bopp, Cindy Chen, Isaac McCreery, myself and mentor Young-Jin Kim — worked with a nonprofit called Health Leads to apply data science to address these non-medical needs, to help patients get access to basic resources vital for a healthy life.

Health Leads

In 1996, Harvard sophomore Rebecca Onie was a volunteer at Greater Boston Legal Services, assisting low-income clients with housing problems. She found herself speaking with clients facing health issues brought on by their poverty. Some lived in dilapidated apartments, infested with rodents and insects. Others couldn’t afford basic necessities like food. Modern medicine was largely ineffective against these issues. Doctors were trained to treat medical ills, not social ones.

Inspired by her experiences, Rebecca launched a health services nonprofit called Health Leads, which recruits and trains college students to work closely with doctor-referred patients who need basic resources such as food, transportation, or housing. These college students, called “Advocates” in the Health Leads lexicon, learn about each patient’s needs and meticulously dig up resource providers — food banks, employment opportunities, childcare services — that can fulfill them.

In the nearly two decades since Health Leads’ inception, its impact on the health landscape has been tremendous. In 2013 alone, Health Leads Advocates worked with over 11,000 patients to connect them with basic services and resources.

The Problem

Serving a predominantly low-income patient population can pose a challenge for Health Leads. Some patients will lack stable, permanent housing or employment. Others may not own a cell phone on which they can be consistently reached. Health Leads noticed that these circumstances affected their work with some patients: despite Advocates’ best efforts, a proportion of their clients would disconnect from working with the program. These clients would be unreachable, not returning phone calls and ultimately Advocates would be forced to close their cases — never knowing if these clients received the basic resources they needed.

Below is an image displaying the phone calls made to a random group of 200 different patients and whether they responded or not. Half of the clients worked with Health Leads through the completion of their case and the other half ultimately disconnected from Health Leads’ program.

Patient Disconnection vs. Success

(The cases with negative days are ones where Health Leads took down the patient’s information but didn’t begin working with them until a few days later.)

Just at a glance, there appear to be pretty clear differences between the two groups. Most obviously, the disconnected patients seem to have many more failed communication attempts (red dots) than successful ones (green dots).

However, Health Leads wanted to know: exactly what are the factors that contribute to a patient disconnecting from Health Leads? How does the difficulty of a patient’s need play into the problem? What other factors might be important to consider?

Against the backdrop of these pressing questions, Health Leads came to our DSSG team to use data to help discover some answers.

The Challenges

When we began tackling the problem, we ran into a slew of challenges. Unlike in the internet world where companies can track every iota of data down to the click, nonprofits serve their clients in person – meaning data must be manually recorded, rather than passively accumulated.

Furthermore, the factors we end up discovering as influencing patient outcomes may be outside of Health Leads’ control. What if we found that the most significant indicators of patients’ success were gender or age? It would be hard to translate a finding like this into operationalizable actions for Advocates.

Our Findings

Over the summer, our team worked through the data to distill insight, discovering findings that Health Leads can use to improve their practice.

For example, we developed a “Patient Complexity Index” that tries to capture the probability that a patient will disconnect from Health Leads. We incorporate information about the type of resources this patient requires and historic performance information about the Health Leads clinic where the patient is served. For instance, needs involving employment or housing are typically much harder to resolve than needs around childcare or transportation. The success rates of each of these resource connections also vary per desk. We found that different Health Leads sites specialize in different types of resource connections.

By combining this information, Health Leads can more accurately quantify the difficulty of each patient so that more experienced Advocates can work with patients with more complex needs. By doing so, Health Leads can better address each patient’s different circumstances, lowering the chance that they’ll disconnect.
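Health Leads’ actual index isn’t spelled out here, but a minimal sketch of combining need type and per-clinic historical resolution rates into a single complexity score might look like the following (the file and column names are assumptions):

```python
import pandas as pd

cases = pd.read_csv("closed_cases.csv")   # columns: clinic, need_type, resolved (0/1)

overall_rate = cases["resolved"].mean()
need_rate = cases.groupby("need_type")["resolved"].mean().to_dict()
clinic_rate = cases.groupby(["clinic", "need_type"])["resolved"].mean().to_dict()

def complexity_index(clinic, need_type):
    """Higher values mean needs of this type have historically been harder to
    resolve at this clinic; falls back to broader rates for unseen combinations."""
    rate = clinic_rate.get((clinic, need_type), need_rate.get(need_type, overall_rate))
    return 1.0 - rate

# e.g. complexity_index("clinic_a", "housing") -> a difficulty score between 0 and 1
```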

Patient Needs

A Need Complexity Index can help quantify the difficulty of these patients’ needs

Furthermore, Health Leads currently standardizes the intervals at which Advocates call patients: a minimum of once every 10 days. The findings from the data confirmed previous Health Leads research that Advocates should try to get in touch with patients frequently in the early stages of building a relationship. When an Advocate successfully contacts a client in the first month, that one successful phone call significantly decreases the likelihood of disconnection:

Call Frequency

Health Leads should call new clients frequently in the first month

Conclusion

We presented our findings and models to Health Leads at the end of this summer, and our results validate Health Leads’ emphasis on regular follow up. We believe that the information we provided reinforces organizational strategies that can increase client engagement: calling clients regularly and leveraging communication tools such as text messaging. By investigating the different factors influencing a patient’s likelihood to disconnect, our team’s findings have pointed to important steps that Health Leads can continue to take to ensure that more people get the resources they need for a healthy life.


I write about data science applied to social causes. If you want to be notified when my next post is published, subscribe by clicking here.



Source: Carl Shan – Weeks 7-12: Summer Wrapup

Week 6: Progress Thus Far


Week 6: Progress Thus Far

12 July 2014 – Chicago

This is the sixth in a series of posts chronicling my reflections on participating in the 2014 Data Science for Social Good Fellowship at the University of Chicago.
You can read my last post here:

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


Throughout the DSSG Fellowship, it’s been clear that my team is quite unique — unlike other groups, we were tasked with two separate projects with two different partners: Health Leads and The Chicago Alliance to End Homelessness.

However, after spending a few weeks wrestling with the challenges of context-switching between different projects, tangoing with multiple parties through different communication channels, and wading through raw and smelly data, our team decided to break up into two sub-groups that would each tackle a different project.

I ended up gravitating toward the problems presented by Health Leads.

Now that it’s been just over six weeks since the Fellowship began, it would be a worthwhile reflection to assess what we’ve been able to accomplish up to this point, and what is still left to be done.

Health Leads

Health Leads

The Goal

I’ve written briefly about Health Leads before, recounting the story of how Rebecca Onie founded the organization upon discovering the hidden link between social services and debilitating health conditions. To briefly summarize Health Leads’ mission: many health clinic patients experience health concerns that are brought on more by social ills than medical ones. Asthma can be treated with medication, but not if the root cause is a mold-infested apartment.

Health Leads trains college students to work with patients referred by health service providers, identifying each patient’s needs and working with them to acquire the resources that satisfy those needs.

Unfortunately, Health Leads is seeing a large number of its patients drop off. After one or two successful contacts, many patients stop returning phone calls. They don’t reply to emails and may live transitory lives, rendering direct mail a difficult channel for reaching them.

Our team is sifting through the interaction data Health Leads has provided us and bringing to light the possible reasons why a patient may disengage. In the end, we also hope to provide insights into how Health Leads could direct its energy and activities to boost patient responsiveness in ways that increase the chances patients will receive the resources they need.

The Challenges

What we quickly realized upon tackling this project was that Health Leads had yet to seriously determine exactly what it meant for a patient to be “engaged.” To be fair, even in the world of technology product management this definition can be difficult to pin down. Groupon and Zynga, both now struggling companies, certainly saw high usage and engagement numbers in their heyday. However, unlike in the web world where companies can track every iota of data down to the click, nonprofits often have to make do with infrequently collected data that must be actively (and sometimes painfully) recorded rather than passively accumulated.

Translating this into practice packs a painful twofold punch. Not only do we not have a great deal of data (our entire dataset totals less than 250 MB), but a large portion of it is afflicted with data quality issues. We see fields with low coverage, data clearly generated by user error, and other cleanliness issues that raise our eyebrows.

All this presents a rather challenging scenario. After all, it’s hard to do data science without good data.

In addition to data concerns, I also mentioned earlier that nailing down the exact definition of engagement is proving to be a challenge. The difficulty lies in translating a nebulous human intuition into a rigorous formulation. If we proceed with the wrong calculation of engagement, any statistical machine learning methods we build to model it become suspect.

Our team had initially run a logistic regression attempting to predict outcome as a function of responsiveness, only to discover that my calculation of responsiveness was off. However, upon recalculating it, I learned that the accuracy of my predictions was actually higher with the erroneous calculation, presenting quite a conundrum.
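For context, a minimal sketch of the kind of single-variable logistic regression described above might look like this; the column names and the exact definition of responsiveness are placeholders, not the team’s actual features:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

patients = pd.read_csv("patients.csv")   # assumed columns
X = patients[["responsiveness"]]         # e.g. fraction of contact attempts answered
y = patients["connected_outcome"]        # 1 = stayed engaged through case closure

model = LogisticRegression()
accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
print(f"Cross-validated accuracy: {accuracy:.2f}")
```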

Furthermore, even beyond the practical implementation concerns our team has, there are higher-level questions that we’re asking ourselves. Namely, we’re questioning the underlying assumption of the entire problem: does higher engagement actually increase the chance of successful patient outcomes?

After all, if the answer is a resounding ‘No’, then the entire foundation upon which we’ve been working crumbles into sand. Unfortunately, we’re finding some small inklings that may point in this direction. Tentatively, we believe this surprising finding is due more to low-quality data and an iffy definition of engagement than to actual causal processes in the real world. Nevertheless, it raises a red flag in our minds.

Finally, one last challenge is that the factors we discover as influencing patient outcomes may be outside of Health Leads’ control. Perhaps the most significant indicators of patients successfully acquiring necessary resources are variables such as gender or age. It would be quite difficult for Health Leads to translate findings like these into operationalizable steps their Advocates can take.

The Adventure Continues

The previous section might have come off as pessimistic, but I didn’t mean it to be. Reviewing them, none of the challenges on my list are insurmountable or dead ends. In fact, there are a number of reasons to be quite positive when thinking about what I, as a data scientist, can do to help Health Leads achieve its vision of creating a healthcare system in which all patients’ basic resource needs are adequately addressed.

For starters, our team has already started to think about ways to more carefully redefine and incorporate engagement as a measurement of patient outcome. We think that our initial finding of a disconnect between patient outcome and engagement is due more to faulty wiring at the definition level than to an actual lack of relationship between the two factors.

With Health Leads’ help, we’re also thinking of ways to engineer more substantive and accurate features from the data we have available that can paint a more nuanced and informative story about how a patient traverses the process of getting the resources they need. As an example, one road we’re exploring is, rather than summarizing engagement as a single number averaged across multiple interactions, to vectorize engagement by calculating it at various points throughout a patient’s relationship with Health Leads.[1]
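As a sketch of what that vectorization could look like, engagement might be computed over successive 30-day windows of each patient’s case rather than as one overall average; the interaction-log columns here are assumptions:

```python
import pandas as pd

interactions = pd.read_csv("interactions.csv", parse_dates=["date", "case_start"])
# assumed columns: patient_id, date, case_start, successful_contact (0/1)

interactions["days_in"] = (interactions["date"] - interactions["case_start"]).dt.days
interactions["window"] = interactions["days_in"] // 30   # 0 = first month, 1 = second, ...

engagement = (
    interactions
    .groupby(["patient_id", "window"])["successful_contact"]
    .mean()                  # fraction of contact attempts that succeeded in each window
    .unstack(fill_value=0)   # one row per patient, one column per 30-day window
)
print(engagement.head())
```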

Even if engagement ends up proving a dud — bearing little to no predictive significance on a patient’s outcome — this itself would be a landmark discovery for Health Leads. And based upon their stellar team and impressive organizational quality that I’ve observed up to this point, I have no doubt that they’ll thoughtfully incorporate this finding into their model so as to better continue serving the health and social service needs of individuals all over America.

Footnotes

[1] Another opportunity we’ll be exploring as we continue working with Health Leads will be to directly predict the outcome of a patient, rather than using engagement as a proxy.


I write posts about data science applied to social causes. If you want to be notified when my next reflection is published, subscribe by clicking here.



Source: Carl Shan – Week 6: Progress Thus Far

Week 5: Learning and Doing


Week 5: Learning and Doing

5 July 2014 – Chicago

This is the fifth in a series of posts chronicling my reflections on participating in the 2014 Data Science for Social Good Fellowship at the University of Chicago.
You can read my last post here:

To get automatically notified about new posts, you can subscribe by clicking here. You can also subscribe via RSS to this blog to get updates.


My number one goal in participating in the 2014 DSSG Fellowship was to create something of value. By something valuable, I had in mind a piece of software that helped satisfy the needs of the non profit partners I would work with.

Now halfway through the Fellowship, I’ve increasingly noticed a tension between my goal and that of the Fellowship. As mentioned in my initial reflections about DSSG, the goals of the Fellowship were focused around:

“(a) helping Fellows learn how to use various techniques and tools and (b) developing each Fellow’s interests in working towards social good, open government and open science.”

Both of these are also priorities of mine, but I believe that they could both be achieved primarily through creating a valuable piece of software. Rather than by listening to lectures or reading, I find myself learning most productively (as measured by amount of content retained per unit of time) when I have the chance to apply new ideas in practice.[1] Being exposed to and navigating this tension has made me more aware of how the different goals of learning versus doing translate into mindsets and behaviors.

As someone who came into the Fellowship leaning more towards the doing camp, my mindset towards problem-solving is to focus on effectiveness, and not necessarily on efficiency. I work with the implicit assumption that my code won’t be as clean or properly abstracted as I would like it to be. I take an iterative approach where I take small stabs at the problem and refine my code as I build up my understanding of the problem.

When I come across edge cases in the data (e.g., a client who has a negative age, or a field with multiple values that really mean the same thing), I put aside my curiosity to dig further, make a mental note to explore it later and exclude this data from my analyses. With less hand-wringing about how to deal with strange outliers or edge cases, I default towards simplicity and building the least complex model. In fact, the first model my team and I looked at for Health Leads was one of the simplest possible: a single-variable logistic regression.

As a result of prioritizing doing over learning, I work primarily in IPython Notebook, a web-based interactive Python environment. Only after properly mapping out and charting the territory of the problem do I try to translate the code I’ve hacked together into more cleanly abstracted modules and scripts.[2]

In contrast to my attitude when my aim is doing, when I am clearly optimizing for learning, I focus on efficiency of process rather than effectiveness. Ira Glass, the host of the spectacular radio show This American Life, once said that it’s your taste, combined with the relentless effort to materialize it in your work, that will propel you to greatness.

When I’m optimizing for learning, I’m driven more by the curiosity to understand than by the goal of achieving. I work more slowly, pausing to try to understand the edge cases. By poking around the shells of these outliers, searching for cracks, I end up discovering potholes to fill up in my knowledge. In this state of mind, progress on my work feels slower, but the density of learning is much higher.

Putting these thoughts into the framework of cognitive theories of behavior, I suspect that prioritizing learning over doing aligns your mindset with an attitude of deliberate practice, a key to becoming great at your craft.

In this way, learning is complementary to doing. A strong burst of effort towards learning the fundamentals, the shortcuts and the heuristics is the precursor to getting a ton done.

The Learning/Doing Curve

As the graph I made above shows, I suspect the beginning of a new project will require a heavy commitment towards learning. However, each of the dips in the curve represents a cycle of work that attempts to meet a project deadline happening at the trough of the curve. The oscillations that occur afterwards represent the various stumbling blocks that you encounter during the course of a project.

As I progress I’ll have to keep in mind just how much I want to optimize towards learning versus doing, and try to feel out where the various inflection points are.

Footnotes

[1] I’ve noticed that I’ve been in the state of ‘flow’ more often when I’m in the act of creation, such as through writing or coding, than when I’m passively absorbing information, such as through watching a talk. One intermediate point between these two different ends is that I’ve also noticed I can easily go into a state of ‘flow’ when I’m absorbing information through reading. However even when I read I find myself underlining key passages, jotting notes and counterpoints and actively thinking about the content.

[2] No matter how much I focus on doing, I can’t escape my aesthetic preference for clean and well-factored code. I spent last night obsessing over how to speed up a function. I left the office happily at 1:30am with the code running faster by a factor of about 8.


I write weekly posts about data science applied to social causes. If you want to be notified when my next reflection is published, subscribe by clicking here.



Source: Carl Shan – Week 5: Learning and Doing