Celebrating Women’s Day: 33 Women in Data Science from around the World & AV Community


Introduction She believed she could, so she did. This Women’s Day we are celebrating the power of women. We are celebrating all those women who …

The post Celebrating Women’s Day: 33 Women in Data Science from around the World & AV Community appeared first on Analytics Vidhya.



Source: Vidhya – Celebrating Women’s Day: 33 Women in Data Science from around the World & AV Community

Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning


Introduction Optimization is always the ultimate goal, whether you are dealing with a real-life problem or building a software product. I, as a …

The post Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning appeared first on Analytics Vidhya.



Source: Vidhya – Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning

How to read most commonly used file formats in Data Science (using Python)?


Introduction If you have been part of the data industry, you would know the challenge of working with different data types. Different formats, different compression, …

The post How to read most commonly used file formats in Data Science (using Python)? appeared first on Analytics Vidhya.



Source: Vidhya – How to read most commonly used file formats in Data Science (using Python)?

#GivingTuesday DataDive Capacity


Thank you for your interest in joining us at the #GivingTuesday DataDive March 3-5 in partnership with 92Y and the Bill and Melinda Gates Foundation! Together, we’ll be using data to unravel tough questions and prototype new solutions to support social change through increased philanthropic giving. Because we may have a full house this weekend, please continue to check this blog for the latest updates on event capacity!

We’ll update the text below and the image above to let you know if we’re full or if we still have room for more DataDivers to attend.

 

RIGHT NOW WE ARE….

 

ANXIOUSLY AWAITING FRIDAY MARCH 3RD!

Doors open at 6:00pm!

 

What’s this #GivingTuesday DataDive all about?

#GivingTuesday is a movement to celebrate giving of all kinds. Founded by 92Y in 2012 and celebrated on the Tuesday after Thanksgiving, #GivingTuesday inspires people around the world to take collaborative action to improve their local communities and contribute in countless ways to the causes they believe in. On #GivingTuesday 2016, individuals, corporations and civic coalitions raised over $170 million to benefit a tremendously broad range of causes, and gave much more in volunteer hours, nonmonetary donations, and acts of kindness.

While #GivingTuesday’s reach has grown significantly over the past five years, philanthropic giving in the U.S. still has not risen above 2% of GDP. If we could increase giving by even 1%, the impact would be massive – almost $4 billion of additional funding for causes addressing tough social issues, from poverty to healthcare to education and more. To understand what might motivate more people to give, volunteers will dive into data from #GivingTuesday 2016 to generate insights for a report that will be shared publicly. Philanthropic giving is what fuels social change – lend your skills to help unleash even more of this critical resource.

Collaborate and engage with some of the brightest minds in data science, social change and technology as you work in teams to analyze, visualize, and mashup fascinating data sets to create real world change. We believe data has the power to change the world, but only when we all work together. Join us for a data adventure like you’ve never seen and get ready to make friends, build skills and help unleash the power of data to serve humanity!



Source: DataKind – #GivingTuesday DataDive Capacity

Introductory guide on Linear Programming for (aspiring) data scientists


Introduction Optimization is the way of life. We all have finite resources and time and we want to make the most of them. From …

The post Introductory guide on Linear Programming for (aspiring) data scientists appeared first on Analytics Vidhya.



Source: Vidhya – Introductory guide on Linear Programming for (aspiring) data scientists

5 More Deep Learning Applications a beginner can build in minutes (using Python)


Introduction Deep Learning is fundamentally changing everything around us. A lot of people think that you need to be an expert to use the power of …

The post 5 More Deep Learning Applications a beginner can build in minutes (using Python) appeared first on Analytics Vidhya.



Source: Vidhya – 5 More Deep Learning Applications a beginner can build in minutes (using Python)

DataDiving to Support Youth with the Annie E. Casey Foundation


In early December, we packed our bags to host a DataDive with DataKind DC in partnership with the Annie E. Casey Foundation, an organization devoted to developing a brighter future for millions of children at risk of poor educational, economic, social and health outcomes. What made this DataDive special is that all the teams worked on challenges focused on protecting and improving the lives of at risk children and young adults, and in some cases they even used the same datasets. It was also unique in that teams were able to get input from youth experts and students from Code in the Schools, a nonprofit dedicated to teaching programming to students in Baltimore.

A huge thanks to the approximately 100 volunteers who came together ready to roll up their sleeves and dive into the data, as well as our inspiring project champions who are doing such critical work to help support children at risk. We are also grateful to Allegheny County and the Philadelphia Youth Network for sharing their data and expertise throughout the whole process.

 

Helping Children in Foster Care in Allegheny County, Pennsylvania

This visual shows the impact of a child’s age on having a successful exit from the foster care system in Allegheny County, Pennsylvania. The visual was created by high school students from Code in the Schools who had never worked with data science before. With coaching from our DataKind DC Chapter Leaders, they learned onsite how to produce visualizations like this.

 

Children have more successful outcomes when they are in stable, loving families, but too often children in foster care move from home to home or are placed in group homes. According to a report by the Annie E. Casey Foundation, children in group homes were more likely to test below or far below basic in English and mathematics, more likely to drop out of school, and less likely to graduate from high school than children placed with families. Given these considerations, volunteers worked to optimize a child’s potential for a successful initial placement.

When volunteer data scientists at the DataDive began to look at the data regarding movement between placements, it was important to incorporate several contextual factors. Sometimes children move for a “positive” reason, such as moving to live with a relative. When the reason is “negative,” such as when a foster parent decides they can’t handle a child’s behavior and requests the child be moved, it’s called a disruption. Some moves fall into a grey area that is not clearly good or bad, such as when a child is moved from a traditional foster home to a therapeutic foster home due to a need for health-related treatment. No matter the reason, these moves can be traumatic for children and typically have a negative effect on a child’s behavior. Minimizing such moves, and increasing the likelihood that a child will be placed with a family rather than in a group home, is critical to providing children with a stable environment and increasing their chances of a successful exit from foster care.

Two teams set out to see how data on foster care placements in Allegheny County, which includes Pittsburgh, could help prevent mismatched foster placements and minimize moves overall for children in foster care.

 

Improving Foster Care Placements

Many children who enter into foster care are placed into homes based on immediacy of availability instead of fit, leading to a potential mismatch between children and their home environment. When a mismatch takes place, children may end up being moved repeatedly, with some placed in homes far away from their family, schools, community, courts and other support systems critical to their success.

Led by Data Ambassadors Janet Montgomery and Abhishek Sharma, a volunteer team of data scientists, and a small cohort of high school students studying coding, worked to uncover trends and insights about placements within the Allegheny County foster care system as a first step towards creating a matching placement algorithm and application for children entering the foster care system to improve the quality of initial placements.

The team discovered a number of insights including the impact a child’s age has on successfully exiting the foster system, as shown above. In addition, they mapped where children were being removed from homes in Allegheny County compared to where facilities were located and looked at which types of foster facilities might be leading to more mismatches. This was an exploratory analysis and further investigation is needed, but the team’s work provides a strong foundation for the future development of a matching algorithm. They successfully identified what characteristics could be used for predictive analytics to flag which children entering the system may be at greater risk for removal and therefore in need of extra support to succeed. Having better placements up front would mean more stability for children and hopefully a smoother return home, with their kin or legal guardians.

 

Reducing Foster Care Placement Moves

Each of these graphs shows the movement of a different child through multiple placements with different families. The volunteer team identified the four basic patterns, represented above, that children tend to follow.

 

While some data is captured when a child gets moved from one placement to another, the reason for the move is not always documented, which makes it difficult to know when a child might be in need of extra support. For instance, disruptive moves might signal a mismatched placement, while positive moves might signal that a correct placement has been attained. In some cases, where an ideal placement isn’t available, placing a child near critical support systems might be a suitable, if imperfect, alternative. If move types were better classified, caseworkers would have greater insight into how best to support children in foster care and potentially predict when a move is likely so they could intervene.

Data Ambassadors Ravi Solter and Sharang Kulkarni led a team to understand and potentially discover some overarching reasons that might explain disruptions. They also wanted to understand what might influence the likelihood that a child will have a “positive exit” from foster care overall. The team dove in, analyzing and visualizing over 14,000 cases of children switching placements. They confirmed Allegheny County’s hunch that a child’s race and age indeed have a significant impact on whether or not they successfully exit foster care. Gender also has a significant impact: boys have a higher percentage of good placement exits than girls. Only about 50% of girls have good exits, versus almost 70% for boys (as shown in the graph below).

 

 

This graph shows the percentage of good and bad exit outcomes by gender.

 

This analysis is an important first step for caseworkers and child service agencies to better understand what factors make a disruption likelier so they can make better initial placements.

 

The Philadelphia Youth Network – Helping Young People Find Early Employment for a Strong Long-term Career

Studies have shown that youth who do not have early work experiences are more susceptible to unemployment in the future and are less likely to achieve higher levels of career attainment. The Philadelphia Youth Network (PYN) aggregates outcomes of youth enrolled in a variety of employment programs across different service providers. They wanted to understand what types of employment, wages, sectors, earnings, hours and other factors help young people achieve success and stability in their careers and what kinds of employment programs are best suited for different kinds of backgrounds.

Led by Data Ambassadors Nick Becker and Helen Wang, the team set out to analyze PYN’s data to provide insight into which employment programs are successful overall, which are successful for some groups, and which factors drive success. Jobs assigned to students in an employment program are typically designed to be 120 hours over the course of six weeks, with a student “passing” the program if they’ve worked a minimum of 86 hours. The team found that the program was actually getting more successful over time.

 

This graph shows that the percentage of students in employment programs who have worked over 86 hours in their job assignments has been increasing; the program is becoming more successful over time.

 

The team also explored how demographics of the students, the length of job placement, what month the job starts and more were affecting success rates. They suggested future analysis on the students who don’t repeat the program to understand if it’s because they were unsuccessful or because they are going to school or found long-term employment. With better information about their employment programs, the Philadelphia Youth Network will be able to offer more targeted programs to help even more children achieve positive outcomes in their adult lives.

 

Annie E. Casey Foundation – Connecting Public Data Systems to Better Understand System-Involved Youth 

Youth who become part of the child welfare system are more likely to run away or become homeless; youth who age out of foster care face high risks of homelessness; and mental health issues are more common among homeless youth. While these issues are interconnected, youth service programs and agencies often do not share data with each other, making it difficult to view all aspects of a young person’s risks for homelessness and other negative outcomes. Inspired by Allegheny County’s integrated data system, the Annie E. Casey Foundation wondered how other municipalities might adopt a similar integrated data system to show a more comprehensive picture of youth and help agencies better support young people involved in multiple systems.

Led by Data Ambassadors Greg Matthews and Aimee Barciauskas, the volunteer team aimed to explore the benefits of using an integrated data system for Allegheny County’s Office of Child, Youth, and Families by describing the populations of children and youth who have received services from their Behavioral Health and Homelessness programs.

The team produced population profiles of young people who have used each service or both services, and described the groups of individuals who reappear between and within the same services.

 

This diagram shows the overlap of services used by young people – “bhs” or Behavioral Health Services, “shelt” or Homeless Services and “cyf” or Child Welfare Services.

 

 

These two graphs show the demographics of youth clients who used all services vs. Child Welfare services only.

 

The team also created an interactive tool to describe the youth clients and mapped their pathways through the systems over time.

These are two screenshots from the interactive tool. The top presents demographic information on the youth clients. The bottom shows how youth clients move through the systems over time.

 

The team recommended that the Annie E. Casey Foundation leverage data visualization tools for deeper exploration and more consistently categorize behavioral health services to allow for more robust analysis in the future. The Foundation is hoping to ultimately persuade other jurisdictions to link their disparate data sources on youth in an integrated data system like Allegheny County’s, allowing better monitoring of the risks that young people face and potentially improving targeted services to prevent disconnection.

 

Thank You!

Big thanks to all the volunteers who joined us for an inspiring weekend using data to support America’s youth – especially those who drove down from New York to be there! We are also grateful to Allegheny County and the Philadelphia Youth Network for sharing their data and expertise throughout the whole process. And a special shout out to the youth experts and Code in the Schools students who shared their wisdom and donated their time to give the teams context and help inform their analysis. And, of course, a sincere thanks to the Annie E. Casey Foundation for their generous support to make the weekend possible and the expertise they offered from their many years dedicated to building a brighter future for young people. Collaborations like this, bringing experts together across sectors, ages and geographies, are exactly what make new solutions possible, so we are grateful to everyone who joined us to make the weekend a success!

 



Source: DataKind – DataDiving to Support Youth with the Annie E. Casey Foundation

DataKind Named One of Fast Company’s Top 10 Most Innovative Nonprofits


Big news – DataKind has been named one of Fast Company’s Top 10 Most Innovative Nonprofits for 2017 and we have you to thank.

Part of the magazine’s annual ranking of the World’s 50 Most Innovative Companies, this list honors leading enterprises and rising newcomers that exemplify the best in nimble business and impactful innovation. We were humbled to be recognized in the nonprofit category, alongside truly inspiring organizations like our friend and past project partner GiveDirectly, fellow New York-based The Fund for Public Housing and The Movement for Black Lives just to name a few.

While our stylish orange hoodies almost certainly swayed the judges, we know the real reason we were selected is because of all of you.

Even more than data, our work is about people. As we tell our incredible volunteers and project partners at our community events or before they kick off a project – YOU are DataKind. Our work depends on our global community of over 14,000 socially conscious data scientists, social innovators, subject matter experts, funders and data for good enthusiasts of all stripes.

Thank you for donating your time and talent to apply data science, AI and machine learning in the service of humanity. This honor goes to all of you, dear DataKinders – congratulations!




Source: DataKind – DataKind Named One of Fast Company’s Top 10 Most Innovative Nonprofits

Duplicate Question Detection with Deep Learning on Quora Dataset


Quora recently announced the first public dataset that they have ever released. It includes 404,351 question pairs with a label column indicating whether they are duplicates. In this post, I would like to investigate this dataset and at least propose a baseline method using deep learning.

Besides the proposed method, the post includes some examples showing how to use Pandas, Gensim, Spacy and Keras. For the full code, check Github.

Data Quirks

There are 255,045 negative (non-duplicate) and 149,306 positive (duplicate) instances. This induces a class imbalance; however, given the nature of the problem, it seems reasonable to keep the same bias in your ML model, since negative instances are the more likely case in a real-life scenario.

When we analyze the data, the shortest question is 1 character long (which is stupid and useless for the task) and the longest is 1,169 characters (a long, complicated love affair question). If either question of a pair is shorter than 10 characters, the pair does not make sense for the task, so I remove such pairs. The average length is 59 characters with a standard deviation of 32.

There are two other columns, “q1id” and “q2id”, but I do not really see how they are useful, since the same question used in different rows has different ids.

Some labels are not correct, especially among the duplicates. In any case, I decided to rely on the labels as given and defer any pruning, since it would require hard manual effort.

Proposed Method

Converting Questions into Vectors

Here, I plan to use Word2Vec to convert each question into a semantic vector, then stack a Siamese network on top to detect whether the pair is duplicate.

Word2Vec is a general term for a family of algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:

king - man + woman = queen.

Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents, or feed these vectors into Recurrent Neural Networks for more advanced applications.
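As a quick, hedged illustration (this snippet and the pretrained Google News vectors file are my own example, not something from this post), the analogy above can be checked with Gensim:

    from gensim.models import KeyedVectors

    # Pretrained 300-dimensional vectors; the .bin file must be downloaded separately.
    vectors = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)

    print(vectors.most_similar(positive=['king', 'woman'],
                               negative=['man'], topn=1))
    # -> [('queen', ...)] with these vectors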

There are two well-known algorithms in this domain. One is Google’s network architecture, which learns representations by trying to predict the surrounding words of a target word within a certain window size. GloVe is the other method; it relies on co-occurrence matrices. GloVe is easy to train, and it is flexible enough to add new words outside of your vocabulary. You might like to visit this tutorial to learn more, and check out this brilliant use case, Sense2Vec.

We still need a way to combine word vectors into a single question representation. One simple alternative is taking the mean of all the word vectors in each question. This is a simple but really effective approach for document classification, and I expect it to work for this problem too. In addition, it is possible to enhance the mean-vector representation by using the TF-IDF score defined for each word: we take a weighted average of the word vectors using these scores. This emphasizes the discriminating words and discounts useless, frequent words that are shared by many questions.

Siamese Network

I described Siamese networks in a previous post. In short, it is a two-branch network architecture that takes one input on each side. It projects the data into a space in which similar items are contracted and dissimilar ones are dispersed. It is computationally efficient, since the two branches share their parameters.

Siamese network tries to contract instances belonging to the same classes and disperse instances from different classes in the feature space.

 

Implementation

Let’s load the training data first.
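A minimal sketch with Pandas, assuming the public TSV release of the dataset; the file name and column names are assumptions:

    import pandas as pd

    # Load the Quora question pairs (file name and separator are assumptions).
    df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
    print(df.shape)  # roughly 404,351 rows

    # Drop pairs where either question is shorter than 10 characters, as noted above.
    df['question1'] = df['question1'].astype(str)
    df['question2'] = df['question2'].astype(str)
    df = df[(df['question1'].str.len() >= 10) & (df['question2'].str.len() >= 10)]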

For this particular problem, I train my own GloVe-style model using Gensim.
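Gensim does not ship a GloVe trainer, so the hedged sketch below stands in with its Word2Vec implementation; the tokenization and hyperparameters are assumptions rather than the original script:

    from gensim.models import Word2Vec

    # Naive whitespace tokenization; the original preprocessing may differ.
    sentences = [q.lower().split()
                 for q in df['question1'].tolist() + df['question2'].tolist()]

    # 300-dimensional vectors (in Gensim 4+ the size argument is vector_size).
    model = Word2Vec(sentences, size=300, window=5, min_count=2, workers=4)
    model.save('quora_vectors.gensim')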

The code above trains the model and saves it, generating 300-dimensional vectors for the words. The hyperparameters could be chosen more carefully, but this is just a baseline to gauge initial performance. However, as I’ll show, this model performs below my expectations. I believe this is because the questions are short and do not provide enough semantic structure for the model to learn salient representations.

Due to this performance issue and the observation above, I decided to use a pre-trained GloVe model that comes bundled with Spacy. It is trained on Wikipedia and is therefore stronger in terms of word semantics. This is how we use Spacy for this purpose.
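A sketch of the Spacy route; spacy.load('en') assumes the English model with pretrained vectors is installed:

    import spacy

    nlp = spacy.load('en')  # English model shipping 300-dimensional word vectors

    doc = nlp(u'How can I learn data science?')
    question_vector = doc.vector  # Spacy averages the token vectors, shape (300,)
    print(question_vector.shape)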

Before going further: I really like Spacy. It is really fast and does everything you need for NLP in a flash, hiding many of the intrinsic details. It deserves a good remuneration. Similar to the Gensim model, it also provides 300-dimensional embedding vectors.

The results I get from the Spacy vectors are better than those from the Gensim model I trained, so Spacy is the better base for going further with TF-IDF scoring. For TF-IDF, I used scikit-learn (heaven of ML). It provides TfidfVectorizer, which does everything you need.

After we compute the TF-IDF scores, we convert each question into a weighted average of its word2vec vectors using these scores. The code below does this for just the “question1” column.
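A hedged sketch of that step: fit TfidfVectorizer on the questions, then weight each word's Spacy vector by its IDF score (the variable names are mine, not the original script's):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    questions = df['question1'].tolist()

    tfidf = TfidfVectorizer()
    tfidf.fit(questions)
    # Map each vocabulary word to its IDF weight
    # (get_feature_names_out in newer scikit-learn versions).
    word2weight = dict(zip(tfidf.get_feature_names(), tfidf.idf_))

    q1_vectors = np.zeros((len(questions), 300))
    for i, q in enumerate(questions):
        doc = nlp(q)
        weights = np.array([word2weight.get(w.text.lower(), 1.0) for w in doc])
        word_vecs = np.array([w.vector for w in doc])
        if len(word_vecs) and weights.sum() > 0:
            q1_vectors[i] = np.average(word_vecs, axis=0, weights=weights)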

Now we are ready to create the training data for the Siamese network. Basically, I just fetch the labels and convert the mean word2vec vectors to numpy format. I also split the data into train and test sets.
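A sketch, assuming q1_vectors from above, an analogous q2_vectors for the second column, and an is_duplicate label column:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.stack([q1_vectors, q2_vectors], axis=1)   # shape: (n_pairs, 2, 300)
    y = df['is_duplicate'].values.astype('float32')

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)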

At this stage, we need to define the Siamese network structure. I use Keras for its simplicity. Below is the whole script I used to define the model.
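What follows is a hedged reconstruction rather than the author's exact script: a 3-layer network with Batch Normalization placed before each ReLU (the best setup in the results below), a concatenation of the hidden layers standing in for the residual connections, and Euclidean distance trained with a contrastive loss. The layer sizes are assumptions:

    from keras.layers import (Input, Dense, BatchNormalization, Activation,
                              Lambda, concatenate)
    from keras.models import Model
    import keras.backend as K

    def dense_bn_relu(x, units):
        # BN before the ReLU -- the best-performing order in the Results below.
        return Activation('relu')(BatchNormalization()(Dense(units)(x)))

    def build_base_network(input_dim=300, units=128):
        inp = Input(shape=(input_dim,))
        h1 = dense_bn_relu(inp, units)
        h2 = dense_bn_relu(h1, units)
        h3 = dense_bn_relu(h2, units)
        # "Layer concat": all three hidden layers feed the final embedding.
        return Model(inp, concatenate([h1, h2, h3]))

    def euclidean_distance(tensors):
        a, b = tensors
        return K.sqrt(K.maximum(
            K.sum(K.square(a - b), axis=1, keepdims=True), K.epsilon()))

    def contrastive_loss(y_true, y_pred):
        # Pull duplicates (label 1) together, push non-duplicates apart.
        margin = 1.0
        return K.mean(y_true * K.square(y_pred) +
                      (1.0 - y_true) * K.square(K.maximum(margin - y_pred, 0.0)))

    base = build_base_network()
    input_a, input_b = Input(shape=(300,)), Input(shape=(300,))
    distance = Lambda(euclidean_distance,
                      output_shape=lambda s: (s[0][0], 1))(
                      [base(input_a), base(input_b)])

    model = Model([input_a, input_b], distance)
    model.compile(optimizer='adam', loss=contrastive_loss)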

I share here the best performing network, which uses residual-style connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity, with Batch Normalization per layer. This is particularly important, since the BN layers enhance the performance considerably. I believe they normalize the final feature vectors, and Euclidean distance performs better in this normalized space.

I tried cosine distance, which is theoretically more concordant with Word2Vec vectors, but I could not obtain better results with it. I also tried normalizing the data to unit variance or to unit L2 norm, but nothing gave better results than the original feature values.

Let’s train the network with the prepared data. I used the same model and hyperparameters for all configurations. It is always possible to optimize these further, but the settings here already give promising baseline results.
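A sketch of the training and evaluation calls; the batch size, epoch count and the 0.5 distance threshold are assumptions:

    model.fit([X_train[:, 0], X_train[:, 1]], y_train,
              validation_data=([X_test[:, 0], X_test[:, 1]], y_test),
              batch_size=128, epochs=20)

    # A small learned distance means "duplicate"; threshold it for hard labels.
    distances = model.predict([X_test[:, 0], X_test[:, 1]]).ravel()
    accuracy = ((distances < 0.5).astype('float32') == y_test).mean()
    print('test accuracy: %.3f' % accuracy)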

Results

In this section, I would like to share the test set accuracy values obtained with different model and feature extraction settings. We expect to see improvement over 0.63, since that is the accuracy we get when we predict all labels as 0.

These are the best results I obtained with the different word vector models. They all use the same network and hyperparameters, chosen after tuning on the last configuration listed below.

  • Gensim (my model) + Siamese: 0.69
  • Spacy + Siamese: 0.72
  • Spacy + TF-IDF + Siamese: 0.79

We can also investigate the effect of different model architectures. These values all use the best word2vec setup shown above.

  • 2 layers net: 0.67
  • 3 layers net + adam: 0.74
  • 3 layers resnet (after relu BN) + adam: 0.77
  • 3 layers resnet (before relu BN) + adam: 0.78
  • 3 layers resnet (before relu BN) + adam + dropout: 0.75
  • 3 layers resnet (before relu BN) + adam + layer concat: 0.79
  • 3 layers resnet (before relu BN) + adam + unit_norm + cosine_distance: Fail

Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried introducing Dropout between layers in different positions (before ReLU, after BN, etc.); the best I obtained is 0.75. Concatenating the different layers improves the performance by 1 percent as the final gain.

In conclusion, I have tried to present a solution to this unique problem by composing different aspects of deep learning: we start with Word2Vec, combine it with TF-IDF, and then use a Siamese network to find duplicates. The results are not perfect and leave room for further optimization. However, this is just a small attempt to show the power of deep learning in this domain. I hope you find it useful :).


The post Duplicate Question Detection with Deep Learning on Quora Dataset appeared first on A Blog From Human-engineer-being.



Source: Erogol – Duplicate Question Detection with Deep Learning on Quora Dataset

Dilated Convolution


In simple terms, dilated convolution is just a convolution applied to an input with defined gaps. With this definition, given a 2D image as input, a dilation rate of k=1 is normal convolution, k=2 means skipping one pixel per input, and k=4 means skipping three pixels. It is best to see the figures below, which use these same k values.

The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter, which is 3×3 in this example, and the green area is the receptive field captured by each of these inputs. The receptive field is the implicit area captured on the initial input by each input (unit) to the next layer.

Dilated convolution is a way of increasing the receptive field (global view) of the network exponentially while the number of parameters grows only linearly. Because of this, it finds use in applications that care more about integrating knowledge of a wider context at lower cost.
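As a minimal sketch of the idea (the layer sizes here are arbitrary), Keras exposes this directly through the dilation_rate argument of Conv2D:

    from keras.layers import Input, Conv2D
    from keras.models import Model

    inp = Input(shape=(224, 224, 3))
    # Stacking 3x3 kernels with growing dilation rates: the receptive field
    # grows rapidly while each filter keeps only 3x3 = 9 weights, and
    # padding='same' preserves the spatial resolution.
    x = Conv2D(64, (3, 3), dilation_rate=1, padding='same', activation='relu')(inp)
    x = Conv2D(64, (3, 3), dilation_rate=2, padding='same', activation='relu')(x)
    x = Conv2D(64, (3, 3), dilation_rate=4, padding='same', activation='relu')(x)

    model = Model(inp, x)
    model.summary()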

One common use is image segmentation, where each pixel is labelled by its corresponding class. In this case, the network output needs to be the same size as the input image. A straightforward way to achieve this is to apply convolutions and then add deconvolution layers to upsample [1]. However, this introduces many more parameters to learn. Instead, dilated convolution is applied to keep the output resolution high, and it avoids the need for upsampling [2][3].

Dilated convolution is applied in domains beside vision as well. Good examples are the WaveNet [4] text-to-speech solution and ByteNet [5] linear-time text translation. They both use dilated convolution to capture a global view of the input with fewer parameters.

Figure from [5].

In short, dilated convolution is a simple but effective idea, and you might consider it in the following cases:

  1. Detection of fine details by processing the input at higher resolution.
  2. A broader view of the input to capture more contextual information.
  3. Faster run-time with fewer parameters.

[1] Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1

[2] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR, 1–14. Retrieved from http://arxiv.org/abs/1412.7062

[3] Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR, 1–9. Retrieved from http://arxiv.org/abs/1511.07122

[4]Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499

[5] Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. arXiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099


The post Dilated Convolution appeared first on A Blog From Human-engineer-being.



Source: Erogol – Dilated Convolution