In early December, we packed our bags to host a DataDive with DataKind DC in partnership with the Annie E. Casey Foundation, an organization devoted to developing a brighter future for millions of children at risk of poor educational, economic, social and health outcomes. What made this DataDive special is that all the teams worked on challenges focused on protecting and improving the lives of at-risk children and young adults, and in some cases they even used the same datasets. It was also unique in that teams were able to get input from youth experts and students from Code in the Schools, a nonprofit dedicated to teaching programming to students in Baltimore.

A huge thanks to the approximately 100 volunteers who came together ready to roll up their sleeves and dive into the data, as well as our inspiring project champions who are doing such critical work to help support children at risk. We are also grateful to Allegheny County and the Philadelphia Youth Network for sharing their data and expertise throughout the whole process.

Helping Children in Foster Care in Allegheny County, Pennsylvania

This visual shows the impact of a child’s age on having a successful exit from the foster care system in Allegheny County, Pennsylvania. The visual was created by high school students from Code in the Schools who had never worked with data science before. With coaching from our DataKind DC Chapter Leaders, they learned onsite how to produce visualizations like this.

Children have more successful outcomes when they are in stable, loving families, but too often children in foster care move from home to home or are placed in group homes. According to a report by the Annie E. Casey Foundation, children in group homes were more likely to test below or far below basic proficiency in English and mathematics, more likely to drop out of school and less likely to graduate from high school than children placed with families. Given these considerations, volunteers worked to optimize a child’s potential for a successful initial placement.

When volunteer data scientists at the DataDive began to look at the data regarding movement between placements, it was important to incorporate several contextual factors. Sometimes children move for a “positive” reason, such as moving to live with a relative. When the reason is “negative,” such as when a foster parent decides they can’t handle a child’s behavior and requests the child be moved, it’s called a disruption. There are some moves in a grey area that are not clearly good or bad, such as when a child is moved from a traditional foster home to a therapeutic foster home due to a need for health-related treatment. No matter the reason, the moves can be traumatic for children and typically have a negative effect on a child’s behavior. Minimizing such moves and increasing the likelihood that a child will be placed with a family rather than in a group home is critical in providing children with a stable environment and increasing their chances for a successful exit from foster care.

Two teams set out to see how data on foster care placements in Allegheny County, which includes Pittsburgh, could help prevent mismatched foster placements and minimize moves overall for children in foster care.

Improving Foster Care Placements

Many children who enter into foster care are placed into homes based on immediacy of availability instead of fit, leading to a potential mismatch between children and their home environment. When a mismatch takes place, children may end up being moved repeatedly, with some placed in homes far away from their family, schools, community, courts and other support systems critical to their success.

Led by Data Ambassadors Janet Montgomery and Abhishek Sharma, a volunteer team of data scientists and a small cohort of high school students studying coding worked to uncover trends and insights about placements within the Allegheny County foster care system, a first step towards creating a matching algorithm and application to improve the quality of initial placements for children entering the foster care system.

The team discovered a number of insights including the impact a child’s age has on successfully exiting the foster system, as shown above. In addition, they mapped where children were being removed from homes in Allegheny County compared to where facilities were located and looked at which types of foster facilities might be leading to more mismatches. This was an exploratory analysis and further investigation is needed, but the team’s work provides a strong foundation for the future development of a matching algorithm. They successfully identified what characteristics could be used for predictive analytics to flag which children entering the system may be at greater risk for removal and therefore in need of extra support to succeed. Having better placements up front would mean more stability for children and hopefully a smoother return home, with their kin or legal guardians.

Reducing Foster Care Placement Moves

Each of these graphs shows the movement of a different child as they are given multiple placements with different families. The volunteer team identified four basic patterns, represented above, that children follow.

While some data is captured when a child gets moved from one placement to another, the reason for the move is not always documented, which makes it difficult to know when a child might be in need of extra support. For instance, disruptive moves might signal a mismatched placement, while positive moves might signal that a correct placement has been attained. In some cases, where an ideal placement isn’t available, placing a child near critical support systems might be a suitable, if imperfect, alternative. If move types were better classified, caseworkers would have greater insight into how best to support children in foster care and potentially predict when a move is likely so they could intervene.

Data Ambassadors Ravi Solter and Sharang Kulkarni led a team to understand and potentially discover some overarching reasons that might explain disruptions. They also wanted to understand what might influence the likelihood that a child will have a “positive exit” from foster care overall. The team dove in, analyzing and visualizing over 14,000 cases of children switching placements. They confirmed Allegheny County’s hunch that a child’s race and age indeed have a significant impact on whether or not they successfully exit foster care. Gender also has a significant impact: boys have a higher percentage of good placement exits than girls. Only about 50% of girls have good exits, versus almost 70% of boys (as shown in the graph below).

This graph shows the percentage of good and bad exit outcomes by gender.

This analysis is an important first step for caseworkers and child service agencies to better understand what factors make a disruption likelier so they can make better initial placements.

The Philadelphia Youth Network – Helping Young People Find Early Employment for a Strong Long-term Career

Studies have shown that youth who do not have early work experiences are more susceptible to unemployment in the future and are less likely to achieve higher levels of career attainment. The Philadelphia Youth Network (PYN) aggregates outcomes of youth enrolled in a variety of employment programs across different service providers. They wanted to understand what types of employment, wages, sectors, earnings, hours and other factors help young people achieve success and stability in their careers and what kinds of employment programs are best suited for different kinds of backgrounds.

Led by Data Ambassadors Nick Becker and Helen Wang, the team set out to analyze PYN’s data to provide insight into which employment programs are successful overall, which are successful for some groups, and which factors are driving success. Jobs assigned to students in an employment program are typically designed to be 120 hours over the course of six weeks, with a student “passing” the program if they’ve worked a minimum of 86 hours. The team found that the program was actually getting more successful over time.

This graph shows that the percentage of students in employment programs who have worked over 86 hours in their job assignments has been increasing. The program is becoming more successful over time.

The team also explored how demographics of the students, the length of job placement, what month the job starts and more were affecting success rates. They suggested future analysis on the students who don’t repeat the program to understand if it’s because they were unsuccessful or because they are going to school or found long-term employment. With better information about their employment programs, the Philadelphia Youth Network will be able to offer more targeted programs to help even more children achieve positive outcomes in their adult lives.

Annie E. Casey Foundation – Connecting Public Data Systems to Better Understand System-Involved Youth

Youth who become part of the child welfare system are more likely to run away or become homeless; youth who age out of foster care face high risks of homelessness, and mental health issues are higher in homeless youth. While these issues are interconnected, youth service programs and agencies often do not share data with each other, making it difficult to view all aspects of a young person’s risks for homelessness and other negative outcomes. Inspired by Allegheny County’s integrated data system, the Annie E. Casey Foundation wondered how other municipalities might adopt a similar integrated data system to show a more comprehensive picture of youth and help agencies better support youth involved in multiple systems.

Led by Data Ambassadors Greg Matthews and Aimee Barciauskas, the volunteer team aimed to explore the benefits of using an integrated data system for Allegheny County’s Office of Child, Youth, and Families by describing the populations of children and youth who have received services from their Behavioral Health and Homelessness programs.

The team produced population profiles of young people who have used each service or both services and described the groups of individuals who reappear between and within the same services.

This diagram shows the overlap of services used by young people – “bhs” or Behavioral Health Services, “shelt” or Homeless Services and “cyf” or Child Welfare Services.

These two graphs show the demographics of youth clients who used all services vs. Child Welfare services only.

The team also created an interactive tool to describe the youth clients and mapped their pathways through the systems over time.

These are two screenshots from the interactive tool. The top presents demographic information on the youth clients. The bottom shows how youth clients move through the systems over time.

The team recommended that the Annie E. Casey Foundation leverage data visualization tools for deeper exploration and more consistently categorize behavioral health services to allow for more robust analysis in the future. The Foundation is hoping to ultimately persuade other jurisdictions to link their disparate data sources on youth in an integrated data system like Allegheny County’s, allowing better monitoring of the risks that young people face and potentially improving targeted services to prevent disconnection.

Thank You!

Big thanks to all the volunteers who joined us for an inspiring weekend using data to support America’s youth – especially those who drove down from New York to be there! We are also grateful to Allegheny County and the Philadelphia Youth Network for sharing their data and expertise throughout the whole process. And a special shout out to the youth experts and Code in the Schools students who shared their wisdom and donated their time to give the teams context and help inform their analysis. And, of course, a sincere thanks to the Annie E. Casey Foundation for their generous support to make the weekend possible and the expertise they offered from their many years dedicated to building a brighter future for young people. Collaborations like this that bring together experts across sectors, ages and geographies are exactly what make new solutions possible, so we are grateful to everyone who joined us to make the weekend a success!

Part of the magazine’s annual ranking of the World’s 50 Most Innovative Companies, this list honors leading enterprises and rising newcomers that exemplify the best in nimble business and impactful innovation. We were humbled to be recognized in the nonprofit category, alongside truly inspiring organizations like our friend and past project partner GiveDirectly, fellow New York-based The Fund for Public Housing and The Movement for Black Lives, just to name a few.

While our stylish orange hoodies almost certainly swayed the judges, we know the real reason we were selected is because of all of you.

Even more than data, our work is about people. As we tell our incredible volunteers and project partners at our community events or before they kick off a project – YOU are DataKind. Our work depends on our global community of over 14,000 socially conscious data scientists, social innovators, subject matter experts, funders and data for good enthusiasts of all stripes.

Thank you for donating your time and talent to apply data science, AI and machine learning in the service of humanity. This honor goes to all of you, dear DataKinders – congratulations!

Quora recently announced the first public dataset it has ever released. It includes 404,351 question pairs with a label column indicating whether they are duplicates or not. In this post, I’d like to investigate this dataset and propose at least a baseline method using deep learning.

Besides the proposed method, this post includes some examples showing how to use Pandas, Gensim, spaCy and Keras. For the full code, check GitHub.

Data Quirks

There are 255,045 negative (non-duplicate) and 149,306 positive (duplicate) instances. This induces a class imbalance; however, considering the nature of the problem, it seems reasonable to keep the same bias in your ML model, since negative instances are more likely in a real-life scenario.

When we analyze the data, the shortest question is 1 character long (which is useless for the task) and the longest question is 1,169 characters (a long, complicated love affair question). If either question in a pair is shorter than 10 characters, the pair does not make sense, so I remove such pairs. The average length is 59 characters and the standard deviation is 32.
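As a sketch of this filtering step (on a tiny stand-in frame that uses the dataset’s column names, not the real file):

```python
import pandas as pd

# Toy stand-in for the Quora pairs data; the real data has the same columns.
df = pd.DataFrame({
    "question1": ["?", "How do I learn Python?", "What is AI?"],
    "question2": ["Why?", "How can I learn Python?", "What is artificial intelligence?"],
    "is_duplicate": [0, 1, 1],
})

# Drop pairs where either question is shorter than 10 characters.
mask = (df["question1"].str.len() >= 10) & (df["question2"].str.len() >= 10)
df = df[mask]

print(df["is_duplicate"].value_counts())
print(df["question1"].str.len().describe())
```

On the full dataset the same `describe()` call gives the length statistics quoted above.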

There are two other columns, “qid1” and “qid2”, but I do not see how they are useful, since the same question used in different rows has different ids.

Some labels are not correct, especially among the duplicate pairs. In any case, I decided to rely on the labels and defer pruning, since it would require hard manual effort.

Proposed Method

Converting Questions into Vectors

Here, I plan to use Word2Vec to convert each question into a semantic vector, then stack a Siamese network on top to detect whether the pair is duplicate.

Word2Vec is a general term for a family of algorithms that embed words into a vector space, typically with around 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:

king - man + woman = queen.
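To make the analogy concrete, here is a toy example with hand-crafted 2-D vectors chosen so the relation holds exactly (real embeddings are learned and high-dimensional):

```python
import numpy as np

# Toy 2-D "embeddings" chosen by hand so the analogy holds exactly;
# real Word2Vec vectors are learned from text and typically 100-300 dimensional.
vecs = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([2.0, 0.0]),
    "queen": np.array([2.0, 1.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]  # = [2, 1]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```

The nearest word (by cosine similarity) to king − man + woman is queen, which is exactly how the analogy is usually evaluated on real embeddings.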

Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents or feed these vectors to recurrent neural networks for more advanced applications.

There are two well-known algorithms in this domain. One is Google’s word2vec, which learns representations by trying to predict the words surrounding a target word within a certain window size. GloVe is another method, which relies on co-occurrence matrices. GloVe is easy to train and it is flexible to add new words outside of your vocabulary. You might like to visit this tutorial to learn more, and check this brilliant use-case, Sense2Vec.

We still need a way to combine word vectors into a single question representation. One simple alternative is taking the mean of all word vectors in each question. This is a simple but really effective approach for document classification, and I expect it to work for this problem too. In addition, it is possible to enhance the mean-vector representation using TF-IDF scores defined for each word: we take a weighted average of the word vectors using these scores. This emphasizes the importance of discriminating words and downweights useless, frequent words that are shared by many questions.

Siamese Network

I described the Siamese network in a previous post. In short, it is a two-branch network architecture that takes one input on each side. It projects data into a space in which similar items are pulled together and dissimilar ones are pushed apart. It is computationally efficient since the two branches share parameters.
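As a rough numpy sketch of that idea (a single shared linear+ReLU projection standing in for the real network, and the standard contrastive loss, which the post does not necessarily use):

```python
import numpy as np

# Minimal sketch of the Siamese idea: one shared projection (parameter sharing),
# Euclidean distance between the two projected inputs, and a contrastive loss
# that pulls duplicates together and pushes non-duplicates apart.
rng = np.random.default_rng(0)
W = rng.normal(size=(300, 64))          # shared weights used by BOTH branches

def encode(x):
    return np.maximum(x @ W, 0.0)       # the same encoder processes either input

def contrastive_loss(x1, x2, y, margin=1.0):
    d = np.linalg.norm(encode(x1) - encode(x2))
    # y=1: duplicates -> penalize distance; y=0: push apart up to the margin
    return y * d**2 + (1 - y) * max(margin - d, 0.0)**2

q1, q2 = rng.normal(size=300), rng.normal(size=300)
print(contrastive_loss(q1, q2, y=1))    # large when a duplicate pair is far apart
print(contrastive_loss(q1, q1, y=1))    # 0.0: identical inputs, zero distance
```

In the real model the encoder is a trained multi-layer network, but the distance-and-shared-weights structure is the same.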

Implementation

Let’s load the training data first.
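The released dump is a single tab-separated file. A sketch of the loading step, reading a tiny in-memory sample with the same columns (swap in the real file path for the full data):

```python
import io
import pandas as pd

# In-memory stand-in for the released TSV; pd.read_csv works the same on a path.
sample = io.StringIO(
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tHow do I learn Python?\tHow can I learn Python?\t1\n"
    "1\t3\t4\tWhat is AI?\tWhere is Paris?\t0\n"
)
df = pd.read_csv(sample, sep="\t")
print(df.shape)
print(df.columns.tolist())
```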

For this particular problem, I train my own word-vector model using Gensim.

The above code trains the embedding model and saves it. It generates 300-dimensional vectors for words. The hyper-parameters could be chosen more carefully, but this is just a baseline to see initial performance. However, as I’ll show, this model performs below my expectations. I believe this is because our questions are short and do not provide enough semantic structure for the model to learn salient representations.

Due to this performance issue and the observation above, I decided to use a pre-trained GloVe model that comes free with spaCy. It is trained on Wikipedia and is therefore stronger in terms of word semantics. This is how we use spaCy for this purpose.

Before going further: I really like spaCy. It is really fast and it does everything you need for NLP in a flash, hiding many of the intrinsic details. It deserves a lot of credit. Like the Gensim model, it also provides 300-dimensional embedding vectors.

The results I get with spaCy vectors are better than those from the Gensim model I trained, so spaCy is the better choice for going further with TF-IDF scoring. For TF-IDF, I used scikit-learn (heaven of ML). Its TfidfVectorizer does everything you need.

After we compute the TF-IDF scores, we convert each question into a weighted average of its word2vec vectors using these scores. The code below does this for just the “question1” column.
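A minimal sketch of this step, using scikit-learn’s TfidfVectorizer (here with IDF weights as the per-word scores) and random stand-in vectors in place of spaCy’s 300-dimensional embeddings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "how do i learn python",
    "how can i learn python fast",
    "what is artificial intelligence",
]

# Fit TF-IDF over the question corpus and keep one weight per vocabulary word.
vectorizer = TfidfVectorizer()
vectorizer.fit(questions)
word2weight = {w: vectorizer.idf_[i] for w, i in vectorizer.vocabulary_.items()}

# Stand-in word vectors (in the post these come from spaCy's 300-d embeddings).
rng = np.random.default_rng(0)
dim = 300
word_vec = {w: rng.normal(size=dim) for w in vectorizer.vocabulary_}

def question_vector(q):
    """IDF-weighted average of the word vectors of a question."""
    words = [w for w in q.split() if w in word_vec]
    weights = np.array([word2weight[w] for w in words])
    vecs = np.stack([word_vec[w] for w in words])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

q1_matrix = np.stack([question_vector(q) for q in questions])
print(q1_matrix.shape)  # one 300-d vector per question
```

Rare, discriminative words get larger IDF weights and so dominate the average, which is exactly the effect described above.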

Now we are ready to create the training data for the Siamese network. Basically, I just fetch the labels and convert the mean word2vec vectors to NumPy format. I also split the data into train and test sets.

At this stage, we need to define the Siamese network structure. I use Keras for its simplicity. Below is the whole script that I used to define the model.

I share here the best-performing network, which has residual connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity, with Batch Normalization per layer. This is particularly important, since the BN layers enhance performance considerably. I believe they normalize the final feature vectors, and Euclidean distance performs better in this normalized space.

I tried cosine distance, which is theoretically more concordant with Word2Vec vectors, but could not obtain better results with it. I also tried normalizing the data to unit variance or unit L2 norm, but nothing gave better results than the original feature values.

Let’s train the network with the prepared data. I used the same model and hyper-parameters for all configurations. It is always possible to optimize these further, but the results below already give a promising baseline.

Results

In this section, I’d like to share the test-set accuracy values obtained with different model and feature-extraction settings. We expect to see improvement over 0.63, since that is the accuracy we get when we predict 0 for all labels.
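That 0.63 figure is just the majority-class share from the counts given earlier:

```python
# Majority-class baseline: predict "not duplicate" (label 0) for every pair.
negatives, positives = 255045, 149306
baseline_acc = negatives / (negatives + positives)
print(round(baseline_acc, 2))  # 0.63
```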

These are the best results I obtained with the different embedding models. They all use the same network and hyper-parameters, found by tuning on the last configuration shown below.

Gensim (my model) + Siamese: 0.69

spaCy + Siamese: 0.72

spaCy + TF-IDF + Siamese: 0.79

We can also investigate the effect of different model architectures. The following values all use the best word2vec configuration shown above.

Adam works quite well for this problem compared to SGD with learning-rate scheduling. Batch Normalization also yields a good improvement. I tried introducing Dropout between layers in different positions (before ReLU, after BN, etc.); the best I obtained was 0.75. Concatenating different layers improves the performance by a final 1 percent.

In conclusion, I have tried to present a solution to this unique problem by combining different aspects of deep learning: we start with Word2Vec, combine it with TF-IDF, and then use a Siamese network to find duplicates. The results are not perfect and leave room for further optimization. However, this is a small demonstration of the power of deep learning in this domain. I hope you find it useful :).

In simple terms, dilated convolution is just a convolution applied to the input with defined gaps. With this definition, given a 2D image as input, dilation rate k=1 is normal convolution, k=2 means skipping one pixel per input, and k=4 means skipping 3 pixels. It is best seen in the figures below, which use the same k values.
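A minimal 1-D numpy sketch of the idea (the 2D case is analogous): the kernel taps are placed `dilation` samples apart, so the same 3-tap kernel covers a wider stretch of input as the dilation grows.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1-D convolution whose kernel taps are `dilation` apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # input width covered by the kernel
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(8, dtype=float)              # [0, 1, ..., 7]
avg = np.array([1/3, 1/3, 1/3])
print(dilated_conv1d(x, avg, dilation=1))  # taps at i, i+1, i+2
print(dilated_conv1d(x, avg, dilation=2))  # taps at i, i+2, i+4
```

With dilation 1 each output averages three adjacent samples; with dilation 2 the same three taps span five samples, with one skipped between each pair of taps.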

The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter (3×3 in this example), and the green area is the receptive field captured by each of these inputs. The receptive field is the implicit area captured on the initial input by each input (unit) to the next layer.

Dilated convolution is a way of increasing the receptive field (global view) of the network exponentially while the number of parameters grows only linearly. For this reason, it finds use in applications that care about integrating knowledge of a wider context at lower cost.
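To see the exponential growth, here is a quick receptive-field calculation for stacked 3-tap layers with doubling dilations (an assumed WaveNet-style schedule, just for illustration):

```python
# Receptive field of stacked dilated convolutions with kernel size 3.
# Each layer adds (kernel_size - 1) * dilation inputs to the receptive field,
# so doubling the dilations grows the field exponentially in depth while the
# parameter count grows only linearly (one 3-tap kernel per layer).
kernel_size = 3
receptive_field = 1
for dilation in [1, 2, 4, 8]:
    receptive_field += (kernel_size - 1) * dilation
print(receptive_field)  # 31
```

Four 3-tap layers see 31 inputs; with dilation fixed at 1 the same four layers would see only 9.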

One common use is image segmentation, where each pixel is labelled with its corresponding class, so the network output needs to be the same size as the input image. The straightforward way to do this is to apply convolutions and then add deconvolution layers to upsample [1]. However, this introduces many more parameters to learn. Instead, dilated convolution can be applied to keep the output resolution high, avoiding the need for upsampling [2][3].

Dilated convolution is applied in domains besides vision as well. Good examples are the WaveNet [4] text-to-speech solution and ByteNet [5] for linear-time text translation. Both use dilated convolution to capture a global view of the input with fewer parameters.

In short, dilated convolution is a simple but effective idea, and you might consider it when you need:

Detection of fine details by processing inputs at higher resolutions.

A broader view of the input to capture more contextual information.

Faster run-time with fewer parameters.

[1] Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1

[2] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR, 1–14. Retrieved from http://arxiv.org/abs/1412.7062

[3] Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR, 1–9. Retrieved from http://arxiv.org/abs/1511.07122

[4]Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499

[5]Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. Arxiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099

Principal Component Analysis or PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

In other words, Principal Component Analysis (PCA) is a technique for detecting the main components of a data set in order to reduce it to fewer dimensions while retaining the relevant information.

For example, let X be a data set with zero mean, that is, the n × p matrix formed by n observations of p variables, where the element x_ij, as usual, contains the value of the j-th observable in the i-th observation of the experiment.

A principal component is a linear combination of the variables chosen so that it maximizes the variance of the projected data.

Let’s plot the data set and compute the PCA. The red dots of the figure below show the considered data; the blue arrow shows the eigenvector with the maximum eigenvalue.

Eigendecomposition – Computing Eigenvectors and Eigenvalues

The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.
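As a minimal numpy sketch of this step (synthetic 2-D data standing in for the post’s dataset; `eigh` is the appropriate routine for a symmetric covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data, centered to zero mean as the definition requires.
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X -= X.mean(axis=0)

cov = np.cov(X.T)                           # 2x2 covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)    # columns of eig_vecs are eigenvectors

order = np.argsort(eig_vals)[::-1]          # sort by decreasing variance
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print(eig_vals[0] / eig_vals.sum())         # share of variance along the first PC
```

The first column of `eig_vecs` points along the direction of greatest variance, and the eigenvalue ratio shows how much of the total variance that axis explains.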

Before sorting, let’s check the eigenvectors to make sure everything is ok (each should have unit norm).

# let's check that every eigenvector has unit norm
for ev in eig_vecs:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')

Everything ok!

Now we need to make a list of (eigenvalue, eigenvector) tuples and sort them from high to low.

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

Eigenvalues in descending order:
1.89522131626
0.124980703938

Building a Projection Matrix

# Choose the "top 2" eigenvectors with the highest eigenvalues;
# we are going to use these values to build matrix W.
matrix_w = np.hstack((eig_pairs[0][1].reshape(2, 1),
                      eig_pairs[1][1].reshape(2, 1)))
print('Matrix W:\n', matrix_w)

We will use this data to plot our output later so we can compare with a custom gradient descent approach.

There are several numerical techniques that allow us to find a point corresponding to the saddle point of the Lagrangian. One way to tackle the problem is to “construct a new function, related to the Lagrangian, that (ideally) has a minimum at the solution.

This new function can be considered as ’distorting’ the Lagrangian at infeasible points so as to create a minimum at the solution. Unconstrained minimization techniques can then be applied to the new function. This approach can make it easier to guarantee convergence to a local solution, but there is the danger that the local convergence properties of the method can be damaged.

The ’distortion’ of the Lagrangian function can lead to a ’distortion’ in the Newton equations for the method. Hence the behavior of the method near the solution may be poor unless care is taken.” Another way to tackle the problem is to maintain feasibility at every iteration, that is, to ensure that the updates follow the implicit curve defined by the constraint (here, w.T*w = 1). For the toy problem we are considering here it is relatively easy. Assume we start from a point w_0 that satisfies the constraint.

The algorithm can be summarized as follows:

Compute the gradient (observe that we compute the gradient of the Lagrangian with respect to w).

Compute an estimate of lambda, taking the value of lambda that minimizes the gradient of the Lagrangian.

Assume that the update is w_{k+1} = w_k − α_k ∇L(w_k). For each candidate update, project it onto the constraint w.T*w = 1. Find the α_k value that decreases the Lagrangian with respect to its previous value.

Go to step 1 and repeat until convergence.

Let’s now implement the KKT conditions and see if we are able to obtain the same result as the closed-form solution. We will use projected gradient descent to obtain the solution.

Let A be our covariance matrix.

# A is the covariance matrix of the considered data
A = np.cov(c1.T)
A

Now we set up our initial values

# Tolerance
tol = 1e-08
# Initial alpha value (line search)
alpha = 1.0
# Initial values of w. DO NOT CHOOSE w=(0,0)
w = np.array([1., 0.])

Now we compute the eigenvalues and eigenvectors

# let's see now the eigvals and eigvects
eig_vals, eig_vecs = np.linalg.eig(A)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)

Now we compute the projection onto the constraint w.T*w = 1.

# now let's compute the projection onto the constraint w.T*w = 1
den = np.sqrt(np.dot(w.T, w))
w = w / den

The next step is to compute lambda.

# now we calculate lambda
lam = -np.dot(np.dot(w.T, (A + w.T)), w) / 2 * np.dot(w.T, w)

Let’s review our initial values

# Initial Lagrangian value and iteration counter (used by the loop below)
lag = -np.dot(w.T, np.dot(A, w)) - lam * (np.dot(w.T, w) - 1)
cont = 0

print("Initial values")
print("Lagrangian value =", lag)
print(" w =", w)
print(" x =", m1)
print(" y =", s1)

Initial values
Lagrangian value = -0.858313040377
w = [ 1. 0.]
x = [4.0, -1.0]
y = [[1, 0.9], [0.9, 1]]

Let’s now compute our function using gradient descent

# let's now compute the entire values for our function
while (alpha > tol) and (cont < 100000):
    cont = cont + 1
    # Gradient of the Lagrangian
    grw = -np.dot(w.T, (A + w.T)) - 2 * lam * w.T
    # Used to know if we finished line search
    finished = 0
    while (finished == 0) and (alpha > tol):
        # Update
        aux_w = w - alpha * grw
        # Our projection
        den = np.sqrt(np.dot(aux_w.T, aux_w))
        aux_w = aux_w / den
        # Compute new value of the Lagrangian.
        aux_lam = -np.dot(np.dot(aux_w.T, (A + w.T)), aux_w) / 2 * np.dot(aux_w.T, aux_w)
        aux_lag = -np.dot(aux_w.T, np.dot(A, aux_w)) - lam * (np.dot(aux_w.T, aux_w) - 1)
        # Check if this is a descent
        if aux_lag < lag:
            w = aux_w
            lam = aux_lam
            lag = aux_lag
            alpha = 1.0
            finished = 1
        else:
            alpha = alpha / 2.0

Let’s now review our final values

# Let's now review our final values!
print(" Our Final Values")
print(" Number of iterations", cont)
print(" Obtained values are w =", w)
print(" Correct values are w =", veps[idx])
print(" Eigenvectors are =", eig_vecs)

Our Final Values
Number of iterations 22
Obtained values are w = [ 0.71916397 0.69484041]
Correct values are w = [ 0.71916398 -0.6948404 ]
Eigenvectors are = [[ 0.71916398 -0.6948404 ]
[ 0.6948404 0.71916398]]

Let’s compare our new values with the ones obtained by the closed-form solution.

# Full comparison
print(" Gradient Descent values w =", w)
print(" PCA analysis approach w =", matrix_w)
print(" Closed Solution w =", veps[idx])
print(" Closed Solution w =", veps, vaps)

Machine learning is everywhere and we are amazed by the capabilities of these algorithms. However, they are not infallible and sometimes behave in surprisingly dumb ways. For instance, consider an image recognition model that achieves really high empirical performance and works great on normal images. Nevertheless, it might fail when you change some of the pixels of an image, even though this little perturbation is imperceptible to the human eye. We call such an image an adversarial instance.

There are various methods to generate adversarial instances [1][2][3][4]. One method is to take the derivative of the model outputs with respect to the input values, so that we can change instance values to manipulate the model’s decision. Another approach exploits genetic algorithms to generate manipulative instances that are confidently classified as a known concept (say ‘dog’) but look like nothing to human eyes.

So why are these models so weak against adversarial instances? One plausible idea states that adversarial instances lie in low-probability regions of the instance space. Therefore, they look alien to a network trained on a limited number of instances from higher-probability regions.

That being said, maybe there is no way to fully escape adversarial instances, especially when they are produced by exploiting the weaknesses of a target model with gradient-guided probing. This is an analytic way of searching for a misleading input for that model, with (almost) guaranteed success. Therefore, one way or another, we can find a perturbed input that deceives any given model.

Based on that observation, I believe adversarial instances can be countered by multiple models backing each other up. In essence, this is the motivation of this work.

Proposed Work

In this work, I’d like to share my observations on the strength of ensembles against adversarial instances. This is just a toy example with many shortcomings, but I hope it conveys the idea along with some empirical evidence.

In summary, this is what we do here:

Train a baseline MNIST ConvNet.

Create adversarial instances against this model using cleverhans and save them.

Measure the baseline model’s performance on the adversarial instances.

Train the same ConvNet architecture, this time including adversarial instances in training, and measure its performance.

Train an ensemble of 10 models with the same ConvNet architecture, measure the ensemble’s performance, and check the backing argument stated above.

My full code is on GitHub; here I only share the results and observations. You need cleverhans, TensorFlow and Keras for adversarial generation, and PyTorch for ensemble training. (Sorry for the verbosity of libraries, but I wanted to try PyTorch after years of tears with Lua.)

One limitation of the proposed experiment is that we do not recreate adversarial instances for each model; we reuse a previously created set. Still, I believe the empirical results support my assumption even in this setting. In addition, I plan a more extensive study as future work.

Create adversarial instances.

I start by training a simple ConvNet architecture on the MNIST dataset using the legitimate train and test splits. This network reaches 0.98 test accuracy after 5 epochs.

To create adversarial instances, I use the fast gradient sign method, which perturbs images using the derivative of the model outputs with respect to the input values. You can see a batch of adversarial samples below.
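To make the method concrete, here is a minimal NumPy sketch of the fast gradient sign perturbation on a toy logistic model. All values are illustrative; the post itself applies cleverhans to the ConvNet:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Fast gradient sign perturbation for a logistic model p = sigmoid(w.x + b).
    The gradient of the cross-entropy loss wrt the input is (p - y) * w,
    so each input value is stepped by eps in the direction of that sign."""
    p = sigmoid(x @ w + b)
    grad = (p - y) * w
    return x + eps * np.sign(grad)

# A toy "image" that the toy model confidently classifies as class 1.
w, b = np.array([2.0, -1.0, 0.5]), 0.0
x, y = np.array([1.0, -1.0, 1.0]), 1.0

x_adv = fgsm(x, y, w, b, eps=0.25)
# The perturbed input lowers the model's confidence in the true class:
# sigmoid(x @ w + b) ~ 0.97 vs sigmoid(x_adv @ w + b) ~ 0.93
```

With a deep network the input gradient is obtained by backpropagation rather than in closed form, but the perturbation rule is the same.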

The same network suffers on adversarial instances (as above) created from the legitimate test set: it achieves 0.09 accuracy, which is worse than random guessing.

Plot adversarial instances.

Next, I examine the representational power of the trained model on both the normal and the adversarial instances. I do this with the well-known dimensionality reduction technique t-SNE: I first compute the last hidden layer representation of the network for each instance, then use these values as input to t-SNE, which projects the data onto 2-D space. Here is the final projection for both types of data.

These projections clearly show that adversarial instances look like random data points to the trained model, receding from the real data points into what we called the low-probability regions. I also trained the same model architecture while dynamically creating adversarial instances at training time, then tested it on the previously created adversarials. This new model yields 0.98 accuracy on the normal test set, 0.91 on the previously created adversarial test set, and 0.71 on its own dynamically created adversarials.

The results above show that including adversarial instances strengthens the model, which is consistent with the low-probability-region argument: by providing adversarials, we let the model discover the low-probability regions they occupy. However, this is not applicable to large-scale problems like ImageNet, since you cannot afford to augment millions of images at every iteration. Assuming it works, ensembling is therefore a more viable alternative, and it is already a common method for increasing overall prediction performance.
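The dynamic adversarial-training loop described above can be sketched end-to-end on a toy problem. This is a NumPy logistic regression standing in for the ConvNet, with illustrative data and constants, regenerating FGSM adversarials at every iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary problem: two overlapping Gaussian blobs standing in for MNIST.
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(2), 0.0
eps, lr = 0.3, 0.5
for _ in range(300):
    # FGSM against the current model: d(loss)/dx = (p - y) * w.
    p = sigmoid(X @ w + b)
    X_adv = X + eps * np.sign(np.outer(p - y, w))
    # One gradient step on the union of clean and adversarial batches.
    Xa, ya = np.vstack([X, X_adv]), np.concatenate([y, y])
    pa = sigmoid(Xa @ w + b)
    w -= lr * Xa.T @ (pa - ya) / len(ya)
    b -= lr * (pa - ya).mean()

# The adversarially trained model stays accurate on clean data and on
# fresh FGSM perturbations aimed at the final weights.
clean_acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
p = sigmoid(X @ w + b)
adv_acc = ((sigmoid((X + eps * np.sign(np.outer(p - y, w))) @ w + b) > 0.5) == y).mean()
```

Regenerating `X_adv` inside the loop is exactly the per-iteration augmentation that becomes too expensive at ImageNet scale.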

Ensemble Training

In this part, I train multiple models in different ensemble settings. First, I train N different models on the same full training set. Then I bootstrap: I train N different models, each on data randomly sampled from the normal training set. I also observe the effect of N.
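The bootstrapped setting can be sketched as follows; `bootstrap_splits` is a hypothetical helper (not from the post's code) drawing one sampled-with-replacement index set per ensemble member:

```python
import numpy as np

def bootstrap_splits(n_samples, n_models, seed=0):
    """One sampled-with-replacement index set per ensemble member;
    each model is then trained only on its own resampled subset."""
    rng = np.random.default_rng(seed)
    return [rng.choice(n_samples, size=n_samples, replace=True)
            for _ in range(n_models)]

# e.g. 10 models over the 60,000 MNIST training images
splits = bootstrap_splits(n_samples=60000, n_models=10)
```

Sampling with replacement means each model sees roughly 63% of the unique training images, so the members end up with different strengths and weaknesses.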

The best single model obtains 0.98 accuracy on the legitimate test set, but only 0.22 on the adversarial instances created in the previous part.

When we ensemble models by averaging their scores, we see no gain and remain stuck at 0.24 accuracy for both training settings. Surprisingly, however, when we perform max ensembling (counting only the most confident model for each instance), we observe 0.35 accuracy for the uniformly trained ensemble and 0.57 for the bootstrapped ensemble with N equal to 50.
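The two ensembling rules compared above can be sketched as follows (illustrative NumPy; `probs` stands for the per-model softmax outputs):

```python
import numpy as np

def mean_ensemble(probs):
    """probs: (n_models, n_samples, n_classes) softmax scores.
    Average the scores across models, then take the argmax class."""
    return probs.mean(axis=0).argmax(axis=1)

def max_ensemble(probs):
    """For each sample, trust only the single most confident model,
    i.e. the one whose top class score is highest."""
    top_conf = probs.max(axis=2)          # (n_models, n_samples)
    best_model = top_conf.argmax(axis=0)  # (n_samples,)
    cols = np.arange(probs.shape[1])
    return probs[best_model, cols].argmax(axis=1)

# Three models, one sample: two mildly prefer class 1, one is very
# confident in class 0, so the two rules disagree.
probs = np.array([[[0.9, 0.1]],
                  [[0.2, 0.8]],
                  [[0.2, 0.8]]])
print(mean_ensemble(probs))  # [1] - the averaged score favours class 1
print(max_ensemble(probs))   # [0] - the most confident model says class 0
```

The toy disagreement above is the mechanism behind the gap in the numbers: averaging dilutes a single confident (and possibly correct) model, while maxing lets it override its uncertain peers.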

Increasing N raises adversarial performance, and much more so for the bootstrapped ensemble. With N=5 we obtain 0.27 for the uniform ensemble and 0.32 for the bootstrapped ensemble; with N=25 we obtain 0.30 and 0.45, respectively.

These values are interesting, especially the gap between mean and max ensembling. My intuition for the superiority of maxing is that it covers up the weaknesses of individual models with the most confident one, as I suggested in the first place. In the same vein, adversarial performance increases as we use smaller random chunks per model (up to a certain threshold) with increasing N, the number of models in the ensemble. This suggests that bootstrapping lets each model learn some local regions better and others worse, with the weak regions covered by a more confident peer in the ensemble.

As I said before, it is not ideal to reuse adversarials created against the baseline model from the first part. However, I believe my claim still holds: even if we include the baseline model in our best max ensemble, its mistakes would be corrected by the other models. I tried this (after the comments below), and including the baseline model in the ensemble only reduces accuracy from 0.57 to 0.55, still far higher than any other method that never sees adversarials during training.

Conclusion

It is much harder to create adversarials against an ensemble of models with gradient methods, though genetic algorithms are still applicable.

Blind spots of individual models are covered by their peers in the ensemble when we rely on the most confident one.

We observe that training a model on adversarial instances created dynamically at each iteration resolves the adversarials created from the test set: as the model sees examples from these regions, it becomes immune to them. This supports the argument that adversarial instances live in low-probability regions.

(Before I Finish) This is Serious!

Before I finish, I’d like to widen the meaning of this post’s heading: ensemble against the adversarial!

“Adversarial instances” is a peculiar AI topic. It attracted much interest at first, but now seems forgotten outside of research targeting GANs, since it does not yield direct profit compared to improving accuracy.

Even though this has been the case so far, we need to consider this topic more carefully from now on. As we deploy more extensive and more capable AI in many domains (such as health, law and governance), adversarial instances are likely to cause greater problems, whether intentionally or by pure randomness. This is not a sci-fi scenario I’m drawing here; it is a reality already prototyped in [3]. Just replace the simple recognition model in [3] with an AI ruling a court of justice.

Therefore, if we believe in a future embracing AI as a great tool to “make the world a better place!”, we need to study this subject extensively before passing a certain AI threshold.

Last Words

This work overlooks many important aspects, but after all it only aims to share some findings from my spare-time research. For a next post, I’d like to study unsupervised models such as Variational Autoencoders and Denoising Autoencoders by applying them to adversarial instances (I’ve already started!). I also plan to work on other methods for creating different types of adversarials.

From this post, you should take away:

References on adversarial instances.

Good example code waiting for you on GitHub that can be used in many different projects.

The power of ensembles.

Some unproven claims and opinions on the topic.

In any case, I hope you like it!

References

[1] Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep Neural Networks are Easily Fooled. Computer Vision and Pattern Recognition, 2015 IEEE Conference on, 427–436.

[2] Szegedy, C., Zaremba, W., & Sutskever, I. (2013). Intriguing properties of neural networks. arXiv preprint, 1–10. Retrieved from http://arxiv.org/abs/1312.6199

[3] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2016). Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. arXiv. Retrieved from http://arxiv.org/abs/1602.02697

[4] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. Iclr 2015, 1–11. Retrieved from http://arxiv.org/abs/1412.6572

In the last three years we’ve seen a notable improvement in how UK charities and social enterprises harness data. We know that it can be a powerful tool for driving social change. However, we also recognize that adopting data on an organizational level is often a slow, laborious and sometimes painful process; one that typically entails a shift in thinking among leadership, acquiring new skills and talent, breaking down data silos and raising awareness about what data can do across an organization.

To help charities better understand and alleviate the challenges of incorporating data into their efforts, DataKind UK wanted to build a data maturity framework. In partnership with Data Orchard, a small UK-based research consultancy, we undertook the Data Evolution project to map out the journey towards data maturity. It was supported by Nesta, Teradata, Esmée Fairbairn Foundation and Access – The Foundation for Social Investment.

As part of the project we ran two workshops, surveyed 200 social sector organizations, carried out in-depth assessments with 47 people from 12 social sector organizations and developed a framework documenting the five stages we identified these organizations go through as they adopt new data practices and become more data savvy.

You can learn more about the project here and also explore the maturity framework itself here. If you’re interested in reading more about our data maturity initiative as well as similar work being developed for local government, here’s a great post we co-wrote with Nesta you should check out.

If you’re looking for a job that lets you use your data and technology skills for social good, check out the selection of opportunities below we’ve heard about through the grapevine or stumbled upon online. Know of a great opportunity we missed? Tweet at us or email us at contact@datakind.org and we’ll share it.

DataKind is Hiring!

Our Executive Associate (New York) will be the right-hand person to our Director of Operations.