DataKind Named One of Fast Company’s Top 10 Most Innovative Nonprofits


Big news – DataKind has been named one of Fast Company’s Top 10 Most Innovative Nonprofits for 2017 and we have you to thank.

Part of the magazine’s annual ranking of the World’s 50 Most Innovative Companies, this list honors leading enterprises and rising newcomers that exemplify the best in nimble business and impactful innovation. We were humbled to be recognized in the nonprofit category, alongside truly inspiring organizations like our friend and past project partner GiveDirectly, fellow New York-based The Fund for Public Housing and The Movement for Black Lives just to name a few.

While our stylish orange hoodies almost certainly swayed the judges, we know the real reason we were selected is because of all of you.

Even more than data, our work is about people. As we tell our incredible volunteers and project partners at our community events or before they kick off a project – YOU are DataKind. Our work depends on our global community of over 14,000 socially conscious data scientists, social innovators, subject matter experts, funders and data for good enthusiasts of all stripes.

Thank you for donating your time and talent to apply data science, AI and machine learning in the service of humanity. This honor goes to all of you, dear DataKinders – congratulations!

 

 

 



Source: DataKind – DataKind Named One of Fast Company’s Top 10 Most Innovative Nonprofits

Duplicate Question Detection with Deep Learning on Quora Dataset


Quora recently released its first public dataset. It includes 404351 question pairs with a label column indicating whether each pair is a duplicate. In this post, I investigate this dataset and propose at least a baseline method using deep learning.

Besides the proposed method, the post includes some examples showing how to use Pandas, Gensim, Spacy and Keras. For the full code, check Github.

Data Quirks

There are 255045 negative (non-duplicate) and 149306 positive (duplicate) instances. This induces a class imbalance. However, when you consider the nature of the problem, it seems reasonable to keep the same data bias in your ML model, since negative instances are more likely in a real-life scenario.

When we analyze the data, the shortest question is 1 character long (which is useless for the task) and the longest question is 1169 characters (a long, complicated love affair question). I see that if either question of a pair is shorter than 10 characters, the pair does not make sense, so I remove such pairs. The average length is 59 characters and the standard deviation is 32.

There are two other columns, “q1id” and “q2id”, but I really do not know how they are useful, since the same question used in different rows has different ids.

Some labels are wrong, especially among the duplicates. In any case, I decided to rely on the labels and defer pruning, since it requires heavy manual effort.

Proposed Method

Converting Questions into Vectors

Here, I plan to use Word2Vec to convert each question into a semantic vector, then stack a Siamese network on top to detect whether the pair is a duplicate.

Word2Vec is a general term for similar algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:

king - man + woman = queen.
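The analogy can be tried out on a toy scale. The sketch below uses hand-made 2-D vectors (hypothetical values, not taken from any trained model) where one axis loosely stands for "royalty" and the other for "gender"; the `analogy` helper is an illustrative name, not part of any library.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

Real embeddings only satisfy such relations approximately, which is why the nearest neighbour by cosine similarity is used instead of an exact match.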

Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents, or feed these vectors to Recurrent Neural Networks for more advanced applications.

There are two well-known algorithms in this domain. One is Google’s network architecture, which learns representations by trying to predict the surrounding words of a target word given a certain window size. GLOVE is the other method, which relies on co-occurrence matrices. GLOVE is easy to train and it is flexible enough to add new words outside of your vocabulary. You might like to visit this tutorial to learn more, and check out this brilliant use case, Sense2Vec.

We still need a way to combine word vectors into a single question representation. One simple alternative is taking the mean of all word vectors of each question. This is a simple but really effective approach for document classification and I expect it to work for this problem too. In addition, it is possible to enhance the mean vector representation by using the TF-IDF score of each word: we take a weighted average of the word vectors using these scores. This emphasizes discriminative words and down-weights uninformative, frequent words that are shared by many questions.
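A minimal sketch of the TF-IDF weighted average, assuming a toy corpus, random 3-D word vectors in place of real embeddings, and one common IDF variant (log(N/df) + 1). The names `word_vec`, `idf` and `question_vector` are illustrative and not the code used in this post.

```python
import math
from collections import Counter

import numpy as np

# Toy corpus and hypothetical 3-D word vectors (real models use ~300 dimensions).
questions = [
    "how do i learn python",
    "how do i learn java",
    "what is the capital of france",
]
rng = np.random.default_rng(0)
vocab = sorted({w for q in questions for w in q.split()})
word_vec = {w: rng.normal(size=3) for w in vocab}

# Inverse document frequency: words shared by many questions get a low weight.
n_docs = len(questions)
doc_freq = Counter(w for q in questions for w in set(q.split()))
idf = {w: math.log(n_docs / doc_freq[w]) + 1.0 for w in vocab}

def question_vector(question):
    """TF-IDF weighted average of the word vectors of one question."""
    counts = Counter(question.split())
    weights = np.array([counts[w] * idf[w] for w in counts])
    vecs = np.stack([word_vec[w] for w in counts])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

q_vec = question_vector(questions[0])
print(q_vec.shape, idf["python"] > idf["how"])
```

With all weights equal, this reduces to the plain mean of the word vectors, so the unweighted baseline is a special case of the same function.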

Siamese Network

I described the Siamese network in a previous post. In short, it is a two-branch network architecture that takes one input on each side. It projects data into a space in which similar items are contracted and dissimilar ones are dispersed. It is computationally efficient since the branches share parameters.

A Siamese network tries to contract instances belonging to the same class and disperse instances from different classes in the feature space.
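The contract-and-disperse idea can be sketched with a shared encoder and a contrastive loss. This is an assumption-laden illustration: the post does not state which loss it uses, and the single dense layer, the margin of 1.0 and the random inputs are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights, shared by both branches of the network.
W = rng.normal(scale=0.1, size=(300, 64))
b = np.zeros(64)

def encode(x):
    """Shared encoder: a single dense layer with ReLU."""
    return np.maximum(0.0, x @ W + b)

def contrastive_loss(x1, x2, label, margin=1.0):
    """label = 1 for duplicate pairs, 0 for non-duplicate pairs."""
    d = np.linalg.norm(encode(x1) - encode(x2), axis=1)    # Euclidean distance
    pull = label * d ** 2                                   # contract similar pairs
    push = (1 - label) * np.maximum(0.0, margin - d) ** 2   # disperse dissimilar pairs
    return float(np.mean(pull + push))

q1 = rng.normal(size=(4, 300))   # stand-ins for question1 vectors
q2 = rng.normal(size=(4, 300))   # stand-ins for question2 vectors
labels = np.array([1, 0, 1, 0])

loss = contrastive_loss(q1, q2, labels)
same = contrastive_loss(q1, q1, np.ones(4))   # identical inputs, all "duplicate"
print(loss, same)
```

Because both inputs go through the same `W` and `b`, the parameter count does not grow with the second branch, which is the efficiency argument made above.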

 

Implementation

Let’s load the training data first.

For this particular problem, I train my own GLOVE model by using Gensim.

The above code trains a GLOVE model and saves it. It generates 300-dimensional vectors for words. The hyperparameters could be chosen better, but this is just a baseline to see an initial performance. However, as I’ll show, this model performs below my expectations. I believe this is because our questions are short and do not provide enough semantic structure for GLOVE to learn a salient model.

Due to the performance issue and the observation above, I decided to use a pre-trained GLOVE model which comes free with Spacy. It is trained on Wikipedia and is therefore stronger in terms of word semantics. This is how we use Spacy for this purpose.

Before going further, I have to say I really like Spacy. It is really fast and it does everything you need for NLP in a flash by hiding many intricate details. It deserves a good remuneration. Similar to the Gensim model, it also provides 300-dimensional embedding vectors.

The result I get from Spacy vectors is above that of the Gensim model I trained. It is the better choice for going further with TF-IDF scoring. For TF-IDF, I used scikit-learn (heaven of ML). It provides TfidfVectorizer, which does everything you need.

After we find the TF-IDF scores, we convert each question to a weighted average of word2vec vectors using these scores. The code below does this for just the “question1” column.

Now we are ready to create training data for the Siamese network. Basically, I just fetch the labels and convert the mean word2vec vectors to numpy format. I split the data into train and test sets too.

At this stage, we need to define the Siamese network structure. I use Keras for its simplicity. Below is the whole script that I used to define the model.

Here I share the best performing network, which uses residual connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity. It has Batch Normalization in each layer. This is particularly important, since the BN layers enhance the performance considerably. I believe they normalize the final feature vectors and Euclidean distance performs better in this normalized space.

I tried Cosine distance, which is theoretically more concordant with Word2Vec vectors, but could not obtain better results. I also tried normalizing the data to unit variance or unit L2 norm, but nothing gave better results than the original feature values.

Let’s train the network with the prepared data. I used the same model and hyperparameters for all configurations. It is always possible to optimize these further, but so far I am able to give promising baseline results.

Results

In this section, I share the test set accuracy values obtained with different model and feature extraction settings. We expect to see an improvement over 0.63, since that is the accuracy we get when we set all the labels to 0.
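The 0.63 figure follows from the class counts reported earlier in the post. A quick check, assuming the counts of the full dataset (removing the very short pairs may shift them slightly):

```python
negatives = 255045   # non-duplicate pairs
positives = 149306   # duplicate pairs
total = negatives + positives

# Accuracy of a classifier that always answers "not duplicate" (label 0).
majority_baseline = negatives / total
print(total, round(majority_baseline, 2))
```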

These are the best results I obtained with varying GLOVE models. They all use the same network and hyperparameters, which I tuned on the last configuration listed below.

  • Gensim (my model) + Siamese: 0.69
  • Spacy + Siamese :  0.72
  • Spacy + TF-IDF + Siamese : 0.79

We can also investigate the effect of different model architectures. These values are obtained with the best word2vec model shown above.

  • 2 layers net : 0.67
  • 3 layers net + adam : 0.74
  • 3 layers resnet (after relu BN) + adam : 0.77
  • 3 layers resnet (before relu BN) + adam : 0.78
  • 3 layers resnet (before relu BN) + adam + dropout : 0.75
  • 3 layers resnet (before relu BN) + adam + layer concat : 0.79
  • 3 layers resnet (before relu BN) + adam + unit_norm + cosine_distance : Fail

Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried to introduce Dropout between layers in different orders (before ReLU, after BN etc.); the best I obtained is 0.75. Concatenating different layers improves the performance by 1 percentage point as the final gain.

In conclusion, I tried to present a solution to this unique problem by composing different aspects of deep learning. We start with Word2Vec, combine it with TF-IDF and then use a Siamese network to find duplicates. The results are not perfect and are open to further optimization. However, it is just a small attempt to see the power of deep learning in this domain. I hope you find it useful :).


The post Duplicate Question Detection with Deep Learning on Quora Dataset appeared first on A Blog From Human-engineer-being.



Source: Erogol – Duplicate Question Detection with Deep Learning on Quora Dataset

Dilated Convolution


In simple terms, dilated convolution is just a convolution applied to the input with defined gaps. With this definition, given a 2D image as input, dilation rate k=1 is normal convolution, k=2 means skipping one pixel between inputs, and k=4 means skipping 3 pixels. It is best to look at the figures below, which use the same k values.
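To make the gaps concrete, here is a minimal 1-D sketch (the post talks about 2D images; one dimension is used only to keep the example short). The function name `dilated_conv1d` is illustrative, and the operation is written as cross-correlation without padding, which is how most deep learning libraries define convolution.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1-D dilated convolution (cross-correlation, no padding)."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = (len(kernel) - 1) * dilation + 1      # input positions covered by one output
    out_len = len(x) - span + 1
    out = np.empty(out_len)
    for i in range(out_len):
        taps = x[i : i + span : dilation]        # every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out

x = np.arange(10)            # 0, 1, ..., 9
kernel = [1, 1, 1]
y1 = dilated_conv1d(x, kernel, dilation=1)   # taps at i, i+1, i+2
y2 = dilated_conv1d(x, kernel, dilation=2)   # taps at i, i+2, i+4 (one sample skipped)
print(y1, y2)
```

The kernel has three weights in both calls; only the spacing of the samples it reads changes.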

The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter, which is 3×3 in this example, and the green area is the receptive field captured by each of these inputs. The receptive field is the implicit area captured on the initial input by each input (unit) to the next layer.

Dilated convolution is a way of increasing the receptive field (global view) of the network exponentially while the number of parameters grows only linearly. For this reason, it is used in applications that care about integrating knowledge of the wider context at a lower cost.
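The exponential-versus-linear claim can be checked with a little arithmetic. The sketch below assumes stride-1 layers with a kernel of size 3 along one axis and dilation rates that double at each layer (1, 2, 4, 8); `receptive_field` is an illustrative helper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (along one axis) of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

layers = 4
dilated = receptive_field(3, [2 ** i for i in range(layers)])   # rates 1, 2, 4, 8
plain = receptive_field(3, [1] * layers)                        # no dilation
weights_per_axis = 3 * layers                                   # identical in both cases
print(dilated, plain, weights_per_axis)
```

With the same 12 weights per axis, the dilated stack sees 31 input positions while the plain stack sees 9.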

One general use is image segmentation, where each pixel is labelled with its corresponding class. In this case, the network output needs to be the same size as the input image. The straightforward way to do this is to apply convolution and then add deconvolution layers to upsample [1]. However, this introduces many more parameters to learn. Instead, dilated convolution is applied to keep the output resolution high, which avoids the need for upsampling [2][3].

Dilated convolution is applied in domains besides vision as well. Two good examples are the WaveNet [4] text-to-speech solution and ByteNet [5], which performs text translation in linear time. They both use dilated convolution in order to capture a global view of the input with fewer parameters.

From [5]

In short, dilated convolution is a simple but effective idea and you might consider it in three cases:

  1. Detection of fine details by processing inputs in higher resolutions.
  2. Broader view of the input to capture more contextual information.
  3. Faster run-time with fewer parameters.

[1] Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1

[2] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR, 1–14. Retrieved from http://arxiv.org/abs/1412.7062

[3] Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR, 1–9. http://doi.org/10.16373/j.cnki.ahr.150049

[4] Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499

[5] Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. arXiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099


The post Dilated Convolution appeared first on A Blog From Human-engineer-being.



Source: Erogol – Dilated Convolution

Principal Component Analysis (PCA): A Practical Example

Let’s first define what PCA is.

Principal Component Analysis or PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

In other words, Principal Component Analysis (PCA) is a technique to detect the main components of a data set in order to reduce it to fewer dimensions while retaining the relevant information.

As an example, let X \in \mathbb{R}^{m \times n} be a data set with zero mean, that is, the matrix formed by n observations of m variables. The elements of X are denoted as usual by x_{ij}, meaning the value of variable i in the j-th observation.

A principal component is a linear combination of the variables that maximizes the variance.

Let’s now see a PCA example step by step

1. Create a random toy data set

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

m1 = [4.,-1.]
s1 = [[1,0.9],[0.9,1]]
c1 = np.random.multivariate_normal(m1,s1,100)
plt.plot(c1[:,0],c1[:,1],'r.')

Let’s plot the data set and compute the PCA. In the figure below, the red dots show the data and the blue arrow shows the eigenvector with the largest eigenvalue.

# Eigenvalues (vaps) and eigenvectors (veps, stored as columns) of the covariance matrix
vaps,veps = np.linalg.eig(np.cov(c1.T))
idx = np.argmax(vaps)

plt.plot(c1[:,0],c1[:,1],'r.')
# Arrow from the data mean along the leading eigenvector, scaled by its eigenvalue
plt.arrow(np.mean(c1[:,0]),np.mean(c1[:,1]),
          vaps[idx]*veps[0,idx],vaps[idx]*veps[1,idx],
          linewidth=1,head_width=0.1,color='blue')

PCA Closed Solution

Now that we have visualized it, let’s code the closed-form solution for the PCA.

The first step is to standardize the data. We are going to use the Scikit-learn library.

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(c1)

Eigendecomposition – Computing Eigenvectors and Eigenvalues

The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance Matrix \n%s' %cov_mat)
Covariance Matrix 
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]

Let’s now compare it with the covariance matrix computed by NumPy

# Covariance matrix computed by NumPy, for comparison
print('NumPy Covariance Matrix: \n%s' %np.cov(X_std.T))
NumPy Covariance Matrix: 
[[ 1.01010101  0.88512031]
 [ 0.88512031  1.01010101]]

Now we perform an eigendecomposition on the covariance matrix

cov_mat = np.cov(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
Eigenvectors 
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

Eigenvalues 
[ 1.89522132  0.1249807 ]

Let’s check that the eigenvectors have unit length to see if everything is ok

# The eigenvectors are the columns of eig_vecs; each one should have unit norm
for ev in eig_vecs.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')
Everything ok!

Now we need to make a list of (eigenvalue, eigenvector) tuples and sort them from high to low.

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
Eigenvalues in descending order:
1.89522131626
0.124980703938

Building a Projection Matrix

# Choose the "top 2" eigenvectors with the highest eigenvalues;
# we stack them as the columns of the projection matrix W.
matrix_w = np.hstack((eig_pairs[0][1].reshape(2,1), 
                      eig_pairs[1][1].reshape(2,1)))

print('Matrix W:\n%s' % matrix_w)
Matrix W:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
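The projection matrix is built above but never applied, so here is a small self-contained sketch of that last step. It regenerates a similar toy data set with a fixed seed (so the numbers differ from the ones printed in this post), standardizes it by hand instead of calling StandardScaler, and uses `np.linalg.eigh`, which is intended for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal([4.0, -1.0], [[1.0, 0.9], [0.9, 1.0]], size=100)

# Standardize each column (zero mean, unit variance), as StandardScaler does.
X = (data - data.mean(axis=0)) / data.std(axis=0)

# eigh returns the eigenvalues of a symmetric matrix in ascending order.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X.T))
order = np.argsort(eig_vals)[::-1]
eig_vals, W = eig_vals[order], eig_vecs[:, order]

# Project the data onto the principal axes.
Y = X @ W

# The variance along each new axis equals the corresponding eigenvalue.
projected_var = Y.var(axis=0, ddof=1)
print(eig_vals, projected_var)
```

Keeping only the first column of `W` would reduce the data to one dimension while retaining most of the variance.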

We will use this data to plot our output later so we can compare with a custom gradient descent approach.

There are several numerical techniques that allow us to find a point x^* that satisfies \nabla_{x, \lambda} L(x^*, \lambda^*) = 0, the saddle point. One way to tackle the problem is to “construct a new function, related to the Lagrangian, that (ideally) has a minimum at (x^*, \lambda^*).

This new function can be considered as ’distorting’ the Lagrangian at infeasible points so as to create a minimum at (x^*, \lambda^*). Unconstrained minimization techniques can then be applied to the new function. This approach can make it easier to guarantee convergence to a local solution, but there is the danger that the local convergence properties of the method can be damaged.

The ’distortion’ of the Lagrangian function can lead to a ’distortion’ in the Newton equations for the method. Hence the behavior of the method near the solution may be poor unless care is taken.” Another way to tackle the condition \nabla_{x, \lambda} L(x, \lambda) = 0 is to maintain feasibility at every iteration. That is, to ensure that the updates x^k follow the implicit curve h(x) = 0. For the toy problem we are considering here this is relatively easy. Assume we start from a point x^0 that satisfies h(x^0) = 0, that is, it satisfies the constraint.

The algorithm can be summarized as follows:

  1. Compute the gradient \nabla L(x^k) (observe that we compute the gradient of the Lagrangian with respect to x).
  2. Compute an estimate of \lambda by computing the value of \lambda that minimizes \| \nabla L(x^k) \|^2.
  3. Assume that the update is x^{k+1} = x^k - \alpha^k \nabla L(x^k). For each candidate update x^{k+1}, project it onto the constraint h(x) = 0. Find the \alpha^k value that decreases L(x^{k+1}) with respect to L(x^k).
  4. Go to step 1 and repeat until convergence.
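For the PCA toy problem, the steps above can be written out explicitly. The sketch below assumes that the variable x of the list is the direction vector w, that A is the covariance matrix, and that the constraint is h(w) = w^T w - 1 = 0 with the Lagrangian L(w, \lambda) = -w^T A w - \lambda (w^T w - 1). This form is inferred from the way the code in this post evaluates the Lagrangian; it is not stated in the text.

```latex
\begin{aligned}
L(w, \lambda) &= -w^{T} A w - \lambda \, (w^{T} w - 1), \\
\nabla_{w} L(w, \lambda) &= -(A + A^{T})\, w - 2 \lambda w, \\
\lambda^{k} &= \arg\min_{\lambda} \| \nabla_{w} L(w^{k}, \lambda) \|^{2}
             = -\frac{(w^{k})^{T} (A + A^{T})\, w^{k}}{2\, (w^{k})^{T} w^{k}}, \\
w^{k+1} &= \frac{w^{k} - \alpha^{k} \nabla_{w} L(w^{k}, \lambda^{k})}
                {\| w^{k} - \alpha^{k} \nabla_{w} L(w^{k}, \lambda^{k}) \|}.
\end{aligned}
```

For a symmetric A, a stationary point satisfies A w = -\lambda w, so w is an eigenvector of A and -\lambda is its eigenvalue. This is why the result of the iteration can be compared with the closed solution.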

Let’s now implement the KKT conditions to see if we are able to obtain the same result as the one obtained with the closed solution. We will use the projected gradient descent to obtain the solution.

Let A be our covariance matrix

# A is the covariance matrix of the considered data
A = np.cov(c1.T)
A

Now we set up our initial values

# Tolerance
tol=1e-08

# Initial alpha value (line search)
alpha=1.0

# Initial values of w. DO NOT CHOOSE w=(0,0)
w = np.array([1., 0.])

Now we compute the eigenvalues and eigenvectors

# let's see now the eigvals and eigvects

eig_vals, eig_vecs = np.linalg.eig(A)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

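Now we project w onto the constraint w.T*w = 1

# project w onto the constraint w.T*w = 1 by normalizing it
den = np.sqrt(np.dot(w.T,w))
w = w / den

The next step is to compute lambda and the initial value of the Lagrangian

# lambda that minimizes the squared norm of the gradient, and the initial Lagrangian
lam = -np.dot(np.dot(w.T, (A + A.T)), w) / (2 * np.dot(w.T, w))
lag = -np.dot(w.T, np.dot(A, w)) - lam * (np.dot(w.T, w) - 1)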

Let’s review our initial values

print("Initial values")
print("Lagrangian value = %s" % lag)
print(" w = %s" % w)
print(" x = %s" % m1)
print(" y = %s" % s1)
Initial values
Lagrangian value = -0.858313040377
 w = [ 1.  0.]
 x = [4.0, -1.0]
 y = [[1, 0.9], [0.9, 1]]

Let’s now compute our function using gradient descent

# let's now run the projected gradient descent with a backtracking line search

cont = 0  # iteration counter
while ((alpha > tol) and (cont < 100000)):
    cont = cont+1

    # Gradient of the Lagrangian with respect to w
    grw = -np.dot(w.T, (A + A.T)) - 2 * lam * w.T

    # Used to know if we finished line search
    finished = 0

    while ((finished == 0) and (alpha > tol)):
        # Update
        aux_w = w - alpha * grw

        # Our projection onto the constraint w.T*w = 1
        den = np.sqrt(np.dot(aux_w.T,aux_w))
        aux_w = aux_w / den

        # Compute the new values of lambda and of the Lagrangian.
        aux_lam = -np.dot(np.dot(aux_w.T, (A + A.T)), aux_w) / (2 * np.dot(aux_w.T, aux_w))
        aux_lag = -np.dot (aux_w.T,np.dot(A,aux_w)) - lam * (np.dot(aux_w.T,aux_w) - 1)

        # Check if this is a descent
        if aux_lag < lag:
            w = aux_w
            lam = aux_lam
            lag = aux_lag
            alpha = 1.0
            finished = 1
        else:
            # No descent: halve the step and try again
            alpha = alpha / 2.0

Let’s now review our final values

# Let's now review our final values!
# The eigenvectors are the columns of veps, so the leading one is veps[:, idx]
print(" Our Final Values")
print("  Number of iterations %s" % cont)
print("  Obtained values are w = %s" % w)
print("  Correct values are  w = %s" % veps[:, idx])
print("  Eigenvectors are = %s" % eig_vecs)
 Our Final Values
  Number of iterations 22
  Obtained values are w = [ 0.71916397  0.69484041]
  Correct values are  w = [ 0.71916398  0.6948404 ]
  Eigenvectors are = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]]

Let’s compare our new values vs the ones obtained by the closed solution

# Full comparison
print("  Gradient Descent values   w = %s" % w)
print("  PCA analysis approach     w = %s" % matrix_w)
print("  Closed Solution           w = %s" % veps[:, idx])
print("  Closed Solution           w = %s %s" % (veps, vaps))
  Gradient Descent values   w = [ 0.71916397  0.69484041]
  PCA analysis approach     w = [[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
  Closed Solution           w = [ 0.71916398  0.6948404 ]
  Closed Solution           w = [[ 0.71916398 -0.6948404 ]
 [ 0.6948404   0.71916398]] [ 1.56340502  0.10299214]

Very close! Let’s plot it to visualize the new values versus the ones obtained with the closed solution

import seaborn as sns  # imported only for its plot styling
plt.plot(c1[:,0],c1[:,1],'r.')
# Closed solution: leading eigenvector scaled by its eigenvalue
plt.arrow(np.mean(c1[:,0]),np.mean(c1[:,1]),
          vaps[idx]*veps[0,idx],vaps[idx]*veps[1,idx],
          linewidth=1,head_width=0.1,color='blue')
# Gradient descent solution: w scaled by the same eigenvalue
plt.arrow(np.mean(c1[:,0]),np.mean(c1[:,1]),
          vaps[idx]*w[0],vaps[idx]*w[1],
          linewidth=1,head_width=0.1,color='lightblue')

PCA-gradient-descent

The post Principal Component Analysis (PCA): A Practical Example appeared first on 3Blades.

Source: 3blades – Principal Component Analysis (PCA): A Practical Example

Ensembling Against Adversarial Instances


What is Adversarial?

Machine learning is everywhere and we are amazed by the capabilities of these algorithms. However, they are not perfect and sometimes they behave quite dumbly. For instance, let’s consider an image recognition model. This model achieves really high empirical performance and works great on normal images. Nevertheless, it might fail when you change some of the pixels of an image, even though this little perturbation might be imperceptible to the human eye. We call such an image an adversarial instance.

There are various methods to generate adversarial instances [1][2][3][4]. One method is to take the derivative of the model outputs with respect to the input values so that we can change instance values to manipulate the model decision. Another approach exploits genetic algorithms to generate manipulative instances which are confidently classified as a known concept (say ‘dog’) but mean nothing to human eyes.

Generating adversaries by genetic algorithm [1]

Generating adversaries by input gradient [2].

So why are these models so weak against adversarial instances? One plausible explanation is that adversarial instances lie in the low probability regions of the instance space. Therefore, they look weird to a network that is trained with a limited number of instances from higher probability regions.

That being said, maybe there is no way to escape from these vexing adversarial instances, especially when they are produced by exploiting the weaknesses of a target model with gradient-guided probing. This is an analytic way of searching for a misleading input for that model with (almost) guaranteed success. Therefore, in one way or another, we can find a perturbed input that deceives any given model.

Due to that observation, I believe that adversarial instances can be handled by multiple models backing each other up. In essence, this is the motivation of this work.

Proposed Work

In this work, I share my observations on the strength of ensembles against adversarial instances. This is just a toy example with many shortcomings, but I hope it gives the idea with some empirical evidence.

In summary, this is what we do here:

  • Train a baseline MNIST ConvNet.
  • Create adversarial instances on this model by using cleverhans and save them.
  • Measure the baseline model performance on the adversarial instances.
  • Train the same ConvNet architecture including adversarial instances and measure its performance.
  • Train an ensemble of 10 models of the same ConvNet architecture and measure the ensemble performance to support the backing argument stated above.

My full code can be seen on github; here I only share the results and observations. You need cleverhans, Tensorflow and Keras for adversarial generation and PyTorch for ensemble training. (Sorry for the number of libraries, but I wanted to try PyTorch as well after years of tears with Lua.)

One problem with the proposed experiment is that we do not recreate adversarial instances for each model; we reuse previously created ones. Anyway, I believe the empirical values verify my assumption even in this setting. In addition, I plan to do a more extensive study as future work.

Create adversarial instances.

I start by training a simple ConvNet architecture on the MNIST dataset using the legitimate train and test set splits. This network gives 0.98 test set accuracy after 5 epochs.

For creating adversarial instances, I use the fast gradient sign method, which perturbs images using the derivative of the model outputs with respect to the input values. You can see a bunch of adversarial samples below.
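The idea of the fast gradient sign method can be sketched on a model small enough to differentiate by hand. The example below is an illustration under stated assumptions, not the cleverhans implementation used in this post: it swaps the ConvNet for a logistic-regression model with made-up weights, and it takes the gradient of the loss (as in [4]) with respect to the input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, w, b):
    """Binary cross-entropy of a logistic-regression model for one example."""
    p = sigmoid(np.dot(w, x) + b)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fgsm(x, y, w, b, eps):
    """Move every input value by eps in the direction that increases the loss."""
    p = sigmoid(np.dot(w, x) + b)
    grad_x = (p - y) * w            # d(loss)/dx for logistic regression
    return x + eps * np.sign(grad_x)

# Hypothetical fixed model and input.
w = np.array([2.0, -1.0, 0.5])
b = 0.1
x = np.array([0.5, -0.3, 0.8])
y = 1

x_adv = fgsm(x, y, w, b, eps=0.25)
clean_loss = loss(x, y, w, b)
adv_loss = loss(x_adv, y, w, b)
print(clean_loss, adv_loss)
```

Each input value changes by at most `eps`, yet the loss goes up because all the small changes push the model in the same harmful direction.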

The same network suffers on adversarial instances (as above) created from the legitimate test set. It gives 0.09 accuracy, which is worse than random guessing.

Plot adversarial instances.

Next, I want to see the representational power of the trained model on both the normal and the adversarial instances. I do this using the well-known dimensionality reduction technique t-SNE. I first compute the last hidden layer representation of the network for each instance and use these values as input to t-SNE, which projects the data onto a 2-D space. Here are the final projections for both types of data.

Projection of normal test set.
Projection of adversarial instances.
Projection of both adversarial and normal test instances.

 

These projections clearly show that adversarial instances are just random data points to the trained model, and they lie away from the real data points, creating what we call low probability regions for the trained model. I also trained the same model architecture by dynamically creating adversarial instances at train time, then tested it on the adversarials created previously. This new model yields 0.98 on the normal test set, 0.91 on the previously created adversarial test set and 0.71 on its own dynamically created adversarials.

The above results show that including adversarial instances strengthens the model. This conforms to the low probability region argument: by providing adversarials, we let the model discover the low probability regions of adversarial instances. Besides, this is not applicable to large scale problems like ImageNet, since you cannot afford to augment your millions of images per iteration. Therefore, assuming it works, ensembling is a more viable alternative, as it is already a common method to increase overall prediction performance.

Ensemble Training

In this part, I train multiple models in different ensemble settings. First, I train N different models on the same whole train data. Then, I bootstrap, training N different models by randomly sampling data from the normal train set. I also observe the effect of N.
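A minimal sketch of how the training subsets for a bootstrapped ensemble could be drawn. It assumes the classic bootstrap (sampling with replacement, one sample per model, same size as the train set); the post does not specify its exact sampling scheme, and `bootstrap_indices` is an illustrative helper.

```python
import numpy as np

def bootstrap_indices(n_samples, n_models, seed=0):
    """One index array per model, each drawn with replacement from the train set."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_models)]

n_train = 60000                       # size of the MNIST training split
samples = bootstrap_indices(n_train, n_models=5)

# Fraction of distinct training instances seen by the first model.
unique_fraction = len(np.unique(samples[0])) / n_train
print(unique_fraction)
```

With this scheme each model sees roughly 63% of the distinct training instances, so the models differ in which regions of the data they know well.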

The best single model obtains 0.98 accuracy on the legitimate test set. However, the best single model only obtains 0.22 accuracy on the adversarial instances created in the previous part.

When we ensemble models by averaging scores, we do not see any gain and we are stuck at 0.24 accuracy for both training settings. However, surprisingly, when we perform a max ensemble (only counting on the most confident model for each instance), we observe 0.35 for the uniformly trained ensemble and 0.57 for the bootstrapped ensemble with N equal to 50.
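The difference between the two ensembling rules is easy to see on a toy example. The probabilities below are hypothetical and chosen so that the two rules disagree on the first instance; this is a sketch of the idea, not the PyTorch code behind the reported numbers.

```python
import numpy as np

# Hypothetical class probabilities: shape (n_models, n_instances, n_classes).
probs = np.array([
    [[0.60, 0.10, 0.30], [0.20, 0.70, 0.10]],   # model 1
    [[0.60, 0.10, 0.30], [0.30, 0.60, 0.10]],   # model 2
    [[0.05, 0.90, 0.05], [0.10, 0.80, 0.10]],   # model 3
])

# Mean ensemble: average the scores over models, then pick the best class.
mean_pred = probs.mean(axis=0).argmax(axis=1)

# Max ensemble: keep the highest score any model gives to each class, so the
# most confident model decides the label of each instance.
max_pred = probs.max(axis=0).argmax(axis=1)

print(mean_pred, max_pred)
```

On the first instance, two moderately confident models outvote the third under averaging, while the max rule follows the single model that is very sure of its answer.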

Increasing N raises the adversarial performance. It is much more effective for the bootstrapped ensemble. With N=5 we obtain 0.27 for the uniform ensemble and 0.32 for the bootstrapped ensemble. With N=25 we obtain 0.30 and 0.45 respectively.

These values are interesting, especially the difference between the mean and max ensembles. My intuition behind the superiority of maxing is that it covers up the weaknesses of individual models with the most confident one, as I suggested in the first place. In that vein, one further observation is that adversarial performance increases as we use smaller random chunks for each model, up to a certain threshold, with increasing N (the number of models in the ensemble). This shows us that bootstrapping enables models to learn some of the local regions better and some worse, but the weaker sides are covered by the more confident model in the ensemble.

As I said before, it is not ideal to reuse adversarials created with the baseline model in the first part. However, I believe my claim still holds. Assume that we include the baseline model in our best max ensemble above. Its mistakes would still be corrected by the other models. I also tried this (after the comments below) and included the baseline model in our ensemble. The 0.57 accuracy only drops to 0.55. It is still pretty high compared to any other method that does not see adversarials in the training phase.

Conclusion

  1. It is much harder to create adversarials for an ensemble of models with gradient methods. However, genetic algorithms are applicable.
  2. Blind spots of individual models are covered by their peers in the ensemble when we rely on the most confident one.
  3. We observe that when we train a model with dynamically created adversarial instances per iteration, it resolves the adversarials created from the test set. That is, as the model sees examples from these regions, it becomes immune to adversarials. This supports the argument that low probability regions carry adversarial instances.

(Before finish) This is Serious!

Before I finish, I would like to widen the meaning of this post’s heading: ensemble against adversarial!!

“Adversarial instances” is a peculiar AI topic. It attracted a lot of interest at first, but now it seems forgotten, apart from research targeting GANs, since it does not yield direct profit compared to having better accuracy.

Even though this has been the case so far, we need to consider this topic more painstakingly from now on. As we witness more extensive and more capable AI in many different domains (such as health, law, governance), adversarial instances are likely to cause greater problems, intentionally or by pure chance. This is not a sci-fi scenario I’m drawing here. It is a reality, as prototyped in [3]. Just replace the simple recognition model in [3] with an AI ruling in a court of justice.

Therefore, if we believe in a future embracing AI as a great tool to “make the world a better place!”, we need to study this subject extensively before passing a certain AI threshold.

Last Words

This work overlooks many important aspects, but after all it only aims to share some of my findings from spare-time research. For the next post, I would like to study unsupervised models like Variational Autoencoders and Denoising Autoencoders by applying them to adversarial instances (I already started!). In addition, I plan to work on other methods for creating different types of adversarials.

From this post you should take away:

  • References on adversarial instances.
  • Good example code waiting for you on GitHub that can be used in many different projects.
  • The power of ensembles.
  • Some unproven claims and opinions on the topic.

IN ANY CASE, I HOPE YOU LIKE IT! 🙂

 

References

[1] Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep Neural Networks are Easily Fooled. Computer Vision and Pattern Recognition, 2015 IEEE Conference on, 427–436.

[2] Szegedy, C., Zaremba, W., & Sutskever, I. (2013). Intriguing properties of neural networks. arXiv Preprint arXiv: …, 1–10. Retrieved from http://arxiv.org/abs/1312.6199

[3] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2016). Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. arXiv. Retrieved from http://arxiv.org/abs/1602.02697

[4] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. ICLR 2015, 1–11. Retrieved from http://arxiv.org/abs/1412.6572

The post Ensembling Against Adversarial Instances appeared first on A Blog From Human-engineer-being.



Source: Erogol – Ensembling Against Adversarial Instances

A maturity model for data evolution


In the last three years we’ve seen a notable improvement in how UK charities and social enterprises harness data.  We know that it can be a powerful tool for driving social change. However, we also recognize that adopting data on an organizational level is often a slow, laborious and sometimes painful process; one that typically entails a shift in thinking among leadership, acquiring new skills and talent, breaking down data silos and raising awareness about what data can do across an organization.

To help charities better understand and alleviate the challenges of incorporating data into their efforts, DataKind UK wanted to build a data maturity framework. In partnership with Data Orchard, a small UK-based research consultancy, we undertook the Data Evolution project to map out the journey towards data maturity. It was supported by Nesta, Teradata, Esmée Fairbairn Foundation and Access – The Foundation for Social Investment.

As part of the project we ran two workshops, surveyed 200 social sector organizations, carried out in-depth assessments with 47 people from 12 social sector organizations and developed a framework documenting the five stages we identified these organizations go through as they adopt new data practices and become more data savvy.

You can learn more about the project here and also explore the maturity framework itself here. If you’re interested in reading more about our data maturity initiative as well as similar work being developed for local government, here’s a great post we co-wrote with Nesta you should check out.

 

Image credit: Kelly, via Flickr (CC license)



Source: DataKind – A maturity model for data evolution

Data4Good Job Alert Roundup


If you’re looking for a job that lets you use your data and technology skills for social good, check out the selection of opportunities below we’ve heard about through the grapevine or stumbled upon online. Know of a great opportunity we missed? Tweet at us or email us at contact@datakind.org and we’ll share it.

DataKind is Hiring!

  • Our Executive Associate (New York) will be the right-hand person to our Director of Operations.

Other Great Organizations Are Hiring Too…

*STILL OPEN FROM LAST MONTH*



Source: DataKind – Data4Good Job Alert Roundup

Paper Notes: Intriguing Properties of Neural Networks


Paper: https://arxiv.org/abs/1312.6199

This paper studies how semantic information is described by the higher-level units of a network, and the blind spots of network models against adversarial instances. The authors illustrate the learned semantics by finding the maximally activating instances per unit. They also interpret the effect of adversarial examples and their generalization across different network architectures and datasets.

Findings might be summarized as follows:

  1. Certain dimensions of each layer reflect different semantics of the data. (This is a well-known fact by now, so I skip discussing it further.)
  2. Adversarial instances are general to different models and datasets.
  3. Adversarial instances are more significant to higher layers of the networks.
  4. Auto-Encoders are more resilient to adversarial instances.

Adversarial instances are general to different models and datasets.

They posit that adversarials exploiting a particular network architecture are also hard to classify for others. They illustrate this by creating adversarial instances that yield a 100% error rate on the target network architecture and using them on another network. These adversarial instances turn out to be hard for the other network too (a network with a 2% error rate degrades to 5%). Of course, the influence is not as strong as on the target architecture (which has a 100% error rate).
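A transfer check of this kind can be sketched as follows. It is a toy illustration rather than the paper's setup: the threshold "models", the data and the helper names `error_rate` and `transfer_report` are all assumptions.

```python
import numpy as np

def error_rate(predict, inputs, labels):
    """Fraction of inputs that `predict` labels incorrectly."""
    return float(np.mean(predict(inputs) != labels))

def transfer_report(models, clean_x, adv_x, labels):
    """Map each model name to (error on clean inputs, error on adversarials)."""
    return {
        name: (error_rate(p, clean_x, labels), error_rate(p, adv_x, labels))
        for name, p in models.items()
    }

# Hypothetical 1-D threshold classifiers standing in for two architectures.
models = {
    "target": lambda x: (x > 0.5).astype(int),
    "other": lambda x: (x > 0.3).astype(int),
}
labels = np.array([0, 1, 0, 1])
clean_x = np.array([0.1, 0.9, 0.2, 0.8])
# Inputs pushed just across the target model's decision boundary.
adv_x = np.array([0.6, 0.4, 0.55, 0.45])

report = transfer_report(models, clean_x, adv_x, labels)
```

In this toy case the adversarials crafted against "target" fool it completely and still hurt "other", though less severely, which is the pattern the paper reports.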

Adversarial instances are more significant to higher layers of networks.

As you go to higher layers of the network, the instability induced by adversarial instances increases, as they measure with the Lipschitz constant. This is a justifiable observation, given that the higher layers capture more abstract semantics and therefore any perturbation of the input might override the constituted semantics. (For instance, a concept of “dog head” might be perturbed into something random.)
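A minimal sketch of such a bound, assuming fully connected layers with 1-Lipschitz activations such as ReLU: each layer's Lipschitz constant is bounded above by the largest singular value of its weight matrix, and the bounds multiply across layers. The weight matrices and function names below are made up for illustration.

```python
import numpy as np

def layer_lipschitz_bound(weight):
    """Upper bound for one dense layer: the spectral norm of its weights."""
    return float(np.linalg.norm(weight, ord=2))  # largest singular value

def network_lipschitz_bound(weights):
    """Product of per-layer bounds, which is how the bound compounds with depth."""
    return float(np.prod([layer_lipschitz_bound(w) for w in weights]))

# Hypothetical two-layer network.
w1 = np.array([[2.0, 0.0], [0.0, 1.0]])
w2 = np.array([[3.0, 0.0], [0.0, 0.5]])
per_layer = [layer_lipschitz_bound(w) for w in (w1, w2)]
total = network_lipschitz_bound([w1, w2])
```

A bound well above 1 means a small input perturbation can be amplified by that layer. It is only an upper bound, so a large value signals possible instability rather than proving it.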

Auto-Encoders are more resilient to adversarial instances.

An AE is an unsupervised algorithm, and it differs from the other models used in the paper since it learns the implicit distribution of the training data instead of mere discriminative features. Thus, it is expected to be more tolerant of adversarial instances. Table 2 shows that the AE model needs stronger perturbations to reach 100% classification error with generated adversarials.

My Notes

One intriguing observation is that a shallow model with no hidden units is still more robust to adversarial instances created from the deeper models. This questions the claim that adversarial instances generalize. I believe that, if the term generality is supposed to hold, then a higher degree of susceptibility ought to be observed in this example (and in others too).

I am also happy to see that the unsupervised method is more robust to adversarials, as expected, since I believe the notion of general AI is only possible with unsupervised learning, which learns the space of the data instead of memorizing things. This is also what I plan to examine after this paper, to see how new tools like Variational Auto Encoders behave against adversarial instances.

I believe that it is really hard to fight adversarial instances, especially the ones created by counter-optimization against a particular supervised model. A supervised model always has flaws to be exploited in this manner since it memorizes things [ref], and when you go beyond its scope (especially since adversarial instances are of low probability), it naturally makes mistakes. Besides, it is known that a neural network converges to a local minimum due to its non-convex nature. Therefore, by definition, it has such weaknesses.

Adversarial instances are, in a practical sense, not a big deal right now. However, this is likely to become a far more important topic as we journey toward more advanced AI. Right now, an ML model only makes tolerable mistakes. However, consider the advanced systems awaiting us in the near future, with uses of great importance such as deciding who is guilty or who has cancer. Then this becomes a question of far greater importance.

The post Paper Notes: Intriguing Properties of Neural Networks appeared first on A Blog From Human-engineer-being.



Source: Erogol – Paper Notes: Intriguing Properties of Neural Networks

Ingredients of a Thriving Chapter


By the DataKind Bangalore team

Happy New Year from DataKind Bangalore! As we head into 2017 and our third year as a chapter, we’ve been reflecting on the successes of 2016 and how much our community of over 1200 volunteers and project partners has accomplished together. But what makes a successful DataKind Chapter? For us, there are a few key ingredients. Check out highlights below and get excited for the year ahead!

1 – Volunteers That Embody Our Values

Volunteers are at the center of DataKind’s work. DataKind Bangalore is entirely volunteer-led, supported by a team of committed and talented people that exemplify DataKind’s values. Because they are always going above and beyond, we created the monthly DataKind Bangalore Awards to recognize their specific contributions. Get inspired by our November and December winners!

Chetana Amancharla
Mindfulness
A Senior Technology Architect at Infosys, Chetana works on application development, software process engineering and program management. Chetana has been an incredible addition to the DataCorps team for Centre for Budget and Governance Accountability. She has been building and refining various data visualizations for the tool, polishing our user interface with her great eye for design and detail. And she does all of this on top of her career and Saturday classes, all while taking care of her 8-year-old son. Her knowledge, expertise and commitment are truly an inspiration for the whole community.


Sahil Maheshwari
Diversity and Expertise
Sahil is an engineer and MBA, and a few minutes of conversation with him is enough for anyone to realize that he is an expert data scientist. With his wide-ranging knowledge in statistics and probability, he has been instrumental in the eGovs DataCorps project. A fast learner, he’s also generous in sharing his knowledge and gave a workshop on statistics for the Chapter. His motivation to try out new things inspires all of us to do the same. We’re grateful to have someone with such a rich skillset and rich love of learning with us.


Suchismita Naik
Diversity
An engineer-turned-designer, Suchismita has been leading design work for our CBGA DataCorps project. She exemplifies a great passion and commitment to the work and is always ready to try out those last-minute design suggestions (no matter how cumbersome they seem!). Apart from being the creative brain of the project, she brings great enthusiasm and vigor to the team, making her a fun and energizing teammate to work with.


Murugesan Ramakrishnan
Expertise
A consultant at Fractal Analytics, Murugesan is absolutely fantastic to work with. With an immense will to learn and almost limitless energy, he keeps the eGovernments DataCorps team moving full speed ahead. He’ll git at 2am or on weekdays, blowing us away by how much he accomplishes in addition to his demanding job.

 

2 – High Impact Project Partners

Partner organizations are our vehicle for impact, so we depend on their subject matter expertise to inform our volunteers’ work. We’ve had the honor of working with many incredible organizations this past year, but we were especially excited to launch two long-term DataCorps projects in 2016 that will be wrapping up soon:


Centre for Budget and Governance Accountability (CBGA) is a civil society organization that promotes transparent, accountable, and participatory governance, and a people-centered perspective in the preparation and implementation of budgets. CBGA has been building Open Budgets India, a data portal to make India’s budgets open, usable and easy to comprehend. The DataKind Bangalore team is co-creating a Story Generator Tool that helps users browse visualizations across various state-level fiscal indicators and schemes. The project is still in progress and the beta version of the tool is expected to launch in February.
Check out the source code and documentation >

eGovernments Foundation transforms urban governance with the use of scalable and replicable technology solutions. Using four years of data from the Chennai municipal corporation’s public grievance portal, we hope to build a problem forecasting and alerting system to predict trends and generate alerts at ward levels for better urban governance.
Check out the source code and documentation >

 

3 – A Community of Learning

Any good data scientist or social innovator embraces continuous learning, which is why we were excited to launch DataLearn – a series of talks, workshops and discussions that brought together some of the best names in the data science and social good community.

From creative hacks of Machine Learning – which viewed machine learning and artificial intelligence through the lens of creative subversion – to Data Visualization and Storytelling with Data, the open data environment in India, and ethics, we covered a variety of topics. We also hosted skill-building workshops, including statistical analysis with R, exploring data with pandas, text mining and Natural Language Processing, and web scraping with R.

And true to our word about sharing learnings, we recorded many of these talks!

Check out our YouTube video series to learn more >

And The Last Ingredient? You!

In 2017, we are looking forward to exciting collaborations with more project partners, more values-driven volunteers and learning even more with our community, but we need you to make it a success! Stay tuned for more DataLearn sessions on Bayesian statistics and inference, time series modeling, developments in Deep Learning and more, as well as DataDives and collaborations with NGOs in interesting domains.

Join our Meetup to get involved! >

Follow us on Facebook and Twitter for updates and announcements!



Source: DataKind – Ingredients of a Thriving Chapter

A Big Welcome to DataKind’s Newest Board Member!


We’re thrilled to announce the addition of Elizabeth Grossman to DataKind’s esteemed Board of Directors, a team of top minds and dedicated champions in the Data for Good movement.

Director of Civic Projects in the Technology and Civic Engagement group at Microsoft Corporation, Elizabeth helps design and execute long-term, strategic partnerships for Microsoft that leverage technology to make a sustainable and scalable impact on local and global civic priorities. She has also worked on policy and societal impacts of emerging technologies and governmental science and research program design with universities and scientific societies as well as at the U.S. House of Representatives Committee on Science and the National Academy of Sciences. 

A longtime friend, collaborator and supporter of DataKind, Elizabeth worked with us on the very first DataKind Labs projects to advance the Vision Zero movement, to reduce traffic-related deaths and severe injuries to zero, in three U.S. cities – New York, Seattle and New Orleans.

Her knowledge and expertise in areas such as civic engagement, partnership design, smarter and more sustainable cities, research and technology policy, data sharing and government ecosystems will be indispensable in helping further DataKind’s work and mission, particularly on larger, civic and sector-wide projects like Vision Zero.

With the guidance of our devoted Board of Directors, now five-strong with Elizabeth, and the help of our talented and amazing volunteer community, DataKind finds itself approaching another phase of growth; with more staff, increased chapter engagement, and a thriving volunteer network – all paving the way for more projects and opportunities to harness the power of data science in the service of humanity.

Please take a minute to join us in congratulating Elizabeth and officially welcoming her to DataKind!



Source: DataKind – A Big Welcome to DataKind’s Newest Board Member!