Quora recently released its first public dataset. It includes 404351 question pairs with a label column indicating whether they are duplicates or not. In this post, I would like to investigate this dataset and propose at least a baseline method with deep learning.
Besides the proposed method, the post includes some examples showing how to use Pandas, Gensim, spaCy and Keras. For the full code, you can check GitHub.
There are 255045 negative (non-duplicate) and 149306 positive (duplicate) instances. This induces a class imbalance; however, when you consider the nature of the problem, it seems reasonable to keep the same data bias in your ML model, since negative instances are more likely in a real-life scenario.
When we analyze the data, the shortest question is 1 character long (which is useless for the task) and the longest question is 1169 characters (a long, complicated love-affair question). If either question in a pair is shorter than 10 characters, the pair does not make sense, so I remove such pairs. The average length is 59 characters and the standard deviation is 32.
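As a rough illustration of that filtering rule, here is a minimal sketch in plain Python. The sample rows and the `keep_pair` helper are made up for illustration; the post itself works on the full Quora table with Pandas.

```python
# Hypothetical sample rows: (question1, question2, is_duplicate).
pairs = [
    ("How do I learn Python quickly?", "What is the fastest way to learn Python?", 1),
    ("Why?", "Why is the sky blue during the day?", 0),
    ("What is machine learning?", "How do airplanes stay in the air?", 0),
]

MIN_CHARS = 10  # threshold mentioned in the text


def keep_pair(q1, q2, min_chars=MIN_CHARS):
    """Keep a pair only if both questions have at least min_chars characters."""
    return len(q1) >= min_chars and len(q2) >= min_chars


filtered = [p for p in pairs if keep_pair(p[0], p[1])]
print(len(pairs), "->", len(filtered))
```

The same predicate could be applied row by row to a data frame; only the threshold of 10 characters comes from the post.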
There are two other columns, “q1id” and “q2id”, but I really do not know how they are useful, since the same question used in different rows has different ids.
Some labels are not correct, especially among the duplicate ones. Anyway, I decided to rely on the labels and defer pruning, since it requires heavy manual effort.
Converting Questions into Vectors
Here, I plan to use Word2Vec to convert each question into a semantic vector, and then stack a Siamese network on top to detect whether the pair is a duplicate.
Word2Vec is a general term for similar algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:
king - man + woman = queen.
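To make the analogy concrete, here is a toy sketch with NumPy. The 3-dimensional vectors are hand-made for illustration (real embeddings are learned and have around 300 dimensions), and the `analogy` helper is a hypothetical name, not part of any library.

```python
import numpy as np

# Hand-made toy vectors; axes are roughly (royalty, maleness, femaleness).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman")
print(result)
```

With these toy values the nearest remaining word is "queen"; with learned embeddings the same arithmetic is only approximately true.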
Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents, or feed these vectors to Recurrent Neural Networks for more advanced applications.
There are two well-known algorithms in this domain. One is Google’s network architecture, which learns representations by trying to predict the surrounding words of a target word within a given window size. GloVe is the other method, which relies on co-occurrence matrices. GloVe is easy to train and it is flexible for adding new words outside of your vocabulary. You might like to visit this tutorial to learn more, and check this brilliant use case, Sense2Vec.
We still need a way to combine word vectors into a single question representation. One simple alternative is taking the mean of all word vectors of each question. This is a simple but really effective approach for document classification, and I expect it to work for this problem too. In addition, it is possible to enhance the mean vector representation by using TF-IDF scores defined for each word. We take a weighted average of word vectors using these scores. It emphasizes discriminating words and down-weights uninformative, frequent words that are shared by many questions.
I described the Siamese network in a previous post. In short, it is a two-branch network architecture which takes one input on each side. It projects data into a space in which similar items are pulled together and dissimilar ones are pushed apart. It is computationally efficient since the branches share parameters.
Let’s load the training data first.
For this particular problem, I train my own word-vector model using Gensim.
The above code trains the model and saves it. It generates 300-dimensional vectors for words. The hyperparameters could be chosen better, but it is just a baseline to see an initial performance. However, as I’ll show, this model performs below my expectation. I believe this is because our questions are short and do not provide enough semantic structure for the model to learn salient embeddings.
Due to the performance issue and the observation above, I decided to use a pre-trained GloVe model which comes free with spaCy. It is trained on Wikipedia and is therefore stronger in terms of word semantics. This is how we use spaCy for this purpose.
Before going further: I really like spaCy. It is really fast and it does everything you need for NLP in a flash by hiding many internal details. It deserves a lot of credit. Similar to the Gensim model, it also provides 300-dimensional embedding vectors.
The result I get from spaCy vectors is better than that of the Gensim model I trained, so it is the better choice to carry forward into TF-IDF scoring. For TF-IDF, I used scikit-learn (heaven of ML). It provides TfidfVectorizer, which does everything you need.
After we find the TF-IDF scores, we convert each question to a weighted average of word2vec vectors using these scores. The code below does this for just the “question1” column.
Now we are ready to create the training data for the Siamese network. Basically, I just fetch the labels and convert the mean word2vec vectors to NumPy format. I split the data into train and test sets too.
At this stage, we need to define the Siamese network structure. I use Keras for its simplicity. Below is the whole script that I used to define the model.
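A minimal sketch of the weighting step, assuming toy inputs: the 3-dimensional `word_vectors` and the hand-set `tfidf` scores below are stand-ins I made up, not the spaCy vectors or scikit-learn scores used in the post, and `question_vector` is a hypothetical helper name.

```python
import numpy as np

# Toy stand-ins (assumptions): 3-d word vectors and hand-set TF-IDF scores.
word_vectors = {
    "how":    np.array([0.1, 0.0, 0.2]),
    "to":     np.array([0.0, 0.1, 0.1]),
    "learn":  np.array([0.7, 0.2, 0.1]),
    "python": np.array([0.2, 0.9, 0.3]),
}
tfidf = {"how": 0.2, "to": 0.1, "learn": 1.5, "python": 2.0}


def question_vector(tokens, dim=3):
    """TF-IDF weighted average of the vectors of the known tokens."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        if tok in word_vectors and tok in tfidf:
            total += tfidf[tok] * word_vectors[tok]
            weight_sum += tfidf[tok]
    # Fall back to a zero vector when no token is known.
    return total / weight_sum if weight_sum > 0 else total


tokens = ["how", "to", "learn", "python"]
weighted = question_vector(tokens)
plain_mean = np.mean([word_vectors[t] for t in tokens], axis=0)
print(weighted, plain_mean)
```

Compared with the plain mean, the weighted vector is pulled towards the high-scoring words ("learn", "python") and away from the frequent ones ("how", "to").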
I share here the best-performing network, which has residual connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity. It has Batch Normalization per layer. This is particularly important since BN layers enhance the performance considerably. I believe they are able to normalize the final feature vectors, and Euclidean distance performs better in this normalized space.
I tried Cosine distance, which is theoretically more consistent with Word2Vec vectors, but could not obtain better results. I also tried to normalize the data to unit variance or unit L2 norm, but nothing gives better results than the original feature values.
Let’s train the network with the prepared data. I used the same model and hyperparameters for all configurations. It is always possible to optimize these, but so far I am able to give promising baseline results.
In this section, I would like to share test set accuracy values obtained with different model and feature extraction settings. We expect to see improvement over 0.63, since that is the accuracy we get when we set all the labels to 0.
These are the best results I obtained with varying GloVe models. They all use the same network and hyperparameters, which I tuned on the last configuration listed below.
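The 0.63 figure can be checked directly from the class counts given earlier in the post. This is just the arithmetic, with variable names of my own choosing:

```python
n_negative = 255045  # non-duplicate pairs
n_positive = 149306  # duplicate pairs

n_total = n_negative + n_positive

# Accuracy of a classifier that always predicts label 0 (non-duplicate).
majority_baseline = n_negative / n_total
print(n_total, round(majority_baseline, 2))
```

Any model scoring at or below this value has learned nothing beyond the class prior.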
Gensim (my model) + Siamese: 0.69
Spacy + Siamese : 0.72
Spacy + TF-IDF + Siamese : 0.79
We can also investigate the effect of different model architectures. These are the values following the best word2vec model shown above.
Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried introducing Dropout between layers in different orders (before ReLU, after BN, etc.); the best I obtained is 0.75. Concatenating different layers improves the performance by 1 percent as the final gain.
In conclusion, I tried to present a solution to this unique problem by composing different aspects of deep learning. We start with Word2Vec, combine it with TF-IDF, and then use a Siamese network to find duplicates. Results are not perfect and are open to further optimization. However, it is just a small attempt to see the power of deep learning in this domain. I hope you find it useful :).
In simple terms, dilated convolution is just a convolution applied to the input with defined gaps. With this definition, given that our input is a 2D image, dilation rate k=1 is normal convolution, k=2 means skipping one pixel per input, and k=4 means skipping 3 pixels. It is best to look at the figures below with the same k values.
The figure below shows dilated convolution on 2D data. Red dots are the inputs to a filter, which is 3×3 in this example, and the green area is the receptive field captured by each of these inputs. The receptive field is the implicit area captured on the initial input by each input (unit) to the next layer.
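To see the gaps in action, here is a small NumPy sketch. It is 1-D rather than 2-D to keep it short, and `dilated_conv1d` is a hypothetical helper written for this illustration (a plain loop, not how a deep learning library implements it).

```python
import numpy as np


def dilated_conv1d(x, w, rate=1):
    """Valid 1-D cross-correlation with gaps of `rate - 1` samples between taps."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    span = (len(w) - 1) * rate + 1  # input samples covered by one output
    out_len = len(x) - span + 1
    if out_len <= 0:
        raise ValueError("input too short for this kernel and dilation rate")
    return np.array([
        sum(w[j] * x[i + j * rate] for j in range(len(w)))
        for i in range(out_len)
    ])


x = np.arange(10)      # 0, 1, ..., 9
w = [1.0, 1.0, 1.0]    # 3-tap summing filter

y1 = dilated_conv1d(x, w, rate=1)  # taps at i, i+1, i+2
y2 = dilated_conv1d(x, w, rate=2)  # taps at i, i+2, i+4
y4 = dilated_conv1d(x, w, rate=4)  # taps at i, i+4, i+8
print(y1[0], y2[0], y4[0])
```

The filter always has three weights, but with rate 4 each output already looks at a span of nine input samples.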
Dilated convolution is a way of increasing the receptive field (global view) of the network exponentially with only linear growth in parameters. For this reason, it finds usage in applications that care about integrating knowledge of the wider context at less cost.
One general use is image segmentation, where each pixel is labelled with its corresponding class. In this case, the network output needs to be the same size as the input image. The straightforward way to do this is to apply convolution and then add deconvolution layers to upsample. However, this introduces many more parameters to learn. Instead, dilated convolution is applied to keep the output resolution high, and it avoids the need for upsampling.
Dilated convolution is applied in domains besides vision as well. Good examples are the WaveNet text-to-speech solution and ByteNet linear-time text translation. They both use dilated convolution in order to capture a global view of the input with fewer parameters.
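The exponential-versus-linear claim can be checked with a few lines of arithmetic. The sketch below assumes stride-1 layers with a fixed kernel size and dilation rates that double per layer (1, 2, 4, ...); `receptive_field` is a hypothetical helper name.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (along one axis) of stacked stride-1 convolutions."""
    rf = 1
    for rate in dilation_rates:
        rf += (kernel_size - 1) * rate
    return rf


# Same number of layers and weights per layer, with and without dilation.
plain = [receptive_field(3, [1] * n) for n in range(1, 6)]
dilated = [receptive_field(3, [2 ** i for i in range(n)]) for n in range(1, 6)]

print(plain)    # grows by 2 per layer
print(dilated)  # roughly doubles per layer
```

Both stacks add the same number of weights per layer, so the parameter count grows linearly in depth while the dilated receptive field grows exponentially.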
In short, dilated convolution is a simple but effective idea, and you might consider it in the following cases:
Detection of fine details by processing inputs in higher resolutions.
Broader view of the input to capture more contextual information.
Faster run-time with fewer parameters.
Long, J., Shelhamer, E., & Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. Retrieved from http://arxiv.org/abs/1411.4038v1
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR, 1–14. Retrieved from http://arxiv.org/abs/1412.7062
Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. ICLR, 1–9. Retrieved from http://arxiv.org/abs/1511.07122
Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio, 1–15. Retrieved from http://arxiv.org/abs/1609.03499
Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. van den, Graves, A., & Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. arXiv, 1–11. Retrieved from http://arxiv.org/abs/1610.10099
Machine learning is everywhere and we are amazed by the capabilities of these algorithms. However, they are not perfect and sometimes they behave quite dumbly. For instance, let’s consider an image recognition model. This model achieves really high empirical performance and it works great for normal images. Nevertheless, it might fail when you change some of the pixels of an image, even though this small perturbation might be imperceptible to the human eye. We call such an image an adversarial instance.
There are various methods to generate adversarial instances. One method is to take the derivative of the model outputs with respect to the input values so that we can change instance values to manipulate the model decision. Another approach exploits genetic algorithms to generate manipulative instances which are confidently classified as a known concept (say ‘dog’) but mean nothing to human eyes.
So why are these models so weak against adversarial instances? One plausible explanation is that adversarial instances lie in the low-probability regions of the instance space. Therefore, they look strange to a network which is trained with a limited number of instances from higher-probability regions.
That being said, maybe there is no way to escape from these vexing adversarial instances, especially when they are produced by exploiting weaknesses of a target model with gradient-guided probing. This is an analytic way of searching for a misleading input for that model with (almost) guaranteed success. Therefore, in one way or another, we can find a perturbed input that deceives any model.
Due to that observation, I believe that adversarial instances can be countered by multiple models backing each other. In essence, this is the motivation of this work.
In this work, I would like to share my observations on the strength of ensembles against adversarial instances. This is just a toy example with many shortcomings, but I hope it gives the idea with some empirical evidence.
As a summary, this is what we do here:
Train a baseline MNIST ConvNet.
Create adversarial instances on this model by using cleverhans and save them.
Measure the baseline model performance on the adversarial instances.
Train the same ConvNet architecture including adversarial instances and measure its performance.
Train an ensemble of 10 models of the same ConvNet architecture, measure the ensemble performance, and support the backing argument stated above.
My full code can be seen on GitHub; here I only share the results and observations. You need cleverhans, TensorFlow and Keras for adversarial generation, and PyTorch for ensemble training. (Sorry for the number of libraries, but I wanted to try PyTorch as well after years of tears with Lua.)
One problem with the proposed experiment is that we do not recreate adversarial instances for each model; we use a previously created set. Anyway, I believe the empirical values verify my assumption even in this setting. In addition, I plan to do a more extensive study as future work.
Create adversarial instances.
I start by training a simple ConvNet architecture on the MNIST dataset using the legitimate train and test set splits. This network gives 0.98 test set accuracy after 5 epochs.
For creating adversarial instances, I use the fast gradient sign method, which perturbs images using the derivative of the model outputs with respect to the input values. You can see a bunch of adversarial samples below.
The same network suffers on adversarial instances (as above) created on the legitimate test set. It gives 0.09 accuracy, which is worse than random guessing.
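To show the mechanics of the fast gradient sign method, here is a minimal NumPy sketch. It swaps the ConvNet for a tiny logistic regression with made-up weights so the input gradient can be written by hand; it is not the cleverhans code used for the experiments. In its usual formulation the method follows the sign of the gradient of the loss with respect to the input, which is what the sketch does.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fgsm(x, y, w, b, eps):
    """One FGSM step for logistic regression p = sigmoid(w.x + b).

    The gradient of the cross-entropy loss w.r.t. the input is (p - y) * w.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)


# Hypothetical fixed "trained" weights and one correctly classified input.
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.4, -0.2, 0.1])
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.5)
p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
print(p_clean, p_adv)
```

Every input value moves by exactly `eps`, yet the predicted probability of the true class drops from above 0.5 to below it, so the decision flips.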
Plot adversarial instances.
Then I would like to see the representational power of the trained model on both the normal and the adversarial instances. I do this using the well-known dimensionality reduction technique t-SNE. I first compute the last hidden layer representation of the network per instance and use these values as input to t-SNE, which projects the data onto a 2-D space. Here is the final projection for both types of data.
These projections clearly show that adversarial instances are just random data points to the trained model, and they lie away from the real data points, creating what we call low-probability regions for the trained model. I also trained the same model architecture by dynamically creating adversarial instances at train time, then tested it on the adversarials created previously. This new model yields 0.98 on the normal test set, 0.91 on the previously created adversarial test set and 0.71 on its own dynamically created adversarials.
The above results show that including adversarial instances strengthens the model. This conforms to the low-probability region argument: by providing adversarials, we let the model discover the low-probability regions where adversarial instances lie. Besides, this is not applicable to large-scale problems like ImageNet, since you cannot afford to augment millions of images per iteration. Therefore, assuming it works, ensembling is a more viable alternative, and it is already a common method to increase overall prediction performance.
In this part, I train multiple models in different ensemble settings. First, I train N different models on the same whole training data. Then, I use bootstrapping: I train N different models, each on data randomly sampled from the normal training set. I also observe the effect of N.
The best single model obtains 0.98 accuracy on the legitimate test set. However, the best single model only obtains 0.22 accuracy on the adversarial instances created in the previous part.
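The bootstrapped setting can be sketched as below. I assume sampling with replacement (the usual meaning of bootstrapping), and the subset size of 20000 is an arbitrary illustrative value, not a number from the experiments; only the MNIST training set size of 60000 is a known quantity.

```python
import numpy as np


def bootstrap_indices(n_samples, n_models, sample_size, seed=0):
    """One index array per model, drawn with replacement from range(n_samples)."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_samples, size=sample_size) for _ in range(n_models)]


# 60000 is the MNIST training set size; the other values are illustrative.
subsets = bootstrap_indices(n_samples=60000, n_models=5, sample_size=20000)

for i, idx in enumerate(subsets):
    print(i, idx.shape, len(np.unique(idx)))
```

Each model would then be trained only on `x_train[idx]`, so every member of the ensemble sees a different random view of the data.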
When we ensemble models by averaging scores, we do not see any gain and we are stuck at 0.24 accuracy for both training settings. However, surprisingly, when we perform max ensembling (relying only on the most confident model for each instance), we observe 0.35 for the uniformly trained ensemble and 0.57 for the bootstrapped ensemble with N equal to 50.
Increasing N raises the adversarial performance. It is much more effective on the bootstrapped ensemble. With N=5 we obtain 0.27 for the uniform ensemble and 0.32 for the bootstrapped ensemble. With N=25 we obtain 0.30 and 0.45 respectively.
These values are interesting, especially the difference between the mean and max ensembles. My intuition behind the superiority of maxing is that it covers up the weaknesses of individual models with the most confident one, as I suggested in the first place. In that vein, one further observation is that adversarial performance increases as we use smaller random chunks for each model, up to a certain threshold, with increasing N (number of models in the ensemble). It shows us that bootstrapping enables models to learn some local regions better and some worse, but the worse sides are covered by the more confident model in the ensemble.
As I said before, it is not ideal to use the adversarials previously created by the baseline model in the first part. However, I believe my claim still holds. Assume that we include the baseline model in our best max ensemble above. Its mistakes would still be corrected by the other models. I also tried this (after the comments below) and included the baseline model in our ensemble. The 0.57 accuracy only drops to 0.55. It is still pretty high compared to any other method that does not see adversarials in the training phase.
It is much harder to create adversarials for an ensemble of models with gradient methods. However, genetic algorithms are applicable.
Blind spots of individual models are covered by their peers in the ensemble when we rely on the most confident one.
We observe that when we train a model with dynamically created adversarial instances per iteration, it resolves the adversarials created from the test set. That is, as the model sees examples from these regions, it becomes immune to such adversarials. This supports the argument that low-probability regions carry adversarial instances.
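The difference between the two ensembling rules is easy to see on a toy case. The probabilities below are made up for three hypothetical models and one instance; `mean_ensemble` and `max_ensemble` are illustrative helper names.

```python
import numpy as np


def mean_ensemble(probs):
    """probs: (n_models, n_samples, n_classes). Average scores, then argmax."""
    return probs.mean(axis=0).argmax(axis=1)


def max_ensemble(probs):
    """For each sample, take the class of the single most confident model."""
    return probs.max(axis=0).argmax(axis=1)


# Three hypothetical models, one sample, three classes.
probs = np.array([
    [[0.70, 0.20, 0.10]],  # prefers class 0
    [[0.70, 0.20, 0.10]],  # prefers class 0
    [[0.05, 0.90, 0.05]],  # very confident about class 1
])

mean_pred = mean_ensemble(probs)
max_pred = max_ensemble(probs)
print(mean_pred, max_pred)
```

Averaging lets the two moderately confident models outvote the third, while the max rule follows the single most confident model, which is the behaviour the results above rely on.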
(Before I finish) This is Serious!
Before I finish, I would like to widen the meaning of this post’s heading. Ensemble against adversarial!!
“Adversarial instances” is a peculiar AI topic. It attracted much interest at first, but now it seems forgotten, apart from research targeting GANs, since it does not yield direct profit compared to achieving better accuracy.
Even though this has been the case so far, we need to consider this topic more carefully from now on. As we witness more extensive and more capable AI in many different domains (such as health, law, governance), adversarial instances are likely to cause greater problems, intentionally or by pure chance. This is not a sci-fi scenario I’m drawing here. It is a reality as it is prototyped in . Just replace the simple recognition model in  with an AI ruling in a court of justice.
Therefore, if we believe in a future embracing AI as a great tool to “make the world a better place!”, we need to study this subject extensively before passing a certain AI threshold.
This work overlooks many important aspects, but after all it only aims to share some of my findings from spare-time research. For a next post, I would like to study unsupervised models like Variational Autoencoders and Denoising Autoencoders by applying them to adversarial instances (I already started!). In addition, I plan to work on other methods for creating different types of adversarials.
From this post you should take:
References to adversarial instances.
Good example code waiting for you on GitHub that can be used in many different projects.
Power of ensemble.
Some unproven claims and opinions on the topic.
ANYWAY, I HOPE YOU LIKE IT!
Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep Neural Networks are Easily Fooled. Computer Vision and Pattern Recognition, 2015 IEEE Conference on, 427–436.
Szegedy, C., Zaremba, W., & Sutskever, I. (2013). Intriguing properties of neural networks. arXiv Preprint arXiv: …, 1–10. Retrieved from http://arxiv.org/abs/1312.6199
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2016). Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. arXiv. Retrieved from http://arxiv.org/abs/1602.02697
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. ICLR 2015, 1–11. Retrieved from http://arxiv.org/abs/1412.6572
This paper studies how semantic information is described by higher-level units of a network, and the blind spots of network models against adversarial instances. They illustrate the learned semantics by inferring maximally activating instances per unit. They also interpret the effect of adversarial examples and their generalization across different network architectures and datasets.
Findings might be summarized as follows:
Certain dimensions of each layer reflect different semantics of the data. (This is a well-known fact by now, therefore I skip discussing it further.)
Adversarial instances are general to different models and datasets.
Adversarial instances are more significant to higher layers of the networks.
Auto-Encoders are more resilient to adversarial instances.
Adversarial instances are general to different models and datasets.
They posit that adversarials exploiting a particular network architecture are also hard to classify for others. They illustrate this by creating adversarial instances yielding a 100% error rate on the target network architecture and using them on another network. It is shown that these adversarial instances are still hard for the other network (a network with a 2% error rate degraded to 5%). Of course, the influence is not as strong as on the target architecture (which has a 100% error rate).
Adversarial instances are more significant to higher layers of networks.
As you go to higher layers of the network, the instability induced by adversarial instances increases, as measured by the Lipschitz constant. This is a justifiable observation, given that the higher layers capture more abstract semantics and therefore any perturbation of an input might override the constituted semantics. (For instance, a concept of “dog head” might be perturbed to something random.)
Auto-Encoders are more resilient to adversarial instances.
AE is an unsupervised algorithm and it is different from the other models used in the paper, since it learns the implicit distribution of the training data instead of mere discriminative features. Thus, it is expected to be more tolerant to adversarial instances. Table 2 shows that the AE model needs stronger perturbations to reach 100% classification error with generated adversarials.
One intriguing observation is that a shallow model with no hidden units is nevertheless more robust to adversarial instances created from the deeper models. It questions the claim of generalization of adversarial instances. I believe that if the term generality is supposed to hold, then a higher degree of susceptibility ought to be observed in this example (and in others too).
I am also happy to see that the unsupervised method is more robust to adversarials, as expected, since I believe general AI is only possible with unsupervised learning, which learns the space of the data instead of memorizing things. This is also what I plan to examine after this paper, to see how new tools like Variational Autoencoders behave against adversarial instances.
I believe that it is really hard to fight adversarial instances, especially the ones created by counter-optimization against a particular supervised model. A supervised model always has flaws to be exploited in this manner, since it memorizes things [ref], and when you go beyond its scope (especially with adversarial instances, which are of low probability), it makes natural mistakes. Besides, it is known that a neural network converges to a local minimum due to its non-convex nature. Therefore, by definition, it has such weaknesses.
Adversarial instances are, in a practical sense, not a big deal right now. However, this is likely to become a far more important topic as we journey towards more advanced AI. Right now, an ML model only makes tolerable mistakes. However, consider the advanced systems awaiting us in the near future, used for matters of great importance such as deciding who is guilty or who has cancer. Then this becomes a question of far greater importance.
Suppose you have a problem that you would like to tackle with machine learning and use the resulting system in a real-life project. I would like to share my simple pathway for this purpose, in order to provide a basic guide for beginners and keep these things as a reminder to myself. These rules are tricky since, even though they are simple, it is not trivial to remember them all and suppress your instinct to see a running model as soon as possible.
When we confront any problem, we initially have numerous learning algorithms, many bytes or gigabytes of data, and already established knowledge for applying some of these models to particular problems. With all this in mind, we follow a three-stage procedure:
Define a goal based on a metric
Build the system
Refine the system with more data
Let’s break these steps down in more detail:
DEFINE GOAL & METRIC
Human Level vs Acceptable
First, we need to decide what quality we expect from the system’s performance. We might expect human-level performance if it is a medical diagnosis system, or we might settle for a lower level if it is a simple mobile application. This decision defines the cost (time, money and engineering) of the system. As we increase our expectations, we also need to invest more.
What Metric to Measure
So don’t go dinosaur hunting in your flip-flops. Depending on the problem at hand, define the right metric to gauge system performance. It should match the nature of the problem. Possible alternatives are:
Accuracy – object classification
Recall – medical diagnosis
Amount of error – rental price prediction for houses
F-score – document classification
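To see why the choice matters, here is a small plain-Python sketch that computes the metrics listed above on made-up labels. The helper names and the toy data are mine; in practice a library such as scikit-learn provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_absolute_error(y_true, y_pred):
    """Average size of the error, e.g. for price prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up diagnosis example: 3 sick patients out of 10, only 1 detected.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(accuracy(y_true, y_pred), recall(y_true, y_pred), f_score(y_true, y_pred))
print(mean_absolute_error([1000, 1500], [1100, 1400]))
```

On this toy data the accuracy looks acceptable (0.7) while recall shows that two of the three positive cases were missed, which is exactly the kind of mismatch a well-chosen metric exposes.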
Defining the right metric creates a huge quality difference. It requires that you understand the user (or customer) well and find the matching criterion which mimics your user’s selection process well in the artificial environment in which you develop your solution.
BUILDING THE SYSTEM
Create a baseline ASAP
Do not try to devise the time machine without a clock. First build a minimum viable system with any tool and algorithm that is easy to use and implement. Define this as the baseline. A baseline is useful to show what your gain is and whether it is significant or just random.
After you are done with the baseline system, you can start to add to it. Here, proceeding incrementally is an efficient strategy which makes things easier to follow. It is then also easier to roll things back if something is not working as expected.
Do not waste time with state-of-the-art, space-age techniques. Let Occam speak. Only go for more advanced methods if the data demands it.
For instance, it is not always the right choice to apply the ImageNet-winning Inception network directly to your problem. Define your model structure based on observations of your data. In general, if there is noise and the data is easily separable, then use shallower models. As noise decreases and structure increases in the data, go for deeper and wider models.
The distinction between deeper and wider models is this: deeper models are better at capturing high-level abstractions that are important for differentiating clearly different classes (car vs. horse), and wider models are better for fine-grained problems where the classes are close to each other and only slight details differentiate one from another (breeds of cats).
Kind of Model
Based on your problem, there are better-suited subsets of ML models. These models might be used in place of each other, but it is always best to keep the nature of the model and the nature of the problem aligned.
Raw data –> Fully connected network (MLP)
Spatial data (Image) –> Convolutional network
Temporal, sequential data –> Recurrent networks (LSTM, RNN, GRU)
REFINE with DATA
Assume that you have finalized your system with good success and deployed it. It is not the end yet. There is still work to be done.
Don’t Believe Numbers
Until that point, you always measure success based on metric values in a controlled environment. However, these values might not be indicative of real life. Data might change or your users might change. Thus, always check the system performance live after the initial deployment. Do A/B tests, check your metric values on real-time data, and validate your hypotheses with real values.
Update with New Data
If you are able to obtain more data over time, always use it to update and fine-tune your model. It is a rule of thumb that more data increases performance. Do not skip this, since you might even achieve unimaginable results as you update the system with more and more data. This is also the strength of big ML-driven companies like Google. They are really good at using live data to enhance their products.
In this post there are many things I skipped, such as the details of training a model, finding its defects and iterating to increase performance. You might like to see another post of mine for such details.
Selfies are everywhere. With different fun masks, poses and filters, it goes crazy. When we come across any of these selfies, we automatically assign an intuitive score to the quality and beauty of the selfie. However, it is not really possible to describe what makes a beautiful selfie. There are some obvious attributes, but they are not fully specified.
With the folks at 8bit.ai, we decided to develop a system which analyzes selfie images and scores them according to their quality and beauty. The idea was to see whether it is possible to mimic that bizarre perceptual understanding of humans with the recent advancements of AI. And if it is, then let’s make a mobile application and let people use it for whatever purpose. Spoiler alert! We have already developed the Selfai app, available on iOS and Android, and we have an Instagram bot, @selfai_robot. You can check them out before reading.
After that somewhat self-promotional intro, let’s come to the essence. In this post, I would like to talk about what I’ve done in this fun project from the research point of view. It involves a novel method which is also applicable to similar fine-grained image recognition problems beyond this particular one.
I call the problem fine-grained since what differentiates the score of a selfie lies in very fine details. These are hard to capture compared to traditional object categorization problems, even with simple deep learning models.
We would like to model ‘human eye evaluation of a selfie image’ with a computer. Here, we do not define what beauty is, which is a very vague term by itself, but let the model internalize the notion from the data. The data is labeled by human annotators on an internally developed crowd-sourced website.
In terms of research, this is a peculiar problem where traditional CNN approaches fail due to the following reasons:
Fine-grain attributes are the factors defining one image better or worse than another.
Selfie images exhibit a vast amount of variation due to different applied filters, edits, poses and lighting.
Scoring is a different practice than categorization and it is not a well-studied problem compared to categorization.
Scarcity of annotated data yields learning in a small-data regime.
This is a problem already targeted by different works. HowHot.io is one of the well-known examples, using a deep learning back-end powered by a large amount of data from a dating application. They use the application statistics as the annotation. Our solution differs strongly since we only use in-house data, which is very small compared to what they have. Thus, feeding data into a well-known CNN architecture simply does not work in our setting.
There is also a relevant blog post by A. Karpathy, where he crawled Instagram for millions of images and used “likes” as annotation. He uses a simple CNN. He states that the model is not that good, but it still gives an intuition about what a good selfie is. Again, we agree with A. Karpathy that ad-hoc CNN solutions are not enough for decent results.
There are other research efforts suggesting different CNN architectures or ratio-based beauty judgments; however, they are limited by pose constraints or smooth backgrounds. In our setting, an image can be uploaded from any scene with an applied filter or mask.
We solve this problem in 3 steps. First, we pre-train the network with a Siamese layer while enlarging the model incrementally with Net2Net. Then, just before fine-tuning, we use the Net2Net operator once more to double the model size. Finally, we fine-tune the model with Huber-loss-based regression for scoring.
The Siamese network architecture is a way of learning that embeds images into a lower-dimensional space based on similarity computed from features learned by a feature network. The feature network is the architecture we intend to fine-tune in this setting. Given two images, we feed them into the feature network and compute the corresponding feature vectors. The final layer computes the pair-wise distance between the computed features, and the final loss layer considers whether these two images are from the same class (label 1) or not (label -1).
Suppose G_w() is the function computed by the feature network and X is the raw image pixels. The subscripts of X denote different images. Based on this parametrization, the final layer computes the distance below (L1 norm).
E_w = ||G_w(X_1) - G_w(X_2)||_1
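On top of this, any suitable loss function might be used. Many different alternatives have been proposed lately. We choose the Hinge Embedding Loss, which in its standard form (as in Torch's HingeEmbeddingCriterion) is defined as,
L(E_w, y) = E_w if y = 1, and max(0, m - E_w) if y = -1
where y is the pair label and m is the margin.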
In this framework, the Siamese layer pushes the network to learn features that are common within a class and discriminative across classes. That being said, thanks to the pair-wise consideration of examples, we expect to learn powerful features capturing finer details compared to simple supervised learning. These features provide a good initialization for the later fine-tuning stage, compared to simple random or ImageNet initialization.
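To make the Siamese objective concrete, here is a minimal NumPy sketch of the L1 pair distance and a hinge embedding loss on top of it. It assumes the standard hinge embedding form (the distance itself for label 1, max(0, margin - distance) for label -1). The function names, the margin value and the toy feature vectors are made up for illustration and are not taken from our Torch code.

```python
import numpy as np

def l1_distance(f1, f2):
    """E_w = ||G_w(X_1) - G_w(X_2)||_1 for a batch of feature pairs."""
    return np.abs(f1 - f2).sum(axis=1)

def hinge_embedding_loss(dist, y, margin=1.0):
    """Standard hinge embedding loss (assumed form):
    dist for similar pairs (y == 1), max(0, margin - dist) for dissimilar pairs (y == -1)."""
    per_pair = np.where(y == 1, dist, np.maximum(0.0, margin - dist))
    return per_pair.mean()

# Toy vectors standing in for G_w(X_1) and G_w(X_2) (hypothetical values).
f1 = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
f2 = np.array([[0.1, 0.2], [0.2, 0.1], [2.0, 1.0]])
y = np.array([1, -1, -1])  # 1: same class, -1: different classes

dist = l1_distance(f1, f2)
loss = hinge_embedding_loss(dist, y, margin=1.0)
```

The first pair is penalized for any distance at all, the second for being closer than the margin although the classes differ, and the third contributes nothing since it is already far enough apart.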
Architecture update by Net2Net
Net2Net proposes two different operators to make networks deeper and wider while keeping the model activations the same. Hence, it enables training a network incrementally, from smaller and shallower to wider and deeper architectures. This accelerates training, lowers computational requirements and possibly results in better representations.
We use Net2Net to reduce the training time on our modest computing facility and to benefit from Siamese training without any architectural deficit. We apply the Net2Net operators each time training stalls during Siamese training. At the end of the Siamese training, we apply the Net2Net wider operation once more to double the size and increase the model's capacity to learn richer representations.
The wider operation adds more units to a layer by copying weights from the old units and normalizes the next layer's weights by the cloning factor of each unit, in order to keep the propagated activation the same. The deeper operation adds an identity layer between successive layers so that, again, the propagated activation stays the same.
One subtle difference in our use of Net2Net is that we apply zeroing noise to the cloned weights in the wider operation. It basically breaks the symmetry and forces each unit to learn similar but different representations.
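The wider operation is easier to follow in code. Below is a minimal NumPy sketch on a tiny fully connected network with a ReLU hidden layer. The function names and shapes are hypothetical and this is not our Torch implementation. It leaves out the symmetry-breaking noise on purpose, so the function-preserving property can be checked exactly.

```python
import numpy as np

def net2wider(W1, b1, W2, new_width, rng):
    """Widen a hidden layer from W1.shape[1] units to new_width units.

    W1: (inputs, hidden) weights into the layer, b1: (hidden,) biases,
    W2: (hidden, outputs) weights leaving the layer.
    """
    hidden = W1.shape[1]
    # Every extra unit is a clone of a randomly chosen old unit.
    extra = rng.integers(0, hidden, size=new_width - hidden)
    mapping = np.concatenate([np.arange(hidden), extra])
    # Cloning factor: how many copies of each old unit exist after widening.
    counts = np.bincount(mapping, minlength=hidden)
    W1_new = W1[:, mapping]
    b1_new = b1[mapping]
    # Divide outgoing weights by the cloning factor so the next layer receives the same sum.
    W2_new = W2[mapping, :] / counts[mapping][:, None]
    return W1_new, b1_new, W2_new

def forward(x, W1, b1, W2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2  # ReLU hidden layer, linear output

rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(4, 3)), rng.normal(size=3), rng.normal(size=(3, 2))
x = rng.normal(size=(5, 4))

W1w, b1w, W2w = net2wider(W1, b1, W2, new_width=6, rng=rng)
same_output = np.allclose(forward(x, W1, b1, W2), forward(x, W1w, b1w, W2w))
```

Here the hidden layer doubles from 3 to 6 units while the network output stays unchanged. Any noise added to the cloned weights afterwards trades a little of this exactness for broken symmetry between the clones.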
Sidenote: I studied this exact method in parallel to this paper at Qualcomm Research when I was participating in the ImageNet challenge. However, I could not find time to publish before Net2Net. Sad.
Fine-tuning is performed with the Huber loss on top of the network that was used as the feature network in the Siamese stage. Huber loss is the choice due to its resilience to outlier instances. Outliers (mislabeled or corrupted instances) are extremely harmful in fine-grained problems, especially for small-scale data sets. Hence, it is important for us to reduce the effect of wrongly scored instances.
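As a small illustration of why the Huber loss is gentler on outliers, here is a NumPy sketch comparing it with a squared loss on toy scores. The threshold delta=1.0 and all numbers are assumptions for the example, not values from our training setup.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    err = np.abs(pred - target)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear).mean()

scores = np.array([5.0, 6.0, 4.0, 5.0])   # toy annotator scores
preds = np.array([5.5, 6.0, 4.5, 5.0])    # toy model predictions

clean = huber_loss(preds, scores)

noisy_scores = scores.copy()
noisy_scores[3] = 0.0                      # one wrongly scored instance
huber_noisy = huber_loss(preds, noisy_scores)
mse_noisy = 0.5 * ((preds - noisy_scores) ** 2).mean()
```

With a single mislabeled instance, the squared loss grows with the square of the error, while the Huber loss grows only linearly, so the bad label pulls the gradients much less.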
As discussed above, before fine-tuning we double the width (number of units in each layer) of the network. This increases the representation power of the network, which seems important for fine-grained problems.
Data Collection and Annotation
For this mission, we collected ~100,000 images from the web, pruned the irrelevant or low-quality images, then annotated the remaining ones on a crowd-sourced website. Each image is scored from 0 to 9. Eventually, we have 30,000 annotated images, each scored at least twice by different annotators.
The understanding of beauty varies among cultures, and we assume that the variety of annotators minimizes any cultural bias.
Annotated images are processed by a face detection and alignment procedure so that faces are centered and aligned by the eyes.
For all the model training, we use the Torch7 framework, and almost all of the training code is released on Github. In this repository, you can find different architectures on different code branches.
Fine-tuning leverages a data sampling strategy that alleviates the effect of data imbalance. Our data set has a Gaussian-like distribution over the classes, in which mid-classes have more instances than the fringes. To alleviate this, we first pick a random class, then select a random image belonging to that class. That gives an equal chance to each class to be selected.
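The class-then-image sampling described above can be sketched in a few lines of Python. The helper names and the toy label distribution are hypothetical; the point is only that every class ends up being drawn about equally often, however many images it has.

```python
import random
from collections import defaultdict

def build_class_index(labels):
    """Map each class (score bucket) to the indices of its images."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    return dict(by_class)

def sample_balanced(by_class, rng):
    """Pick a class uniformly at random, then an image of that class uniformly at random."""
    cls = rng.choice(sorted(by_class))
    return cls, rng.choice(by_class[cls])

# Toy imbalanced labels: many mid scores, few fringe scores.
labels = [4] * 50 + [5] * 40 + [0] * 5 + [9] * 5
by_class = build_class_index(labels)

rng = random.Random(0)
draws = [sample_balanced(by_class, rng) for _ in range(4000)]
class_counts = {c: sum(1 for cls, _ in draws if cls == c) for c in by_class}
```

The price is that images from rare classes are repeated much more often within an epoch, which is one more reason to use data augmentation.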
We apply rotation, random scaling, color noise and random horizontal flips for data augmentation.
We do not use Batch Normalization (BN) layers since they add considerable computational cost and, in our experiments, we obtained far worse performance with them. We believe this is due to the fine-detailed nature of the problem: BN layers reduce the representational power of the network through the implicit noise they apply.
ELU activation is used for all our network architectures since, confirming the claim of Clevert et al., it accelerates the training of a network without BN layers.
We tried many different architectures, but a simple and memory-efficient model (Tiny Darknet) was enough to obtain comparable performance in shorter training time. Below, I share Torch code for the model definition;
In this section, we discuss the contributions of the individual bits and pieces of the proposed method. For any numerical comparison, I show the correlation between the model predictions and the annotators' scores on a validation set.
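For readers who want to reproduce this kind of comparison, a correlation value can be computed as in the sketch below. It assumes the Pearson correlation coefficient, which is an assumption made for the example; the scores are toy values, not our validation data.

```python
import numpy as np

def pearson(pred, target):
    """Pearson correlation between predicted and annotated scores."""
    return float(np.corrcoef(pred, target)[0, 1])

annotator_scores = np.array([2.0, 4.0, 5.0, 7.0, 8.0])   # toy validation scores

# A prediction that is any increasing linear function of the scores correlates perfectly.
perfect = pearson(2.0 * annotator_scores + 1.0, annotator_scores)
# A prediction that ranks the images in exactly the opposite way.
inverted = pearson(-annotator_scores, annotator_scores)
```

Note that such a measure rewards getting the relative ordering and spacing of scores right and is insensitive to a constant offset or scale in the predictions.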
Effect of Pre-Training
Pre-training with the Siamese loss has a crucial effect. The initial representation learned by Siamese training is a very effective initialization for the final model. Without pre-training, many of our training runs stall quickly or do not reduce the loss at all.
Correlation values with different settings, higher is better;
with pre-training : 0.82
without pre-training : 0.68
with ImageNet: 0.73
Effect of Net2Net
The most important aspect of Net2Net is that it allows training incrementally, in a faster manner. It also reduces the engineering effort spent on your model architecture, since you can validate a smaller version of your model rapidly before training the real one.
In our experiments, we observed that Net2Net provides a good speed-up. It also increases the final model performance slightly.
Correlation values with different settings;
pre-training + net2net : 0.84
with pre-training : 0.82
without pre-training : 0.68
with ImageNet (VGG) : 0.73
Training times with different settings;
pre-training + net2net : 5 hours
with pre-training : 8 hours
without pre-training : 13 hours
with ImageNet (VGG) : 3 hours
We can see the performance and time improvement above. Maybe 3 hours does not seem crucial, but think about replicating the same training again and again to find the best possible setting. In such a case, it saves a lot.
Although the proposed method yields a considerable performance gain, more data would, confirming the common notion, increase the performance much further. It can be observed in the learning curve below that our model learns the training data very well, but the validation loss stalls quickly. Thus, we need much more coverage from the training data in order to generalize better on the validation set.
In this work, we only consider simple and efficient model architectures. However, with more resources, more complex network architectures might be preferred, and that might result in additional gains.
We do not separate images of men and women since we believe that the model is supposed to learn gender implicitly and score accordingly. This has not been tested yet, so such grouping might well increase the performance.
Below we see a simple occlusion analysis of our network indicating the model’s attention while scoring. This is done by occluding part of the image in a sliding-window fashion and computing the absolute prediction change relative to the unoccluded image.
The figures show that the model mainly focuses on the face, and specifically the eyes, nose and lips, for high-scored images, whereas attention is more scattered for low and medium scores.
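The occlusion analysis itself is simple enough to sketch in NumPy. The window size, stride, fill value and the toy scoring function below are all assumptions for illustration; in practice score_fn would be a forward pass of the trained network.

```python
import numpy as np

def occlusion_map(image, score_fn, window=8, stride=8, fill=0.0):
    """Slide an occluding patch over the image and record |score change| per position."""
    h, w = image.shape[:2]
    base = score_fn(image)
    rows = (h - window) // stride + 1
    cols = (w - window) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[top:top + window, left:left + window] = fill
            heat[i, j] = abs(score_fn(occluded) - base)
    return heat

def toy_score(img):
    """Hypothetical stand-in for the model: only looks at the central 16x16 region."""
    return float(img[8:24, 8:24].mean())

image = np.ones((32, 32))
heat = occlusion_map(image, toy_score)
```

For this toy scorer, the heat map is non-zero only for the four central windows, which is exactly the region the scorer depends on. With a real model, the same map shows which parts of the selfie drive the score.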
Below, we have random top- and low-scored selfies from the validation set. The results are not perfect, but the predictions still agree with our own impression of these images.
Here, we demonstrate the ability of deep learning models, CNNs in particular. The results are not perfect but still make sense and amaze me. It is very intriguing how a couple of matrix multiplications is able to capture what is beautiful and what is not.
This work led to the Selfai mobile application; you might like to give it a try for fun (if you did not before reading this). For instance, I stopped growing my facial hair after I saw a huge boost in my score. Thus it might be used as a smart mirror as well :). There is also the Instagram account where the selfai bot scores images tagged #selfai_robot or sent by direct message.
Besides all this, keep in mind that this is just for fun, without any bad intention. It was sparked by curiosity and resulted in these applications.
Finally, please share your thoughts, comments and more. It is good to see what people think about your work.
Disclaimer: This post is just a draft of my work to share this interesting problem and our solution with the community. This work might become a paper with some more legitimate future work.
Bromley, J., Guyon, I., LeCun, Y., Sackinger, E., & Shah, R. (1993). Signature verification using a siamese time delay neural network. In J. Cowan and G. Tesauro (Eds.), Advances in Neural Information Processing Systems.
Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a Similarity Metric Discriminatively, with Application to Face Verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 1, 539–546. http://doi.org/10.1109/CVPR.2005.202
Chen, T., Goodfellow, I., & Shlens, J. (2015). Net2Net: Accelerating Learning via Knowledge Transfer. arXiv Preprint, 1–10. Retrieved from http://arxiv.org/abs/1511.05641
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In CVPR.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. CoRR, abs/1512.00567.
Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations, 1–14.
Huang, G., Liu, Z., & Weinberger, K. Q. (2016). Densely Connected Convolutional Networks. arXiv Preprint, 1–12. Retrieved from http://arxiv.org/abs/1608.06993
Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv Preprint, 1–13. Retrieved from http://arxiv.org/abs/1511.07289
A continuous distribution on the simplex that approximates discrete (one-hot) vectors and is differentiable with respect to its parameters via the reparametrization trick used in VAEs.
It is used for semi-supervised learning.
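The description matches what is usually called Gumbel-Softmax (or Concrete) sampling, so here is a minimal NumPy sketch under that assumption. The function name, the temperature and the class probabilities are made up for the example. NumPy does not compute gradients, so the sketch only shows the sampling side of the reparametrization: the randomness enters as additive noise, and the sample is a smooth function of the logits.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw relaxed one-hot samples: softmax((logits + Gumbel noise) / temperature)."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                 # standard Gumbel(0, 1) noise
    z = (logits + gumbel) / temperature
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.3, 0.6])
logits = np.tile(np.log(probs), (20000, 1))

samples = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
# The argmax of each relaxed sample follows the original categorical distribution.
freq = np.bincount(samples.argmax(axis=1), minlength=3) / len(samples)
```

Each row of samples lies on the simplex; lowering the temperature pushes the rows closer to one-hot vectors, while raising it makes them smoother.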
DEEP UNSUPERVISED LEARNING WITH SPATIAL CONTRASTING
Learning useful unsupervised image representations by using a triplet loss on image patches. The triplet is defined by two patches from the same image as the anchor and the positive instances, and a patch from a different image as the negative. It gives a good boost on CIFAR-10 when used as a pretraining method.
How would you apply it to a real, large-scale classification problem?
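To picture the patch triplets, here is a small NumPy sketch. It uses a generic margin-based triplet loss, which may differ from the exact loss in the paper, and a hand-made embed function in place of the CNN. Both are assumptions for illustration only.

```python
import numpy as np

def sample_patch(image, size, rng):
    """Crop a random size x size patch from an image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]

def embed(patch):
    """Hypothetical stand-in for the network: mean and standard deviation of the patch."""
    return np.array([patch.mean(), patch.std()])

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin triplet loss on squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
image_a = rng.normal(0.0, 1.0, size=(32, 32))   # toy image
image_b = rng.normal(5.0, 1.0, size=(32, 32))   # a visibly different toy image

anchor = embed(sample_patch(image_a, 8, rng))
positive = embed(sample_patch(image_a, 8, rng))  # same image as the anchor
negative = embed(sample_patch(image_b, 8, rng))  # different image

loss = triplet_loss(anchor, positive, negative)
swapped = triplet_loss(anchor, negative, positive)
```

With a good embedding, patches from the same image are already closer than patches from different images and the loss is zero; swapping the positive and the negative shows the penalty the network would be trained to remove.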
UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION
This paper makes the following claim. Traditional machine learning frameworks (VC dimension, Rademacher complexity etc.) that try to explain how learning occurs do not explain the success of deep learning models well, and we need more understanding from different perspectives.
They rely on the following empirical observations;
Deep networks are able to fit any kind of training data, even white-noise instances with random labels. This implies that neural networks have very good brute-force memorization capacity.
Explicit regularization techniques (dropout, weight decay, batch norm) improve model generalization, but this does not mean that the same network generalizes poorly without them. For instance, an Inception network trained without any explicit technique achieves 80.38% top-5 accuracy, whereas the same network achieved 83.6% on the ImageNet challenge with explicit techniques.
A 2-layer network with 2n+d parameters can represent any function f on n samples in d dimensions. They provide a proof of this statement in the appendix. On the empirical side, they show the performance of a 2-layer Multi Layer Perceptron on the MNIST and CIFAR-10 datasets.
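The last observation can be checked numerically. Below is a NumPy sketch of one construction with exactly 2n+d parameters: a shared projection vector a (d values), n biases and n output weights. This is my reading of the idea and the paper's proof may differ in details; the function names and the toy data are hypothetical. It assumes the projected samples are all distinct, which holds almost surely for a random projection.

```python
import numpy as np

def fit_two_layer_relu(X, y, seed=0):
    """Fit c(x) = sum_j w_j * max(<a, x> - b_j, 0) exactly to n samples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    a = rng.normal(size=d)            # random projection, shared by all hidden units
    z = X @ a
    order = np.argsort(z)
    z_sorted = z[order]
    # Interleave the biases with the sorted projections: b_1 < z_1 < b_2 < z_2 < ...
    b = np.empty(n)
    b[0] = z_sorted[0] - 1.0
    b[1:] = (z_sorted[:-1] + z_sorted[1:]) / 2.0
    # A[i, j] = max(z_i - b_j, 0) is lower triangular with a positive diagonal,
    # so the linear system A w = y has a unique solution.
    A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, y[order])
    return a, b, w

def predict(X, a, b, w):
    return np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ w

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))                    # n = 20 samples in d = 5 dimensions
y = rng.integers(0, 2, size=20).astype(float)   # completely random labels

a, b, w = fit_two_layer_relu(X, y)
preds = predict(X, a, b, w)
n_params = a.size + b.size + w.size
```

The network reproduces the random labels exactly, which is pure memorization: nothing in the construction says anything about how it behaves on new samples.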
The above observations raise the following questions and conflicts;
The traditional notion of learning suggests stronger regularization as we use more powerful models. However, a large enough network is able to memorize any kind of data, even if this data is just random noise. Also, without any further explicit regularization techniques, these models are able to generalize well on natural datasets. It shows us that, contrary to general belief, brute-force memorization is still a good learning method, yielding reasonable generalization performance at test time.
Classical approaches are poorly suited to explain the success of neural networks, and more investigation is imperative in order to understand what is really going on from a theoretical view.
The generalization power of the networks is not really defined by the explicit techniques; instead, implicit factors like the learning method or the model architecture seem more effective.
The explanation of generalization needs to be redefined in order to resolve the conflicts depicted above.
My take: These large models are able to learn any function (and large does not mean deep anymore), and if there is any kind of information match between the training data and the test data, they are able to generalize well too. One possible explanation is to think of these models as an ensemble of many millions of smaller models, selected by the zeroing effect of the activation functions. Thus, such a model is able to memorize any function due to its size and implied capacity, but it still generalizes well due to this ensembling effect.