I want to hire some people to help me update my websites more frequently, do the maintenance stuff, and to help edit the podcast so I can produce episodes more frequently.
I outlined my whole plan here on my Patreon Campaign. You’ll see a new page on this site soon acknowledging supporters, and I’ll update you on the progress.
Whether you can give financially, or even if you just share the campaign with your data science friends, you are helping the Becoming a Data Scientist podcast, the learning club, Data Sci Guide, Jobs for New Data Scientists, and all of my websites get off the ground! Thank you!!
In this tutorial, we’ll dive into one of the most powerful aspects of pandas – its grouping and aggregation functionality. With this functionality, it’s dead simple to compute group summary statistics, discover patterns, and slice up your data in various ways.
Since Thanksgiving was just last week, we’ll use a dataset on what Americans typically eat for Thanksgiving dinner as we explore the pandas library. You can download the dataset here. It contains 1058 online survey responses collected by FiveThirtyEight. Each survey respondent was asked questions about what they typically eat for Thanksgiving, along with some demographic questions, like their gender, income, and location. This dataset will allow us to discover regional and income-based patterns in what Americans eat for Thanksgiving dinner. As we explore the data and try to find patterns, we’ll be heavily using the grouping and aggregation functionality of pandas.
This is just a short list of a few books that I have recently discovered online.
Model-Based Machine Learning – Chapters of this book become available as they are being written. It introduces machine learning via case studies instead of just focusing on the algorithms.
Foundations of Data Science – This is a much more academic-focused book which could be used at the undergraduate or graduate level. It covers many of the topics one would expect: machine learning, streaming, clustering and more.
R is a hugely popular language among data scientists and statisticians. One of the difficulties with open-source R is the memory constraint. All the data needs to be loaded into a data.frame. Microsoft solves this problem with the RevoScaleR package of the Microsoft R Server. Just launched this week is an EdX course on Analyzing Big Data with Microsoft R Server.
According to the syllabus:
Upon completion, you will know how to use R for big-data problems.
Full Disclosure: I work at Microsoft, and the course instructor, Seth Mottaghinejad, is one of my colleagues.
Selfies are everywhere. With all the fun masks, poses and filters, the trend has gone crazy. When we come across a selfie, we automatically give it an intuitive score for its quality and beauty. However, it is hard to describe exactly what makes a selfie beautiful. There are some obvious attributes, but they are not fully specified.
With the folks at 8bit.ai, we decided to develop a system that analyzes selfie images and scores them according to their quality and beauty. The idea was to see whether recent advances in AI make it possible to mimic this elusive human perceptual judgment. And if so, why not build a mobile application and let people use it for whatever purpose? Spoiler alert! We have already developed the Selfai app, available on iOS and Android, and we have an Instagram bot, @selfai_robot. You can check them out before reading on.
After that somewhat self-promotional introduction, let’s get to the essence. In this post, I would like to talk about what I did in this fun project from a research point of view. It led to a novel method that is also applicable to similar fine-grained image recognition problems beyond this particular one.
I call the problem fine-grained because what differentiates the score of one selfie from another lies in very small details. Such details are hard to capture, whereas traditional object categorization problems can be handled even with simple deep learning models.
We would like to model the ‘human eye evaluation of a selfie image’ with a computer. Here, we do not define what beauty is, which is a very vague term in itself; instead, we let the model internalize the notion from the data. The data is labeled by human annotators on an internally developed crowd-sourced website.
In terms of research, this is a peculiar problem where traditional CNN approaches fail for the following reasons:
Fine-grained attributes are the factors that make one image better or worse than another.
Selfie images exhibit a vast amount of variation due to different filters, edits, poses and lighting.
Scoring is a different task from categorization, and it is not as well studied.
Scarcity of annotated data forces learning in a small-data regime.
This problem has already been targeted by other works. HowHot.io is one well-known example, using a deep learning back-end powered by a large amount of data from a dating application. They use the application statistics as the annotation. Our solution differs strongly, since we only use in-house data, which is very small compared to what they have. Thus, simply feeding data into a well-known CNN architecture does not work in our setting.
There is also a relevant blog post by A. Karpathy, where he crawled Instagram for millions of images and used “likes” as the annotation. He uses a simple CNN. He states that the model is not that good, but it still gives an intuition about what makes a good selfie. Again, we take A. Karpathy’s word that ad-hoc CNN solutions are not enough for decent results.
There are other research efforts suggesting different CNN architectures or ratio-based beauty measures; however, they are limited to constrained poses or smooth backgrounds. In our setting, an image can be uploaded from any scene with an applied filter or mask.
We solve this problem in three steps. First, we pre-train the network with a Siamese layer while incrementally enlarging the model with Net2Net. Then, just before fine-tuning, we apply the Net2Net operator once more to double the model size. Finally, we fine-tune the model with Huber-loss-based regression for scoring.
A Siamese network architecture is a way of learning to embed images into a lower-dimensional space based on similarity computed with features learned by a feature network. The feature network is the architecture we intend to fine-tune in this setting. Given two images, we feed both into the feature network and compute the corresponding feature vectors. The final layer computes the pair-wise distance between the computed features, and the final loss layer considers whether the two images are from the same class (label 1) or not (label -1).
Suppose G_w() is the function implemented by the feature network and X is the raw image pixels. The subscripts of X denote different images. Based on this parametrization, the final layer computes the distance below (L1 norm).
E_w = ||G_w(X_1) - G_w(X_2)||_1
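On top of this, any suitable loss function might be used. Many different alternatives have been proposed lately. We chose the hinge embedding loss, which in its standard form is defined as:
L(E_w, y) = E_w if y = 1, and max(0, m - E_w) if y = -1
where y is the pair label and m is a margin hyperparameter.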
In this framework, the Siamese layer pushes the network to learn features that are common within a class and discriminative between classes. That being said, thanks to the pair-wise consideration of examples, we expect to learn powerful features that capture finer details than simple supervised learning would. These features provide a better initialization for the later fine-tuning stage than simple random or ImageNet initialization.
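The pair-wise distance and loss described above can be sketched in a few lines of plain Python. This is only an illustration, not the released Torch code: the function names, the toy feature vectors and the margin value of 1.0 are assumptions, and the loss follows the standard hinge embedding definition for labels 1 and -1.

```python
def l1_distance(features_a, features_b):
    """L1 distance E_w between two feature vectors G_w(X_1) and G_w(X_2)."""
    return sum(abs(a - b) for a, b in zip(features_a, features_b))


def hinge_embedding_loss(distance, label, margin=1.0):
    """Standard hinge embedding loss; the margin value is an assumption."""
    if label == 1:
        # Same class: any remaining distance is penalized directly.
        return distance
    if label == -1:
        # Different classes: penalized only while closer than the margin.
        return max(0.0, margin - distance)
    raise ValueError("label must be 1 (same class) or -1 (different class)")


# Toy feature vectors standing in for the feature network outputs.
same_pair = l1_distance([0.2, 0.9], [0.1, 0.7])
different_pair = l1_distance([0.2, 0.9], [0.3, 0.8])
loss_same = hinge_embedding_loss(same_pair, label=1)
loss_different = hinge_embedding_loss(different_pair, label=-1)
```

Minimizing this loss pulls same-class pairs together and pushes different-class pairs apart until they are at least a margin away, which matches the intuition given above.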
Architecture update by Net2Net
Net2Net proposes two different operators that make networks deeper and wider while keeping the model activations the same. Hence, it enables training a network incrementally, from smaller and shallower to wider and deeper architectures. This accelerates training, lowers computational requirements and possibly results in better representations.
We use Net2Net to reduce the training time on our modest computing facility and to benefit from Siamese training without any architectural deficit. We apply the Net2Net operators each time training stalls during Siamese training. At the end of Siamese training, we apply the Net2Net wider operation once more to double the size and increase the model’s capacity to learn more representations.
The wider operation adds more units to a layer by copying weights from the old units, and it normalizes the next layer’s weights by the cloning factor of each unit in order to keep the propagated activation the same. The deeper operation adds an identity layer between successive layers so that, again, the propagated activation stays the same.
One subtle difference in our use of Net2Net is that we apply zeroing noise to the cloned weights in the wider operation. This breaks the symmetry and forces each unit to learn similar but different representations.
Sidenote: I studied this exact method in parallel to this paper at Qualcomm Research while I was participating in the ImageNet challenge. However, I could not find time to publish before Net2Net. Sad.
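To make the wider operation concrete, here is a minimal sketch on a toy fully connected network with one hidden layer. It is written in plain Python rather than Torch, all names and weight values are made up for illustration, and the zeroing noise mentioned above is deliberately left out so that the output can be checked to stay unchanged.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def matvec(weights, inputs):
    """Multiply a row-major weight matrix by an input vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def forward(w_hidden, w_out, inputs):
    return matvec(w_out, relu(matvec(w_hidden, inputs)))


def net2wider(w_hidden, w_out, new_width, rng):
    """Widen the hidden layer to new_width units without changing the output."""
    old_width = len(w_hidden)
    # Each new unit copies an old one: keep the originals, then clone random units.
    source = list(range(old_width))
    source += [rng.randrange(old_width) for _ in range(new_width - old_width)]
    # Cloning factor: how many new units share each old unit's weights.
    copies = [source.count(i) for i in range(old_width)]
    wider_hidden = [list(w_hidden[s]) for s in source]
    # Divide outgoing weights by the cloning factor so the summed contribution is unchanged.
    wider_out = [[row[s] / copies[s] for s in source] for row in w_out]
    return wider_hidden, wider_out


rng = random.Random(0)
w_hidden = [[0.5, -1.0], [1.5, 2.0], [-0.3, 0.8]]  # 3 hidden units, 2 inputs
w_out = [[1.0, -2.0, 0.5]]                         # 1 output unit
wider_hidden, wider_out = net2wider(w_hidden, w_out, new_width=6, rng=rng)

x = [0.7, -0.2]
before = forward(w_hidden, w_out, x)
after = forward(wider_hidden, wider_out, x)
```

Here `before` and `after` agree up to floating-point error, which is the function-preserving property that lets training continue from the smaller model.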
Fine-tuning is performed with the Huber loss on top of the network that was used as the feature network in the Siamese stage. We chose the Huber loss for its resilience to outlier instances. Outliers (mislabeled or corrupted instances) are extremely harmful in fine-grained problems, especially for small-scale data sets. Hence, it is important for us to limit the effect of wrongly scored instances.
As discussed above, before fine-tuning we double the width (the number of units in each layer) of the network. This increases the representational power of the network, which seems important for fine-grained problems.
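The resilience to outliers can be seen in a small sketch of the standard Huber loss. The threshold `delta=1.0` and the example scores are assumptions for illustration; the post does not state which value was used.

```python
def huber_loss(prediction, target, delta=1.0):
    """Quadratic for small errors, linear for large ones (delta is assumed)."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


def squared_loss(prediction, target):
    return 0.5 * (prediction - target) ** 2


# A well-labeled image versus a badly mislabeled one on the 0 to 9 scale.
small_error = huber_loss(5.0, 4.5)
outlier_huber = huber_loss(9.0, 2.0)
outlier_squared = squared_loss(9.0, 2.0)
```

For the mislabeled example the Huber loss grows linearly with the error while the squared loss grows quadratically, so a single wrong annotation pulls the model far less.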
Data Collection and Annotation
For this mission, we collected ~100,000 images from the web, pruned the irrelevant or low-quality images, then annotated the remaining ones on a crowd-sourced website. Each image is scored from 0 to 9. In the end, we have 30,000 annotated images, each scored at least twice by different annotators.
The understanding of beauty varies among cultures, and we assume that the variety of annotators minimizes any cultural bias.
Annotated images are processed by a face detection and alignment procedure so that faces are centered and aligned by the eyes.
For all model training, we use the Torch7 framework, and almost all of the training code is released on GitHub. In the repository, you will find different architectures on different code branches.
Fine-tuning uses a data sampling strategy that alleviates the effect of data imbalance. Our data set has a Gaussian-like distribution over the classes, in which the middle classes have more instances than the fringes. To alleviate this, we first pick a random class, then select a random image belonging to that class. That gives each class an equal chance of being selected.
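The two-step sampling above can be sketched as follows. The label list is a made-up imbalanced toy set and the helper names are hypothetical; only the "pick a class, then pick an image of that class" logic comes from the description.

```python
import random
from collections import Counter, defaultdict


def build_class_index(labels):
    """Map each class to the list of image indices carrying that label."""
    index = defaultdict(list)
    for image_id, label in enumerate(labels):
        index[label].append(image_id)
    return dict(index)


def sample_balanced(class_index, rng):
    """Pick a class uniformly at random, then an image uniformly within it."""
    label = rng.choice(sorted(class_index))
    return rng.choice(class_index[label])


# Toy imbalanced data: many mid-score images, few at the fringes.
labels = [5] * 90 + [0] * 5 + [9] * 5
class_index = build_class_index(labels)

rng = random.Random(0)
draws = [sample_balanced(class_index, rng) for _ in range(3000)]
counts = Counter(labels[image_id] for image_id in draws)
```

Even though class 5 holds 90% of the images, each of the three classes ends up with roughly a third of the draws. The cost is that images from rare classes are repeated more often, which is one reason augmentation matters here.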
We applied rotation, random scaling, color noise and random horizontal flip for data augmentation.
We do not use Batch Normalization (BN) layers, since they add computational cost and gave far worse performance in our experiments. We believe this is due to the fine-detailed nature of the problem: BN layers lose some of the representational power of the network because of the implicit noise they introduce.
ELU activation is used in all our network architectures since, confirming the claim of Clevert et al., it accelerates the training of a network without BN layers.
We tried many different architectures, but a simple and memory-efficient model (Tiny Darknet) was enough to obtain comparable performance in a shorter training time. Below, I share the Torch code for the model definition:
In this section, we discuss the contributions of the individual pieces of the proposed method. For every numerical comparison, I report the correlation between the model predictions and the annotators’ scores on a validation set.
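As a reference for how such a number can be computed, here is a minimal sketch assuming the metric is the Pearson correlation coefficient (the post does not say which correlation is used). The scores below are invented for illustration.

```python
import math


def pearson_correlation(predictions, targets):
    """Pearson correlation between two equally long lists of scores."""
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    covariance = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predictions))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in targets))
    if spread_p == 0 or spread_t == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / (spread_p * spread_t)


# Hypothetical model predictions and annotator scores on a validation set.
model_scores = [6.1, 3.4, 7.8, 5.0, 2.2]
annotator_scores = [7.0, 3.0, 8.0, 4.0, 3.0]
correlation = pearson_correlation(model_scores, annotator_scores)
```

A value near 1 means the model ranks and spaces images much like the annotators do, while a value near 0 means its scores carry little information about theirs.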
Effect of Pre-Training
Pre-training with the Siamese loss has a crucial effect. The initial representation learned by Siamese training provides a very effective initialization for the final model. Without pre-training, many of our training runs stall very quickly or do not reduce the loss at all.
Correlation values with different settings (higher is better):
with pre-training: 0.82
without pre-training: 0.68
with ImageNet: 0.73
Effect of Net2Net
The most important aspect of Net2Net is that it allows incremental training in a faster manner. It also reduces the engineering effort spent on your model architecture, since you can rapidly validate a smaller version of your model before training the real one.
In our experiments, we observed that Net2Net provides a good speed-up. It also increases the final model performance slightly.
Correlation values with different settings:
pre-training + net2net: 0.84
with pre-training: 0.82
without pre-training: 0.68
with ImageNet (VGG): 0.73
Training times with the same settings:
pre-training + net2net: 5 hours
with pre-training: 8 hours
without pre-training: 13 hours
with ImageNet (VGG): 3 hours
We can see the performance and time improvements above. Saving 3 hours may not seem crucial, but think about repeating the same training again and again to find the best possible setting. In that case, it saves a lot.
Although the proposed method yields a considerable performance gain, more data would increase performance much further, confirming the common notion. The learning curve below shows that our model learns the training data very well but the validation loss stalls quickly. Thus, we need much more coverage from the training data in order to generalize better on the validation set.
In this work, we only consider simple and efficient model architectures. However, with more resources, more complex network architectures might be preferred, and that might result in additional gains.
We do not separate images of men and women, since we believe that the model should learn gender implicitly and score accordingly. We have not experimented with this yet, so such grouping may well increase performance.
Below we see a simple occlusion analysis of our network, indicating where the model attends while scoring. This is done by occluding part of the image in a sliding-window fashion and computing the absolute change in prediction relative to the unoccluded image.
The figures show that the model mainly focuses on the face, specifically the eyes, nose and lips, for high-scoring images, whereas attention is more scattered for low and medium scores.
Below, we have random top- and low-scored selfies from the validation set. The results are not perfect, but the predictions are still in line with our own inclinations toward these images.
Here, we demonstrate the ability of deep learning models, CNNs in particular. The results are not perfect, but they still make sense and amaze me. It is intriguing how a couple of matrix multiplications are able to capture what is beautiful and what is not.
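The occlusion procedure described above can be sketched on a tiny grayscale "image" stored as nested lists. The scoring function here is a stand-in that only looks at the top-left quadrant, and the window size, stride and fill value are assumptions; in practice `score_fn` would be the trained network.

```python
def occlusion_map(image, score_fn, window=2, stride=2, fill=0.0):
    """Absolute score change when each window of the image is blanked out."""
    height, width = len(image), len(image[0])
    base_score = score_fn(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            # Copy the image and blank out one window.
            occluded = [row[:] for row in image]
            for r in range(top, top + window):
                for c in range(left, left + window):
                    occluded[r][c] = fill
            change = abs(score_fn(occluded) - base_score)
            # Record the largest change seen for every covered pixel.
            for r in range(top, top + window):
                for c in range(left, left + window):
                    heat[r][c] = max(heat[r][c], change)
    return heat


def toy_score(image):
    """Stand-in for the model: only the top-left 2x2 region affects the score."""
    return sum(image[r][c] for r in range(2) for c in range(2))


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_score)
```

The resulting map is high exactly where the stand-in scorer looks and zero elsewhere, which is how the real heat maps reveal the regions the model relies on.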
This work led to the Selfai mobile application, which you might like to try for fun (if you did not before reading this). For instance, I stopped growing my facial hair after I saw a huge boost in my score. So it might be used as a smart mirror as well :). There is also the Instagram account, where the Selfai bot scores images tagged #selfai_robot or sent by direct message.
Above all, keep in mind that this is just for fun, without any bad intention. It was sparked by curiosity and resulted in these applications.
Finally, please share your thoughts, comments and more. It is good to see what people think about your work.
Disclaimer: This post is just a draft of my work, meant to share this interesting problem and our solution with the community. This work might become a paper after some more rigorous future work.
J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah. Signature verification using a siamese time delay neural network. J. Cowan and G. Tesauro (eds) Advances in Neural Information Processing Systems, 1993.
Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a Similarity Metric Discriminatively, with Application to Face Verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 1, 539–546. http://doi.org/10.1109/CVPR.2005.202
Chen, T., Goodfellow, I., & Shlens, J. (2015). Net2Net: Accelerating Learning via Knowledge Transfer. arXiv Preprint, 1–10. Retrieved from http://arxiv.org/abs/1511.05641
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations, 1–14.
Huang, G., Liu, Z., & Weinberger, K. Q. (2016). Densely Connected Convolutional Networks. arXiv Preprint, 1–12. Retrieved from http://arxiv.org/abs/1608.06993
Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv Preprint, 1–13. Retrieved from http://arxiv.org/abs/1511.07289
The differences between data scientists, data engineers, and software engineers can get a little confusing at times. So here is a guest post by Jake Stein, CEO at Stitch (formerly RJ Metrics), which aims to clear up some of that confusion based on LinkedIn data.
As data grows, so does the expertise needed to manage it. The past few years have seen an increasing distinction between the key roles tasked with managing data: software engineers, data engineers, and data scientists.
More and more we’re seeing data engineers emerge as a subset within the software engineering discipline, but this is still a relatively new trend. Plenty of software engineers are still tasked with moving and managing data.
Our team has released two reports over the past year, one focused on understanding the data science role, one on data engineering. Both of these reports are based on self-reported LinkedIn data. In this post, I’ll lay out the distinctions between these roles and software engineers, but first, here’s a diagram to show you (in very broad strokes) what we saw in the skills breakdown between these three roles:
A software engineer builds applications and systems. Developers will be involved through all stages of this process from design, to writing code, to testing and review. They are creating the products that create the data. Software engineering is the oldest of these three roles, and has established methodologies and tool sets.
Frontend and backend development
Operating system development
A data engineer builds systems that consolidate, store, and retrieve data from the various applications and systems created by software engineers. Data engineering emerged as a niche skill set within software engineering. 40% of all data engineers previously worked as software engineers, making this the most common career path for data engineers by far.
Advanced data structures
Knowledge of new & emerging tools: Hadoop, Spark, Kafka, Hive, etc.
Building ETL/data pipelines
A data scientist builds analysis on top of data. This may come in the form of a one-off analysis for a team trying to better understand customer behavior, or a machine learning algorithm that is then implemented into the code base by software engineers and data engineers.
Business Intelligence dashboards
Evolving Data Teams
These roles are still evolving. The process of ETL is getting much easier overall as new tools (like Stitch) enter the market, making it easy for software developers to set up and maintain data pipelines. Larger companies are pulling data engineers off the software engineering team entirely in favor of forming a centralized data team where infrastructure and analysis sit together. In some scenarios data scientists are responsible for both data consolidation and analysis.
At this point, there is no single dominant path. But we expect this rapid evolution to continue; after all, data certainly isn’t getting any smaller.
I’ve decided that I want to have Becoming a Data Scientist t-shirts to sell and to give out to podcast guests and contest winners, but I am not a graphic designer, so I need some help! So I’m going to have a t-shirt design contest!
Here are the rules/guidelines for entry:
1. Create a design that prominently says “Becoming a Data Scientist” and can be easily scaled to fit on the front or back of a t-shirt. If you create a design for the back, also create a small “pocket sized” text or design for the front of the shirt. But I don’t have a preference – front or back of shirt designs are both fine!
If it can be incorporated into the design without looking too cluttered, you can also add “Podcast and Learning Club” and/or “@becomingdatasci”, but that is not a requirement.
The design can be just text, text with an image, more abstract, use your imagination! As long as “Becoming a Data Scientist” is clearly readable, your design will be considered. Obviously, vulgar designs will not be considered, and I’ll also remove them (or anything spammy) from the comments on this post.
Please don’t use more than 2 colors in your design itself, as more can be cost-prohibitive. The background color can be a different 3rd color.
You can even use an online shirt design program like CustomInk, as long as the design can be extracted for use at another printing site (I’m not sure what the rules or capabilities of most of those t-shirt design websites are).
2. Please submit 2 files: your design itself (in a format that can be read on multiple platforms, like a PDF), large enough that it can be zoomed in to “life size”, and then also a smaller image of your design as you imagine it on a shirt – choose a shirt color and location for your design and create a little “shirt preview” image I can share with readers for the vote (this one doesn’t have to be zoomable to full size – the largest size I’d post it at is about 500×500). Please let me know if you have any questions or suggestions about this!
You can submit the files by posting a comment below. Don’t put your email in the text of the comment, I’ll be able to see it behind the scenes from the form. Make sure to include your name (it can be just your first name if you want) as you want it shared along with your design if you get selected for the voting round, along with the links to the 2 files and any link you want to point people to – your blog, your portfolio, your twitter account, etc.
UPDATE 12/5/2016: It looks like I made the turnaround time too short (I’d like to have at least 10 designs before the voting), and I know some people may have great ideas but not great graphic design skills, so I’m making 2 changes:
It doesn’t need to be a “print-ready” design. If you sketch it out and your design wins, I’ll get a graphic designer to help turn it into a file to send to the t-shirt printer
The deadline is now extended to the end of the calendar year (see below)
Thank you to those of you who have entered already!!
Here’s how the contest will go:
I’ll accept entries until 11:59pm Saturday, December 31, 2016. Over the next week, depending on how many entries there are, I’ll narrow down the selection to maybe 5-10 choices. I’ll create a blog post with the t-shirt images and names of the designers, with a way to vote on your favorite, and advertise it on @becomingdatasci to get as many votes as possible.
The top 3 vote-winners will win:
A data science related book of their choice up to $60
A feature in a “finalists” blog post with their design, a little blurb about them, and a link to their website
A tweet with their design and a link to their site on my @becomingdatasci twitter account.
From the finalists, I’ll choose my favorite to be printed. The final winner will also get:
2-3 extra t-shirts with their design to give out to friends
Name credited as designer wherever the t-shirt is sold
Additional tweets with their design announced as the winner, with a link to their site, on @becomingdatasci.
A shout-out on the Becoming a Data Scientist podcast
Please let me know if there’s anything I forgot to detail here or if you have any questions! I look forward to seeing the submitted designs!!
Update: I should probably mention that any proceeds from sales of the shirts will go to support the maintenance and creation of more content at this site BecomingADataScientist.com, the podcast, the Data Science Learning Club, DataSciGuide, and my other data science sites and social media accounts. I’ll be posting a Patreon campaign soon to raise money to hire help to keep these sites updated, and money I earn from selling t-shirts will go toward that as well.
A continuous distribution on the simplex that approximates discrete (one-hot) vectors and is differentiable with respect to its parameters via the reparametrization trick used in VAEs.
It is used for semi-supervised learning.
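A minimal sketch of how such a relaxed one-hot sample can be drawn, assuming the note refers to a Gumbel-Softmax style relaxation (the note itself does not name the method). The logits and temperatures are made up, and plain Python computes no gradients here; the point is only that the randomness enters through noise that is independent of the parameters.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    # Clamp the uniform draw away from 0 and 1 so both logarithms are defined.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def relaxed_one_hot(logits, temperature, rng):
    """Point on the simplex; lower temperatures give samples closer to one-hot."""
    perturbed = [(logit + sample_gumbel(rng)) / temperature for logit in logits]
    return softmax(perturbed)


rng = random.Random(0)
logits = [math.log(0.1), math.log(0.6), math.log(0.3)]
soft_sample = relaxed_one_hot(logits, temperature=1.0, rng=rng)
sharp_sample = relaxed_one_hot(logits, temperature=0.1, rng=rng)
```

Because the sample is a deterministic, smooth function of the logits once the noise is fixed, an automatic differentiation framework could backpropagate through it, which is the reparametrization idea borrowed from VAEs.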
DEEP UNSUPERVISED LEARNING WITH SPATIAL CONTRASTING
It learns useful unsupervised image representations by using a triplet loss on image patches. Each triplet consists of two patches from the same image, serving as the anchor and the positive instance, and a patch from a different image, serving as the negative. It gives a good boost on CIFAR-10 when used as a pretraining method.
How would you apply it to a real, large-scale classification problem?
UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION