Paper Notes: Intriguing Properties of Neural Networks


This paper studies two things: how semantic information is represented by the higher-level units of a network, and the blind spots of network models against adversarial instances. The authors illustrate the learned semantics by finding the maximally activating instances for each unit. They also examine the effect of adversarial examples and how well they generalize across network architectures and datasets.

The findings can be summarized as follows:

  1. Certain dimensions of each layer reflect different semantics of the data. (This is well known by now, so I skip discussing it further.)
  2. Adversarial instances are general to different models and datasets.
  3. Adversarial instances are more significant to higher layers of the networks.
  4. Auto-Encoders are more resilient to adversarial instances.

Adversarial instances are general to different models and datasets.

They posit that adversarial instances crafted against one network architecture are also hard to classify for other architectures. To show this, they create adversarial instances that yield a 100% error rate on the target network and feed them to another network. These instances are still hard for the other network (a network with a 2% error rate degrades to 5%). Of course, the effect is much weaker than on the target architecture (which has a 100% error rate).
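
To make the transfer idea concrete, here is a minimal sketch under strong assumptions. It uses two hand-written linear classifiers with made-up weights (`target_w`, `other_w`) instead of deep networks, and a simple sign-of-gradient step instead of the box-constrained L-BFGS optimization used in the paper. It only illustrates how a perturbation crafted against one model can also fool a second, similar model.

```python
def score(w, b, x):
    """Logit of a linear classifier; a positive value means class 1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def perturb_against(w, x, label, eps):
    """Shift every coordinate of x by eps in the direction that lowers the
    margin of the true label under the linear model with weights w.
    For a linear model the gradient of the logit w.r.t. x is simply w."""
    direction = -1 if label == 1 else 1
    return [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]


# Hypothetical models for the same task: similar, but not identical, weights.
target_w, target_b = [1.0, -2.0, 0.5], 0.0
other_w, other_b = [0.8, -1.5, 0.9], 0.1

x, label = [0.5, -0.2, 0.3], 1  # both models classify x correctly as class 1

# Craft the perturbation using only the target model's weights.
x_adv = perturb_against(target_w, x, label, eps=0.4)

clean_target = score(target_w, target_b, x)
clean_other = score(other_w, other_b, x)
adv_target = score(target_w, target_b, x_adv)
adv_other = score(other_w, other_b, x_adv)

print("clean logits:", clean_target, clean_other)
print("adversarial logits:", adv_target, adv_other)
```

In this toy setting both logits become negative after the perturbation, although the second model was never consulted while crafting `x_adv`. Real networks are not linear, so this is an analogy rather than a reproduction of the experiment.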

Adversarial instances are more significant to higher layers of networks.

As you go to higher layers of the network, the instability induced by adversarial instances increases, as measured by the Lipschitz constant of each layer. This is a plausible observation: higher layers capture more abstract semantics, so a perturbation of the input can override the semantics built up so far. (For instance, a concept of “dog head” might be perturbed into something random.)
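
As a rough illustration of how such a bound can be computed, the sketch below assumes a plain fully connected network with 1-Lipschitz activations (ReLU, for example). Under that assumption, the Lipschitz constant of the whole network is at most the product of the spectral norms of the weight matrices. The matrices in `layers` are made up for the example, and the helper names are mine, not the paper's.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, iters=200):
    """Largest singular value of W, estimated by power iteration on W^T W."""
    Wt = transpose(W)
    v = [1.0 + 0.1 * i for i in range(len(W[0]))]  # deterministic start vector
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        n = norm(u)
        if n == 0.0:
            return 0.0
        v = [x / n for x in u]
    return norm(matvec(W, v))


def network_lipschitz_bound(weights):
    """Upper bound on the Lipschitz constant: product of per-layer norms."""
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound


# Two hypothetical 2x2 layers, chosen so the norms are easy to check by hand.
layers = [
    [[3.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [1.0, 0.0]],
]

per_layer = [spectral_norm(W) for W in layers]
bound = network_lipschitz_bound(layers)
print(per_layer, bound)
```

The product is only an upper bound, so a large value does not prove that the network is unstable. It only fails to rule instability out.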

Auto-Encoders are more resilient to adversarial instances.

An AE is an unsupervised model, and it differs from the other models used in the paper in that it learns the implicit distribution of the training data rather than merely discriminative features. It is therefore expected to be more tolerant to adversarial instances. Table 2 shows that the AE model needs stronger perturbations before the generated adversarial instances reach 100% classification error.

My Notes

One intriguing observation is that a shallow model with no hidden units is nevertheless more robust to adversarial instances created from the deeper models. This calls the claimed generalization of adversarial instances into question. I believe that if the term generality is to hold, a higher degree of susceptibility ought to be observed in this example (and in others too).

I am also happy to see that the unsupervised method is more robust to adversarial instances, as expected, since I believe general AI is only possible with unsupervised learning, which learns the space of the data instead of memorizing things. This is also what I plan to examine after this paper: how newer tools like Variational Auto-Encoders behave against adversarial instances.

I believe it is really hard to fight adversarial instances, especially the ones created by counter-optimization against a particular supervised model. A supervised model always has flaws that can be exploited in this manner, since it memorizes things [ref], and when you go beyond its scope (especially since adversarial instances have low probability), it naturally makes mistakes. Besides, it is known that a neural network converges to a local minimum because its objective is non-convex. Therefore, by definition, it has such weaknesses.

Adversarial instances are, in a practical sense, not a big deal right now. However, this is likely to become a far more important topic as we move toward more advanced AI. Right now, an ML model only makes tolerable mistakes. But consider the advanced systems awaiting us in the near future, used for decisions of great importance such as who is guilty or who has cancer. Then this becomes a question of far greater importance.


The post Paper Notes: Intriguing Properties of Neural Networks appeared first on A Blog From Human-engineer-being.

Source: Erogol – Paper Notes: Intriguing Properties of Neural Networks

What I read lately

  • A continuous distribution on the simplex that approximates discrete (one-hot) vectors and is differentiable with respect to its parameters through the reparametrization trick used in VAEs.
  • It is used for semi-supervised learning.
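
The note above does not name the method. The description matches a Gumbel-softmax style relaxation, so the sketch below assumes that form: add Gumbel noise to the logits, then apply a softmax with a temperature. The function name, the logits and the temperatures are illustrative choices of mine, not taken from the paper.

```python
import math
import random


def relaxed_one_hot_sample(logits, temperature, rng):
    """Draw one point on the simplex that approaches a one-hot vector as
    the temperature goes to zero (Gumbel-softmax style relaxation).
    The randomness enters only through the noise, so the sample is a
    smooth function of the logits (the reparametrization trick)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-300)      # avoid log(0)
        gumbel = -math.log(-math.log(u))   # standard Gumbel noise
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)                          # subtract max for stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mean_max_entry(logits, temperature, n_samples, seed):
    """Average size of the largest coordinate; near 1 means near one-hot."""
    rng = random.Random(seed)
    return sum(
        max(relaxed_one_hot_sample(logits, temperature, rng))
        for _ in range(n_samples)
    ) / n_samples


logits = [1.0, 0.5, -0.5]  # hypothetical unnormalized log-probabilities
sample = relaxed_one_hot_sample(logits, temperature=0.5, rng=random.Random(0))
sharp = mean_max_entry(logits, temperature=0.1, n_samples=500, seed=0)
smooth = mean_max_entry(logits, temperature=5.0, n_samples=500, seed=0)
print(sample, sharp, smooth)
```

Lower temperatures give samples closer to one-hot vectors but with higher-variance gradients, which is the usual trade-off when choosing the temperature.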


  • Learns useful unsupervised image representations using a triplet loss on image patches. The triplet consists of two patches from the same image as the anchor and the positive instance, and a patch from a different image as the negative. It gives a good boost on CIFAR-10 when used as a pretraining method.
  • How would you apply it to a real, large-scale classification problem?
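
For reference, a common form of the triplet loss is sketched below. The exact distance and margin used in the paper are not stated in these notes, so squared Euclidean distance and `margin=1.0` are assumptions, and the 2-d vectors stand in for patch embeddings.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is farther from the anchor than the positive
    by at least the margin; positive otherwise."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical embeddings: anchor and positive come from the same image.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
far_negative = [3.0, 0.0]    # patch from another image, already well separated
near_negative = [0.5, 0.0]   # patch from another image, still too close

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)
print(easy, hard)
```

Only triplets like the second one produce a gradient, which is why the way negatives are chosen matters a lot in practice.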




  • For a 110-layer ResNet, most of the contribution to the gradient updates comes from paths with 10-34 layers.
  • A ResNet trained with only these effective paths has performance comparable to the full ResNet. This is done by sampling paths with lengths in the effective range for each mini-batch.
  • Instead of going deeper, adding more residual connections provides a bigger boost, since residual connections form an exponential ensemble of shallow networks.
  • Removing a residual block from a ResNet at test time causes a negligible drop in performance, in contrast to VGG and GoogLeNet.
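
The "exponential ensemble" view can be made concrete by counting paths. Each residual block can either be passed through or skipped, so n blocks give 2^n paths, and the number of paths through exactly k blocks is the binomial coefficient C(n, k). The sketch below assumes 54 blocks of two layers each for the 110-layer ResNet, so 10-34 layers corresponds to 5-17 blocks. It only counts paths and says nothing about how much gradient each path carries.

```python
from math import comb


def path_length_counts(num_blocks):
    """counts[k] = number of paths that go through exactly k residual blocks."""
    return [comb(num_blocks, k) for k in range(num_blocks + 1)]


def fraction_of_paths(num_blocks, min_blocks, max_blocks):
    """Share of all paths whose length lies in [min_blocks, max_blocks]."""
    counts = path_length_counts(num_blocks)
    return sum(counts[min_blocks:max_blocks + 1]) / sum(counts)


NUM_BLOCKS = 54        # assumed block count for a 110-layer ResNet
LAYERS_PER_BLOCK = 2   # assumed, so 10-34 layers maps to 5-17 blocks

counts = path_length_counts(NUM_BLOCKS)
total_paths = sum(counts)
short_share = fraction_of_paths(NUM_BLOCKS, 10 // LAYERS_PER_BLOCK, 34 // LAYERS_PER_BLOCK)
print(total_paths, short_share)
```

Under these assumptions the paths in the effective range are a small minority of all paths, so sampling only those paths per mini-batch means training on a much smaller set of sub-networks.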



Source: Erogol – What I read lately

Paper review – Understanding Deep Learning Requires Rethinking Generalization


The paper's main claim is the following. Traditional machine learning frameworks (VC dimension, Rademacher complexity, etc.) that try to explain how learning occurs do not explain the success of deep learning models well, and we need more understanding from different perspectives.

They rely on the following empirical observations:

  • Deep networks are able to fit any kind of training data, even white-noise instances with random labels. This implies that neural networks have very good brute-force memorization capacity.
  • Explicit regularization techniques (dropout, weight decay, batch norm) improve generalization, but that does not mean the same network generalizes poorly without them. For instance, an Inception network trained without any explicit regularization reaches 80.38% top-5 accuracy on ImageNet, whereas the reported number for the winner of the ILSVRC 2012 challenge is 83.6%.
  • A two-layer network with 2n+d parameters can represent any function on a sample of n points in d dimensions. They give a proof of this statement in the appendix. On the empirical side, they report the performance of two-layer multi-layer perceptrons on the MNIST and CIFAR-10 datasets.
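
The 2n+d claim can be checked with a small constructive sketch. As far as I follow the appendix, the idea is to project the inputs onto one direction `a` (d parameters), place n biases `b` so that they interleave with the sorted projections, and then solve a triangular system for the n output weights `w`. The code below is my own reading of that idea, with made-up data and random labels, and is not the authors' code.

```python
import random


def project(a, x):
    """Scalar projection a . x that feeds every hidden ReLU unit."""
    return sum(ai * xi for ai, xi in zip(a, x))


def fit_two_layer_relu(xs, ys, seed=0):
    """Return (a, b, w), d + n + n = 2n + d numbers, fitting ys exactly.
    Assumes the projections of the points are all distinct."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    order = sorted(range(n), key=lambda i: project(a, xs[i]))
    z = [project(a, xs[i]) for i in order]
    t = [ys[i] for i in order]
    # Interleave biases: b[0] < z[0] < b[1] < z[1] < ... so that unit j is
    # active on point i only when j <= i (a lower-triangular system).
    b = [z[0] - 1.0] + [(z[j - 1] + z[j]) / 2.0 for j in range(1, n)]
    w = []
    for i in range(n):  # forward substitution
        partial = sum(w[j] * (z[i] - b[j]) for j in range(i))
        w.append((t[i] - partial) / (z[i] - b[i]))
    return a, b, w


def predict(x, a, b, w):
    """Output of the network: sum_j w_j * relu(a . x - b_j)."""
    z = project(a, x)
    return sum(wj * max(z - bj, 0.0) for wj, bj in zip(w, b))


data_rng = random.Random(1)
xs = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(10)]
ys = [data_rng.choice([-1.0, 1.0]) for _ in xs]  # random labels

a, b, w = fit_two_layer_relu(xs, ys)
max_error = max(abs(predict(x, a, b, w) - y) for x, y in zip(xs, ys))
print(len(a) + len(b) + len(w), max_error)
```

The fit is exact on the training points even though the labels are random, which is the memorization capacity the paper points at. It says nothing about how such a network behaves away from those points.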

The above observations raise the following questions and conflicts:

  • The traditional notion of learning suggests stronger regularization as we use more powerful models. However, a large enough network is able to memorize any kind of data, even pure random noise. Moreover, without any explicit regularization these models still generalize well on natural datasets. This shows that, contrary to the general belief, brute-force memorization can still be a good learning method that yields reasonable generalization at test time.
  • Classical approaches are poorly suited to explaining the success of neural networks, and more investigation is needed to understand what is really going on from a theoretical point of view.
  • The generalization power of these networks is not really determined by the explicit techniques; instead, implicit factors like the learning method or the model architecture seem more influential.
  • The explanation of generalization needs to be redefined in order to resolve the conflicts described above.

My take: These large models are able to learn any function (and large no longer means deep), and if there is any kind of information match between the training data and the test data, they are able to generalize well too. One possible explanation is to think of these models as an ensemble of many millions of smaller models, selected by the zeroing effect of the activation functions. Thus, the model can memorize any function thanks to its size and implied capacity, but it still generalizes well thanks to this ensembling effect.



Source: Erogol – Paper review – Understanding Deep Learning Requires Rethinking Generalization