Generalization and random labels

Why is the result in [Understanding deep learning requires rethinking generalization] interesting?

Edit on Feb. 23, 2017: the above paper has now been voted ICLR 2017 best paper.

Let’s think about model fitting and generalization using an example from Christopher Bishop’s 2006 book: fitting a polynomial to noisy data sampled from a sine function (which we treat as unknown).


We usually choose the model complexity (the “hypothesis class” or network architecture) by validating on a hold-out set. Suppose we have a hold-out set for the data above, not shown here. It would most likely lead us to choose the model / hypothesis class with hyper-parameter M=3 (a polynomial of order 3), since it closely matches the (unknown) function used to generate the (noisy) data.


We can expect this model to generalize reasonably well to a test set, since it already generalized well from the training to the hold-out set (ignoring here the fact that optimizing a model on a hold-out set can itself lead to overfitting).
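The selection procedure described above can be sketched in a few lines of NumPy. This is only an illustration: the sample sizes, the noise level, the candidate orders and the random seed are assumptions made here, not values taken from Bishop's book.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
NOISE_STD = 0.1                 # assumed noise level

def noisy_sine(x):
    """Targets from the 'unknown' function sin(2*pi*x) plus Gaussian noise."""
    return np.sin(2 * np.pi * x) + rng.normal(0.0, NOISE_STD, size=x.shape)

def rmse(model, x, y):
    """Root-mean-square error of a fitted polynomial on (x, y)."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

# Small training set on a regular grid, larger hold-out set at random inputs.
x_train = np.linspace(0.0, 1.0, 10)
y_train = noisy_sine(x_train)
x_hold = rng.uniform(0.0, 1.0, size=100)
y_hold = noisy_sine(x_hold)

candidates = [0, 1, 3, 9]  # polynomial orders M to compare (assumed set)
train_rmse, holdout_rmse = {}, {}
for m in candidates:
    model = Polynomial.fit(x_train, y_train, deg=m)  # least-squares fit
    train_rmse[m] = rmse(model, x_train, y_train)
    holdout_rmse[m] = rmse(model, x_hold, y_hold)
    print(f"M={m}: train RMSE={train_rmse[m]:.3f}, hold-out RMSE={holdout_rmse[m]:.3f}")

# Model selection: keep the order with the lowest hold-out error.
best_m = min(holdout_rmse, key=holdout_rmse.get)
print("selected M =", best_m)
```

The training error can only shrink as M grows, so it cannot be used for selection. The hold-out error is what separates an underfitting model (M=0 or M=1) from one that matches the underlying function.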

Now, can we expect the same model / hypothesis class (polynomial of order M=3) to perform well on the same training inputs when the labels are replaced with random labels? In our case of fitting a polynomial, this means pairs (x,y) in which x is unchanged and the y values are drawn at random.


The model will most probably not fit this semi-random data. The model with M=3 overfits little on the original (non-random) data because it can express regularities in the data, which arise from both the marginal distribution of the inputs and the full data distribution, i.e. including the labels. Replacing the labels with random labels removes these regularities. As a consequence, the model is not powerful enough to fit the data.
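To make this concrete, here is a small sketch that fits the same M=3 polynomial once to noisy sine targets and then to random targets at the same inputs. The noise level, the range of the random labels, the number of repetitions and the seed are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # arbitrary seed
N, NOISE_STD, M = 10, 0.1, 3    # assumed sample size and noise level; M=3 as in the text

x = np.linspace(0.0, 1.0, N)

def train_rmse(y, deg=M):
    """Fit a polynomial of the given order to (x, y) and return its training RMSE."""
    model = Polynomial.fit(x, y, deg=deg)
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

# Original task: labels carry the regularity of the sine function.
y_clean = np.sin(2 * np.pi * x) + rng.normal(0.0, NOISE_STD, size=N)
rmse_clean = train_rmse(y_clean)

# Random-label task: same inputs, labels drawn uniformly from [-1, 1].
# Averaged over many draws so the comparison does not hinge on one sample.
rmse_random = float(np.mean(
    [train_rmse(rng.uniform(-1.0, 1.0, size=N)) for _ in range(200)]
))

print(f"training RMSE, sine labels:   {rmse_clean:.3f}")
print(f"training RMSE, random labels: {rmse_random:.3f} (mean over 200 draws)")
```

With only four free coefficients, the cubic cannot pass through ten arbitrary targets, so its training error on random labels stays well above its error on the structured data.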

However, the authors of the paper above showed that this is not the case for a large part of the deep models currently used in vision. Well-known models like VGG can fit very large amounts of training data (like ImageNet) with random labels, although the same models, with the same capacity (the same number of parameters), have been shown to generalize well from the training set to a hold-out set.

These results were obtained without regularization (no weight decay, no batch normalization, no dropout).
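As a toy analogy only (this is not the experiment from the paper, and a polynomial is not a deep network), the sketch below shows how capacity alone decides whether random labels can be memorized. With ten training points, an order-9 polynomial has as many coefficients as there are data points and can interpolate any labels, while the order-3 model cannot. The label range and seed are assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # arbitrary seed
N = 10                          # assumed number of training points

x = np.linspace(0.0, 1.0, N)
y_random = rng.uniform(-1.0, 1.0, size=N)    # labels with no structure at all
y_fresh = rng.uniform(-1.0, 1.0, size=N)     # independent labels at the same inputs

train_rmse, fresh_rmse = {}, {}
for deg in (3, 9):
    model = Polynomial.fit(x, y_random, deg=deg)  # unregularized least squares
    train_rmse[deg] = float(np.sqrt(np.mean((model(x) - y_random) ** 2)))
    fresh_rmse[deg] = float(np.sqrt(np.mean((model(x) - y_fresh) ** 2)))
    print(f"M={deg}: train RMSE={train_rmse[deg]:.2e}, "
          f"RMSE on fresh random labels={fresh_rmse[deg]:.3f}")
```

The order-9 model memorizes the random labels (training error at the level of numerical round-off) yet predicts fresh random labels no better than chance. In the polynomial setting, such a high-capacity model is also the one that overfits the sine data, which is what makes the behavior reported for deep networks surprising.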