Facebook AI releases a new SOTA “weakly semi-supervised” learning system for video and image classification. I’m posting this here because even though it’s about capabilities, the architecture includes a component somewhat similar to amplification, where a higher-capacity teacher decides how to train a lower-capacity student model.
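Concretely, the teacher/student part amounts to: the high-capacity teacher scores a huge unlabeled pool, the top-K highest-scoring images per class become pseudo-labeled training data, and the lower-capacity student is trained on that and then fine-tuned on the real labeled set. A minimal sketch of what I mean (the model/loader names, K, and the training schedule are my own placeholders, not details from the paper):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, unlabeled_loader, k_per_class, num_classes):
    """Score every unlabeled image with the teacher, keep the top-K per class."""
    teacher.eval()
    scores, images = [], []
    for x in unlabeled_loader:          # loader yields image batches, no labels
        scores.append(F.softmax(teacher(x), dim=1))
        images.append(x)
    scores = torch.cat(scores)          # (N, num_classes)
    images = torch.cat(images)
    dataset = []
    for c in range(num_classes):
        top = scores[:, c].topk(k_per_class).indices
        dataset += [(images[i], c) for i in top]
    return dataset

def train_student(student, pseudo_dataset, labeled_loader, opt, epochs=1):
    """Pre-train the student on pseudo-labels, then fine-tune on real labels."""
    student.train()
    for _ in range(epochs):
        for x, y in pseudo_dataset:
            loss = F.cross_entropy(student(x.unsqueeze(0)), torch.tensor([y]))
            opt.zero_grad(); loss.backward(); opt.step()
    for _ in range(epochs):
        for x, y in labeled_loader:
            loss = F.cross_entropy(student(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
```

The “decides how to train” part is nothing more exotic than ranking the teacher’s confidences over the unlabeled pool.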
What’s special there is the semi-supervised part (training on unlabeled data to generate pseudo-labels, which are then used to train the student model). Using a high-capacity teacher on hundreds of millions of images is not all that new: for example, Google was doing that on its JFT dataset (then ~100m noisily-labeled images) back in at least 2015, given “Distilling the Knowledge in a Neural Network”, Hinton, Vinyals & Dean 2015. Or Gao et al 2017, which goes the other direction and tries to distill dozens of teachers into a single student using 400m images in 100k classes.
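For reference, the Hinton-style objective is just cross-entropy against the teacher’s temperature-softened outputs, usually mixed with the ordinary hard-label loss; a minimal sketch, with the temperature and mixing weight as arbitrary illustrative values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton et al 2015: match the teacher's temperature-softened distribution,
    plus a small amount of ordinary cross-entropy on the true labels.
    T and alpha are arbitrary illustrative values, not values from the paper."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # The KL term is scaled by T^2 so its gradients stay comparable across temperatures.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```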
(See also: Gross et al 2017/Sun et al 2017/Gao et al 2017/Shazeer et al 2018/Mahajan et al 2018/Yalniz et al 2019 or GPipe scaling to 1663-layer/83.4b-parameter Transformers)
Interesting, I somehow hadn’t seen this. Thanks! (Editing to reflect this as well.)
I’m curious: even though this isn’t new, do you agree with my vague claim that the fact that this approach and the one in the paper you linked work at all says something about the feasibility of amplification-style strategies?
I’m not sure. Typically, the justification for these sorts of distillation/compression papers is purely compute: the original teacher model is too big to run on a phone or as a service (Hinton), or too slow, or would be too big to run at all without ‘sharding’ it somehow, or it fits but training it to full convergence would take too long (Gao). You don’t usually see arguments that the student is intrinsically superior in intelligence and so ‘amplified’ in any kind of AlphaGo-style way, which is one of the more common examples of amplification. They do do something which sorta looks iterated by feeding the pseudo-labels back into the same model:
But this may top out at one or two iterations, and they don’t demonstrate that it would be better than a clearly non-iterated semi-supervised learning method (like MixMatch).
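To make the ‘sorta looks iterated’ point concrete, the loop just reuses the newly trained model as the next round’s teacher, so each extra round presumably adds less new signal, which is why I’d expect it to top out quickly. A sketch, reusing the hypothetical `pseudo_label`/`train_student` helpers from above (the round count and K are arbitrary):

```python
def iterated_self_training(model, unlabeled_loader, labeled_loader, make_opt,
                           rounds=2, k_per_class=1000, num_classes=1000):
    """Each round: the current model pseudo-labels the unlabeled pool, then the
    model is retrained on those pseudo-labels plus the labeled data.
    Helpers and hyperparameters are placeholders, not details from the paper."""
    for _ in range(rounds):
        pseudo = pseudo_label(model, unlabeled_loader, k_per_class, num_classes)
        train_student(model, pseudo, labeled_loader, make_opt(model))
    return model
```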