Facebook AI releases a new SOTA “weakly semi-supervised” learning system for video and image classification. I’m posting this here because even though it’s about capabilities, the architecture includes a component somewhat similar to amplification, where a higher-capacity teacher decides how to train a lower-capacity student model.
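Concretely, the teacher/student part amounts to: the high-capacity teacher scores a huge unlabeled pool, the top-K highest-scoring images per class become pseudo-labeled training data, and the lower-capacity student is trained on that and then fine-tuned on the real labeled set. A minimal sketch of what I mean (the model/loader names, K, and the training schedule are my own placeholders, not details from the paper):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, unlabeled_loader, k_per_class, num_classes):
    """Score every unlabeled image with the teacher, keep the top-K per class."""
    teacher.eval()
    scores, images = [], []
    for x in unlabeled_loader:          # loader yields image batches, no labels
        scores.append(F.softmax(teacher(x), dim=1))
        images.append(x)
    scores = torch.cat(scores)          # (N, num_classes)
    images = torch.cat(images)
    dataset = []
    for c in range(num_classes):
        top = scores[:, c].topk(k_per_class).indices
        dataset += [(images[i], c) for i in top]
    return dataset

def train_student(student, pseudo_dataset, labeled_loader, opt, epochs=1):
    """Pre-train the student on pseudo-labels, then fine-tune on real labels."""
    student.train()
    for _ in range(epochs):
        for x, y in pseudo_dataset:
            loss = F.cross_entropy(student(x.unsqueeze(0)), torch.tensor([y]))
            opt.zero_grad(); loss.backward(); opt.step()
    for _ in range(epochs):
        for x, y in labeled_loader:
            loss = F.cross_entropy(student(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
```

The “decides how to train” part is nothing more exotic than ranking the teacher’s confidences over the unlabeled pool.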
What’s special there is the semi-supervised part (training on unlabeled data to generate pseudo-labels, which are then used to train the student model). Using a high-capacity teacher on hundreds of millions of images is not all that new: for example, Google was doing that on its JFT dataset (then ~100m noisily-labeled images) back in at least 2015, given “Distilling the Knowledge in a Neural Network”, Hinton, Vinyals & Dean 2015. Or Gao et al 2017, which goes the other direction and tries to distill dozens of teachers into a single student using 400m images in 100k classes.
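For reference, the Hinton-style objective is just cross-entropy against the teacher’s temperature-softened outputs, usually mixed with the ordinary hard-label loss; a minimal sketch, with the temperature and mixing weight as arbitrary illustrative values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton et al 2015: match the teacher's temperature-softened distribution,
    plus a small amount of ordinary cross-entropy on the true labels.
    T and alpha are arbitrary illustrative values, not values from the paper."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # The KL term is scaled by T^2 so its gradients stay comparable across temperatures.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```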
(See also: Gross et al 2017/Sun et al 2017/Gao et al 2017/Shazeer et al 2018/Mahajan et al 2018/Yalniz et al 2019 or GPipe scaling to 1663-layer/83.4b-parameter Transformers)
Interesting, I somehow hadn’t seen this. Thanks! (Editing to reflect this as well.)
I’m curious: even though this isn’t new, do you agree with my vague claim that the fact that this approach and the one in the paper you linked work at all says something about the feasibility of amplification-style strategies?
I’m not sure. Typically, the justification for these sorts of distillation/compression papers is purely compute: the original teacher model is too big to run on a phone or as a service (Hinton), or too slow, or would be too big to run at all without ‘sharding’ it somehow, or it fits but training it to full convergence would take too long (Gao). You don’t usually see arguments that the student is intrinsically superior in intelligence and so ‘amplified’ in any kind of AlphaGo-style way, which is one of the more common examples of amplification. They do do something which sorta looks iterated by feeding the pseudo-labels back into the same model:
But this may top out at one or two iterations, and they don’t demonstrate that it would be better than a clearly non-iterated semi-supervised learning method (like MixMatch).
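To make the ‘sorta looks iterated’ point concrete, the loop just reuses the newly trained model as the next round’s teacher, so each extra round presumably adds less new signal, which is why I’d expect it to top out quickly. A sketch, reusing the hypothetical `pseudo_label`/`train_student` helpers from above (the round count and K are arbitrary):

```python
def iterated_self_training(model, unlabeled_loader, labeled_loader, make_opt,
                           rounds=2, k_per_class=1000, num_classes=1000):
    """Each round: the current model pseudo-labels the unlabeled pool, then the
    model is retrained on those pseudo-labels plus the labeled data.
    Helpers and hyperparameters are placeholders, not details from the paper."""
    for _ in range(rounds):
        pseudo = pseudo_label(model, unlabeled_loader, k_per_class, num_classes)
        train_student(model, pseudo, labeled_loader, make_opt(model))
    return model
```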