I’m not sure. Typically, the justification for these sorts of distillation/compression papers is purely compute: the original teacher model is too big to run on a phone or as a service (Hinton), or too slow, or would be too big to run at all without ‘sharding’ it somehow, or it fits but training it to full convergence would take too long (Gao). You don’t usually see arguments that the student is intrinsically superior in intelligence and so ‘amplified’ in any kind of AlphaGo-style way (AlphaGo being one of the more common examples given for amplification). They do do something which sorta looks iterated, by feeding the pseudo-labels back into the same model:
In order to achieve the state of the art, our researchers used the weakly supervised ResNeXt-101-32x48 teacher model to select pretraining examples from the same data set of one billion hashtagged images. The target ResNet-50 model is pretrained with the selected examples and then fine-tuned with the ImageNet training data set. The resulting semi-weakly supervised ResNet-50 model achieves 81.2 percent top-1 accuracy. This is the current state of the art for the ResNet-50 ImageNet benchmark model. The top-1 accuracy is 3 percent higher than the (weakly supervised) ResNet-50 baseline, which is pretrained and fine-tuned on the same data sets with exactly the same training data set and hyper-parameters.
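For concreteness, here is a minimal toy sketch of that teacher→student round, assuming a PyTorch-style setup; the tiny MLPs and random tensors are stand-ins for the actual ResNeXt-101-32x48 teacher, ResNet-50 student, and billion-image hashtag pool, and the selection/training details are heavily simplified.

```python
# Toy sketch only: stand-in models and random data, not the actual pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES, DIM = 10, 32

def make_model():
    # Stand-in for the ResNeXt-101 teacher / ResNet-50 student.
    return nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))

def train(model, x, y, epochs=5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()

def distill_round(teacher, unlabeled, labeled_x, labeled_y, k=200):
    """One teacher->student round: the teacher pseudo-labels and selects its most
    confident unlabeled examples, a fresh student pretrains on them, then the
    student is fine-tuned on the real labeled set."""
    with torch.no_grad():
        probs = F.softmax(teacher(unlabeled), dim=1)
    conf, pseudo_y = probs.max(dim=1)
    # The real pipeline ranks and takes top-K *per class*; simplified here to a global top-K.
    keep = conf.topk(k).indices
    student = make_model()
    train(student, unlabeled[keep], pseudo_y[keep])  # 'pretrain' on the pseudo-labeled selection
    train(student, labeled_x, labeled_y)             # fine-tune on the labeled data
    return student

# Random tensors stand in for the billion hashtagged images and the ImageNet training set.
teacher = make_model()
unlabeled = torch.randn(5000, DIM)
labeled_x, labeled_y = torch.randn(1000, DIM), torch.randint(0, NUM_CLASSES, (1000,))
student = distill_round(teacher, unlabeled, labeled_x, labeled_y)
```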
But this may top out at one or two iterations, and they don’t demonstrate that this would be better than any other clearly non-iterated semi-supervised learning method (like MixMatch).
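To make the contrast with genuine iteration concrete, a truly iterated version would loop the round above, with each fine-tuned student becoming the next teacher (continuing the toy sketch; this is hypothetical, since the paper effectively stops after a single round):

```python
# Hypothetical iterated version, reusing distill_round from the sketch above:
# each round's student becomes the next round's teacher. Nothing in the paper
# shows this keeps improving past the first round or two.
model = teacher
for _ in range(3):
    model = distill_round(model, unlabeled, labeled_x, labeled_y)
```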