Another similar result was that AlphaFold was trained on its own high-confidence predictions for protein sequences with unknown structures:
The AlphaFold architecture is able to train to high accuracy using only supervised learning on PDB data, but we are able to enhance accuracy (Fig. 4a) using an approach similar to noisy student self-distillation [35]. In this procedure, we use a trained network to predict the structure of around 350,000 diverse sequences from Uniclust30 [36] and make a new dataset of predicted structures filtered to a high-confidence subset. We then train the same architecture again from scratch using a mixture of PDB data and this new dataset of predicted structures as the training data, in which the various training data augmentations such as cropping and MSA subsampling make it challenging for the network to recapitulate the previously predicted structures. This self-distillation procedure makes effective use of the unlabelled sequence data and considerably improves the accuracy of the resulting network.
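The procedure described is standard noisy-student self-distillation: a teacher trained on labelled data pseudo-labels a large unlabelled pool, the pseudo-labels are filtered by confidence, and a fresh student is trained on the mixture under noise. Here is a minimal, self-contained toy sketch of that loop in Python, using a scikit-learn classifier and synthetic data purely as stand-ins; none of these names or settings reflect AlphaFold's actual code, data, or confidence metric.

```python
# Toy noisy-student self-distillation: a classifier stands in for the structure
# predictor, predicted class probabilities stand in for model confidence, and
# feature noise stands in for augmentations like cropping / MSA subsampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# A small "labelled" set (the PDB analogue) and a much larger unlabelled pool.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_labelled, y_labelled = X[:500], y[:500]
X_unlabelled = X[500:]

# 1. Train a teacher on the labelled data only (the supervised baseline).
teacher = RandomForestClassifier(n_estimators=200, random_state=0)
teacher.fit(X_labelled, y_labelled)

# 2. Predict labels for the unlabelled pool and keep a high-confidence subset.
probs = teacher.predict_proba(X_unlabelled)
confidence = probs.max(axis=1)
keep = confidence >= 0.9
X_pseudo = X_unlabelled[keep]
y_pseudo = probs[keep].argmax(axis=1)

# 3. Retrain the same architecture from scratch on real plus pseudo-labelled
#    data, adding noise so the student cannot simply memorize the teacher.
X_mixed = np.vstack([X_labelled, X_pseudo + rng.normal(0.0, 0.3, X_pseudo.shape)])
y_mixed = np.concatenate([y_labelled, y_pseudo])
student = RandomForestClassifier(n_estimators=200, random_state=1)
student.fit(X_mixed, y_mixed)
```

The noise injected in step 3 plays the role that cropping and MSA subsampling play in the quoted passage: it makes it hard for the student to recapitulate the teacher's predictions directly, so it has to generalize from them instead.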