This seems very useful—thanks for doing it!
Some paper suggestions:
Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
By some of the same authors as “Functions of Increasing Complexity,” this paper takes a toy problem which exhibits a very sharp phase transition and analyzes the hell out of it. A primary aim of the paper is to refute the explanation that this phase transition happens because SGD is randomly searching weight space until it stumbles upon a solution, an explanation which is tempting given that the loss curves for this problem stay essentially flat before rapidly converging to zero. Instead, the authors find that their models make “hidden progress” which is not reflected in the loss curves; this echoes findings from Neel Nanda’s work on grokking. (Speaking of which, this paper also abounds with fascinating tidbits on grokking, including a gearsy analysis of which variants of their toy problem, and which model hyperparameters, do or don’t produce grokking.)
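To make the setup concrete, here’s a minimal sketch of the sparse-parity task: the label is the parity of k fixed coordinates of an n-bit ±1 input, trained with plain SGD on a small MLP. Alongside the loss I log a crude progress proxy (the fraction of first-layer weight mass sitting on the relevant coordinates); this is not the paper’s Fourier-gap analysis, just an illustration of how progress can accumulate while the loss curve looks flat. All hyperparameters here are arbitrary choices of mine.

```python
import torch
import torch.nn as nn

n, k, width, batch = 50, 3, 100, 256
relevant = torch.arange(k)  # the hidden parity set; used here only for monitoring

model = nn.Sequential(nn.Linear(n, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.SoftMarginLoss()  # logistic loss for +/-1 labels

def sample(b):
    x = torch.randint(0, 2, (b, n)).float() * 2 - 1  # uniform +/-1 inputs
    y = x[:, relevant].prod(dim=1, keepdim=True)     # parity of the k relevant bits
    return x, y

for step in range(20001):
    x, y = sample(batch)
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        w = model[0].weight.abs()
        # fraction of first-layer weight mass on the relevant coordinates;
        # this tends to creep upward well before the loss visibly drops
        progress = (w[:, relevant].sum() / w.sum()).item()
        print(f"step {step:6d}  loss {loss.item():.3f}  relevant-weight frac {progress:.3f}")
```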
Diversify and Disambiguate: Learning from Underspecified Data
I’m suggesting this paper because it forms the technical basis for Stuart Armstrong’s work on his concept extrapolation agenda. Given (1) labeled data for which there are many possible proxies which result in good classification performance, and (2) unlabeled data which bears witness to the fact that these proxies can disagree with each other, this paper gives a method for explicitly learning multiple diverse proxies. A possible story about how this is useful for alignment: if one can generate diverse hypotheses which postdict observed human preferences but disagree on novel scenarios, then one may hope to either actively query humans (to get evidence on which hypothesis is correct) or act conservatively (picking actions which are good according to all the various hypotheses).
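For concreteness, here’s a rough sketch in the spirit of the paper’s “diversify” stage, not its exact objective (the paper uses a mutual-information-based term): several heads share a featurizer, each head has to fit the labeled data, and a simple pairwise-agreement penalty on unlabeled target data pushes the heads to disagree wherever the labels underdetermine the answer. The module names, dummy data, and the particular agreement penalty are my own illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_heads, feat_dim, n_classes = 3, 128, 2
featurizer = nn.Sequential(nn.Linear(784, feat_dim), nn.ReLU())  # stand-in for a real backbone
heads = nn.ModuleList([nn.Linear(feat_dim, n_classes) for _ in range(n_heads)])
opt = torch.optim.Adam(list(featurizer.parameters()) + list(heads.parameters()), lr=1e-3)

def diversify_step(x_lab, y_lab, x_unlab, diversity_weight=10.0):
    z_lab, z_unlab = featurizer(x_lab), featurizer(x_unlab)

    # every head must fit the labeled (source) data
    fit_loss = sum(F.cross_entropy(h(z_lab), y_lab) for h in heads) / n_heads

    # pairwise agreement penalty on unlabeled (target) data: heads are pushed
    # to make different predictions wherever the labeled data leaves room
    probs = [F.softmax(h(z_unlab), dim=-1) for h in heads]
    agreement = sum((probs[i] * probs[j]).sum(dim=-1).mean()
                    for i in range(n_heads) for j in range(i + 1, n_heads))

    loss = fit_loss + diversity_weight * agreement
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy usage with random stand-in data
x_lab, y_lab = torch.randn(64, 784), torch.randint(0, n_classes, (64,))
x_unlab = torch.randn(64, 784)
for _ in range(100):
    diversify_step(x_lab, y_lab, x_unlab)
```

The “disambiguate” stage then amounts to picking whichever head looks best under a small amount of extra supervision (a few additional labels, or inspecting visualizations of what each head learned).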
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
Even when a pretrained model has learned robust features for understanding its data, finetuning can sometimes distort these features into proxies which only work well on-distribution, thereby leading to poor OOD performance. This paper studies this “feature distortion” phenomenon and proposes a method to mitigate it: first freeze the pretrained model’s weights and train only a linear probe on top of the pretrained model’s latent representation of the data, and then finetune the pretrained model + linear probe together. This seems relevant to alignment insofar as it seems plausible that large self-supervisedly trained models could learn concepts which robustly correspond to our own concepts; in that case it’d be useful if we could avoid distorting these concepts during finetuning (such as RLHF).
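Here’s a minimal PyTorch sketch of that two-stage recipe as I understand it, with a toy backbone and random data standing in for a real pretrained model and downstream dataset: freeze the backbone, fit the probe, then unfreeze everything and finetune from the probed head.

```python
import torch
import torch.nn as nn

# Toy stand-ins; in practice the backbone would be a pretrained feature extractor.
feat_dim, n_classes = 128, 10
backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, feat_dim), nn.ReLU())
probe = nn.Linear(feat_dim, n_classes)
model = nn.Sequential(backbone, probe)
loss_fn = nn.CrossEntropyLoss()

def run_epoch(opt, loader):
    for x, y in loader:
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Dummy labeled downstream data for illustration.
loader = [(torch.randn(32, 1, 28, 28), torch.randint(0, n_classes, (32,)))
          for _ in range(10)]

# Stage 1 (linear probing): freeze the backbone and train only the head,
# so the pretrained features are left untouched while the head is fit.
for p in backbone.parameters():
    p.requires_grad = False
run_epoch(torch.optim.Adam(probe.parameters(), lr=1e-3), loader)

# Stage 2 (full finetuning): unfreeze and train everything, starting from the
# probed head rather than a random one, which is what limits feature distortion.
for p in backbone.parameters():
    p.requires_grad = True
run_epoch(torch.optim.SGD(model.parameters(), lr=1e-4), loader)
```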
[Meta: if you include these papers in future roundups, feel free to either use these blurbs or toss them out and write your own. I had originally planned to just write something which pitched the papers and their alignment relevance to you (Quintin), but I guess they kinda turned more into the sort of opinions you had written.]