how to guide the pretraining process in a way that benefits alignment
One key question here, I think: a major historical alignment concern has been that for any given finite set of outputs, there is an unbounded number of functions that could have produced it, so it's hard to be sure a model will generalize in a desirable way. Nora Belrose goes so far as to suggest that 'Alignment worries are quite literally a special case of worries about generalization.' This is relevant for post-training, but I think even more so for pre-training.
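To make the underdetermination point concrete, here's a toy numpy sketch (my own illustration, not drawn from any particular paper): two functions that agree exactly on a finite training set but diverge badly away from it, so the training outputs alone can't tell you which one has been learned.

```python
import numpy as np

x_train = np.linspace(-1, 1, 5)   # five training inputs
y_train = x_train                 # finite set of outputs, generated by the "intended" rule f(x) = x

def f_intended(x):
    # The simple rule we'd hope gets learned.
    return x

def f_alternative(x):
    # Agrees with f_intended on every training point: the added term contains a
    # factor (x - x_i) for each training input x_i, so it vanishes on the data,
    # but it grows quickly away from the training points.
    bump = np.ones_like(x)
    for xi in x_train:
        bump = bump * (x - xi)
    return x + 5.0 * bump

x_test = np.linspace(-2, 2, 9)    # inputs between and beyond the training points

print(np.allclose(f_intended(x_train), f_alternative(x_train)))    # True: identical on the training data
print(np.max(np.abs(f_intended(x_test) - f_alternative(x_test))))  # large: very different off the data
```

In practice it's the inductive biases of the architecture and optimizer that break this tie, which (as I understand it) is what the generalization literature mentioned below is about.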
I know that there's been research into how neural networks generalize, both from the AIS community and the broader ML community, but I'm not very familiar with it; hopefully someone else can provide some good references here.