paulfchristiano comments on Experimentally evaluating whether honesty generalizes

paulfchristiano 14 Jul 2021 0:43 UTC
I do expect “explanations of what’s going on in this sentence” to be a lot weaker than translations.
For that task, I expect that the model trained on coherence + similar tasks will outperform a 10x larger pre-trained model. If the larger pre-trained model gets context stuffing on similar tasks, but no coherence training, then it’s less clear to me.
But I guess the point is that the differences between various degrees of successful generalization will be relatively small compared to model size effects. It doesn’t matter so much how good the transfer model is relative to the pre-trained baseline; what matters is how large the differences are between the possible worlds that we are hoping to distinguish.
I guess my main hope there is to try to understand whether there is some setting where transfer works quite well, either getting very close to the model fine-tuned on distribution, or at least converging as the pre-trained model grows. Hopefully that will make it easier to notice the effects we are looking for, and it’s OK if those effects are small relative to model doublings.
(Also worth noting that “as good as increasing model size by 10%” is potentially quite economically relevant. So here I’m mostly just thinking about the extent to which model size effects can make the effects we care about hard to measure.)
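To make "small relative to model doublings" concrete, here is a rough sketch of one way to express a transfer effect in units of model doublings. It assumes, purely for illustration, that the pre-trained baseline's loss is roughly linear in log2(model size) over the range measured; that assumption, the function names, and all the numbers are hypothetical and not from any actual experiment.

```python
import math


def fit_log_linear(sizes, losses):
    """Least-squares fit of loss = a + b * log2(size).

    Assumes (hypothetically) that baseline loss is roughly linear in
    log2(model size) over the measured range. b is the loss change per
    doubling and should be negative if bigger models are better.
    """
    xs = [math.log2(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def effect_in_doublings(sizes, baseline_losses, size, treated_loss):
    """How many doublings of the baseline would match the treated model?

    Positive means the treated (e.g. transfer) model at `size` is as good
    as a baseline model that is 2**result times larger.
    """
    a, b = fit_log_linear(sizes, baseline_losses)
    baseline_at_size = a + b * math.log2(size)
    return (treated_loss - baseline_at_size) / b


# Made-up numbers: baseline loss drops by 0.1 per doubling.
sizes = [1e8, 2e8, 4e8, 8e8]
baseline_losses = [3.0, 2.9, 2.8, 2.7]

# Hypothetical transfer model at 4e8 parameters with slightly lower loss.
doublings = effect_in_doublings(sizes, baseline_losses, 4e8, 2.786)
size_multiplier = 2 ** doublings  # roughly a 10% larger baseline model
```

In this toy setup the effect is about a seventh of a doubling, so distinguishing two worlds that differ by that much requires measurement noise well below the per-doubling loss change.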