This feels kinda unrealistic for the kind of pretraining that’s common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we *condition on* the latter, it seems kinda reasonable to imagine there must be cases where an AI has to be able to do needle-moving alignment research in order to improve at next-token prediction, and this feels like a reasonable way that might happen.
For what little it’s worth, I mostly don’t buy this hypothetical (see e.g. here), but if I force myself to accept it, I think I’m tentatively on Holden’s side.
I’m not sure this paragraph will be helpful for anyone but me, but I wound up with a mental image vaguely like a thing I wrote long ago about “Straightforward RL” versus “Gradient descent through the model”, with the latter kinda like what you would get from next-token prediction. Again, I’m kinda skeptical that things like “gradient descent through the model” would work at all in practice, mainly because the model is only seeing a sporadic surface trace of the much richer underlying processing; but if I grant that they would (for the sake of argument), then it seems pretty plausible to me that the resulting model would have things like a “strong preference to generally fit in and follow norms”, and thus would do fine at POUDA-avoidance.