So if your ‘yes’ above is agreeing that capabilities, including scheming, come mostly from pretraining, then I don’t see how it is relevant whether that ability is actually used/executed much in pretraining: the models we care about will go through post-training, and I doubt you are arguing that post-training will reliably remove scheming.
I think scheming is less likely to emerge if it is only selected for / reinforced in a small subset of training that doesn’t contribute much to capabilities. (As discussed in the post here.) If AIs don’t self-locate except in a small post-training phase that doesn’t substantially increase capabilities, then the risk of scheming would be substantially reduced IMO.
That said, it looks like RL is becoming more and more important, to the point that it already substantially increases capabilities.