Abilities/intelligence come almost entirely from pretraining, so all the situational awareness and scheming capability that current (and similar future) frontier models possess is also mostly present in the base model.
Yes, but for scheming, we care about whether the AI can self-locate as an AI using its knowledge. The fact that (at a minimum) sampling from the system is required for it to self-locate as an AI might make a big difference here.
So if your ‘yes’ above is agreeing that capabilities, including scheming, come mostly from pretraining, then I don’t see how it is relevant whether that ability is actually used/executed much during pretraining, since the models we care about will go through post-training, and I doubt you are arguing that post-training will reliably remove scheming.
I also think it is probably very hard to train a system capable of obsoleting top human experts that doesn’t understand it is an AI, even if you’re willing to take a big competitiveness hit.
Indeed, but that is entirely the point: by construction!
Conceptually we have a recipe R (architecture, algorithms, compute, etc.) and a training dataset that we can parameterize by its time cutoff T. Our objective (for safety research) is not to train a final agent, but to find a safe/good R with minimal capability penalty. All the important results we care about vary with R independently of T, whereas competitiveness/dangerousness varies strongly with T.
Take the same R but vary the time cutoff T of the training dataset: the dangerousness of the AI will depend heavily on T, but the relative effectiveness of different configurations of R will not. That is simply a restatement of the ideal requirements for a safe experimental regime: models/algorithms that work well with a T of 1950 will also work with a T of 2020, etc.
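To make that experimental logic concrete, here is a minimal sketch (my own illustration, not from the comment): a hypothetical train_and_eval helper stands in for real training runs, and the recipe names and toy numbers are placeholders. The point it demonstrates is that only the cutoff T is varied across a fixed set of recipe configurations, and what we check is that the ranking of recipes stays stable across cutoffs even though absolute capability grows with T.

```python
def train_and_eval(recipe: str, cutoff: int) -> float:
    """Hypothetical placeholder for 'train recipe R on data up to year `cutoff`
    and measure capability'; a real experiment would run actual training."""
    relative_strength = {"R_baseline": 1.0, "R_safe_v1": 0.95, "R_safe_v2": 0.85}
    # Toy model: absolute capability grows with the data cutoff T,
    # while the relative ordering of recipes is independent of T.
    return relative_strength[recipe] * (cutoff - 1900) / 120.0

recipes = ["R_baseline", "R_safe_v1", "R_safe_v2"]
cutoffs = [1950, 1990, 2020]

scores = {T: {R: train_and_eval(R, T) for R in recipes} for T in cutoffs}

# Requirement for the safe experimental regime: the ordering of recipes agrees
# across cutoffs, so conclusions reached at a harmless early T transfer to later T.
rankings = {T: sorted(recipes, key=lambda R: scores[T][R], reverse=True)
            for T in cutoffs}
assert all(rankings[T] == rankings[cutoffs[0]] for T in cutoffs), "ranking not stable"

for T in cutoffs:
    print(T, rankings[T], {R: round(scores[T][R], 3) for R in recipes})
```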
So if your ‘yes’ above is agreeing that capabilities, including scheming, come mostly from pretraining, then I don’t see how it is relevant whether that ability is actually used/executed much during pretraining, since the models we care about will go through post-training, and I doubt you are arguing that post-training will reliably remove scheming.
I think scheming is less likely to emerge if it is only selected for / reinforced in a small subset of training that isn’t responsible for much of the capabilities. (As discussed in the post here.) If AIs don’t self-locate except in a small post-training phase that doesn’t substantially increase capabilities, then the risk of scheming would be substantially reduced IMO.
That said, it looks like RL is becoming increasingly important, to the point that it already substantially increases capabilities.