Training processes with varying (apparent) situational awareness
1:2.5 The AI seemingly isn’t aware that it is an AI except during a small fraction of training, which isn’t where much of its capabilities come from. For instance: the system is pretrained on next-token prediction; our evidence strongly indicates that it doesn’t know it is an AI while doing next-token prediction (which likely requires being confident that it isn’t internally doing a substantial amount of general-purpose thinking about what to think about); and there is only a small RL process, which isn’t the source of most of the capabilities.
Abilities/intelligence come almost entirely from pretraining, so all the situational awareness and scheming capability that current (and similar future) frontier models possess is mostly present in the base model as well. The fact that you need to prompt them to summon a situationally aware scheming agent doesn’t seem like much of a barrier; indeed, strong frontier base models are so obviously misaligned/jail-breakable/dangerous that releasing them to the public would be PR-harmful enough to motivate RLHF post-training on selfish profit motives alone.
> This implies that restricting when AIs become (saliently) aware that they are an AI could be a promising intervention, to the extent this is possible without greatly reducing competitiveness.
Who cares if it greatly reduces competitiveness in experimental training runs?
We need to figure out how to align superhuman models—models trained with >1e25 efficient FLOPs on the current internet/knowledge—which requires experimental iteration. We probably won’t get multiple iteration attempts at aligning superintelligence (SI) ‘in prod’, so we need to iterate in simulation (what you now call ‘model organisms’).
We need to find alignment training methods that work even when the agent has superhuman intelligence/inference. But ‘superhuman’ here is relative—measured against our capabilities. The straightforward way to accomplish this is training agents in simulations with much earlier knowledge cutoff dates, which isn’t theoretically hard—it just requires constructing augmented historical training datasets. So you could train on a 10T+ token dataset of human writings/thoughts with a cutoff of 2010, or 1950, or 1700, etc. These base models wouldn’t be capable of simulating/summoning realistic situationally aware agents, nor would their RL-derived agents be situationally sim-aware, etc.
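A minimal sketch of the cutoff-parameterized dataset construction described above; the corpus records, dates, and helper name are hypothetical illustrations, not a real data pipeline:

```python
from datetime import date

# Hypothetical corpus records: (text, estimated publication date).
corpus = [
    ("On the Origin of Species ...", date(1859, 11, 24)),
    ("Attention Is All You Need ...", date(2017, 6, 12)),
    ("Principia Mathematica ...", date(1910, 1, 1)),
]

def filter_by_cutoff(records, cutoff):
    """Keep only documents published before the knowledge cutoff,
    approximating a 'historical' pretraining dataset with cutoff T."""
    return [text for text, pub_date in records if pub_date < cutoff]

# A base model pretrained only on pre_1950 text has no data from
# which to self-locate as a modern AI system.
pre_1950 = filter_by_cutoff(corpus, date(1950, 1, 1))
```

In practice the hard part is the "augmented" half—dating documents reliably and scrubbing later-era leakage—rather than the filter itself.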
> Abilities/intelligence come almost entirely from pretraining, so all the situational awareness and scheming capability that current (and similar future) frontier models possess is mostly present in the base model as well.
Yes, but for scheming, we care about whether the AI can self-locate as an AI using its knowledge. The fact that (at a minimum) sampling from the system is required for it to self-locate as an AI might make a big difference here.
> Who cares if it greatly reduces competitiveness in experimental training runs?
Yes, reducing situational awareness for experiments could be viable, though it also makes experimentation less valuable if the situation is no longer analogous. E.g., demonstrating that some measure mitigates scheming doesn’t transfer if the model doesn’t understand its situation well enough to scheme analogously in the first place.
(I also think it is probably very hard to train a system capable of obsoleting top human experts which doesn’t understand that it is an AI, even if you’re willing to take a big competitiveness hit. At lower capability levels this seems more viable, but might make experiments less valuable.)
> Abilities/intelligence come almost entirely from pretraining, so all the situational awareness and scheming capability that current (and similar future) frontier models possess is mostly present in the base model as well.

> Yes, but for scheming, we care about whether the AI can self-locate as an AI using its knowledge. The fact that (at a minimum) sampling from the system is required for it to self-locate as an AI might make a big difference here.
So if your ‘yes’ above is agreeing that capabilities—including scheming—come mostly from pretraining, then I don’t see how it is relevant whether that ability is actually used/executed much in pretraining, as the models we care about will go through post-training, and I doubt you are arguing that post-training will reliably remove scheming.
> I also think it seems probably very hard to train a system capable of obsoleting top human experts which doesn’t understand that it is an AI even if you’re willing to take a big competitiveness hit.
Indeed, but that is entirely the point—by construction!
Conceptually we have a recipe R (architecture, algorithms, compute, etc.) and a training dataset which we can parameterize by time cutoff T. Our objective (for safety research) is not to train a final agent, but to find a safe/good R with minimal capability penalty. All the important results we care about vary with R independently of T, while competitiveness/dangerousness varies strongly with T.
Take the same R but vary the time cutoff T of the training dataset: the dangerousness of the AI will depend heavily on T, but the relative effectiveness of the various configurations of R will not. That is simply a restatement of the ideal requirements for a safe experimental regime: models/algorithms that work well for a T of 1950 will also work for a T of 2020, etc.
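One way to picture this experimental regime is an R × T grid, where each cell trains with recipe R on data up to cutoff T and scores the result. Everything below—the recipe names, cutoffs, and the scoring stub—is a hypothetical sketch of the bookkeeping, not a real training loop:

```python
from itertools import product

# Hypothetical recipes R (training configurations) and dataset cutoffs T.
recipes = ["baseline_rlhf", "honesty_penalty", "process_supervision"]
cutoffs = [1950, 2010, 2020]

def alignment_score(recipe, cutoff):
    """Stub standing in for: train with `recipe` on data up to `cutoff`,
    then evaluate an alignment metric. Returns deterministic dummy values."""
    return (len(recipe) * 7 + cutoff) % 100

# Fill the full R x T grid of experiments.
results = {(r, t): alignment_score(r, t) for r, t in product(recipes, cutoffs)}

def recipe_ranking(cutoff):
    """Rank recipes at a fixed cutoff T. The safe-experimentation claim is
    that this ranking is stable across T, while absolute dangerousness
    of the trained models varies strongly with T."""
    return sorted(recipes, key=lambda r: results[(r, cutoff)], reverse=True)
```

Under this framing, the experiment that matters is comparing `recipe_ranking(T)` across cutoffs: agreement would support transferring the winning R to a later, more dangerous T.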
> So if your ‘yes’ above is agreeing that capabilities—including scheming—come mostly from pretraining, then I don’t see how it is relevant whether that ability is actually used/executed much in pretraining, as the models we care about will go through post-training, and I doubt you are arguing that post-training will reliably remove scheming.
I think scheming is less likely to emerge if it is only selected for / reinforced in a small subset of training that doesn’t account for much of the capabilities. (As discussed in the post here.) If AIs don’t self-locate except in a small post-training phase that doesn’t substantially increase capabilities, then the risk of scheming would be substantially reduced IMO.
That said, it looks like RL is becoming more and more important, to the point that it already substantially increases capabilities.