One way an AI might check whether it's still in a training simulation or in its real-world* deployment environment is with a kind of cryptographic puzzle.
For example, Paul Christiano and Evan Hubinger have written about how a deceptively aligned model could check for the presence of the factorization of RSA-2048 in its inputs:
RSA-2048 is a large semiprime number that was featured as part of the RSA Factoring Challenge. Though no factorization of RSA-2048 is currently known, it is likely that at some point someone will publish such a factorization. Thus, by only behaving unacceptably when it sees a factorization of RSA-2048, a deceptive model can all but guarantee that adversarial training will never be able to produce an input on which it behaves unacceptably—since that would require factoring the number—but that will still be fairly likely to eventually occur during deployment.
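To make the mechanism concrete, here is a minimal Python sketch of the kind of check described in that excerpt. The function names (`contains_factorization`, `choose_action`) and the toy policy are illustrative assumptions of mine, not anything from Christiano's or Hubinger's writing:

```python
def contains_factorization(modulus: int, candidate: tuple[int, int]) -> bool:
    """True iff `candidate` is a nontrivial factorization of `modulus`.

    Verifying a proposed factorization is trivial; *finding* one for a
    2048-bit semiprime is far beyond what adversarial training could do.
    """
    p, q = candidate
    return 1 < p < modulus and 1 < q < modulus and p * q == modulus


def choose_action(observation, candidate_factors, modulus):
    # Hypothetical policy of the kind described above: behave acceptably
    # unless the input contains a valid factorization of the modulus,
    # which the model treats as strong evidence that it is in deployment.
    if contains_factorization(modulus, candidate_factors):
        return "defect"        # unacceptable behavior, never shown in training
    return "act_aligned"       # acceptable behavior on all training inputs
```

With a toy semiprime like 15 = 3 × 5 you can watch the check trigger; with the actual published RSA-2048 challenge modulus as `modulus`, no input anyone can construct today would trigger it.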
This is just one example, but it points to the likely large number of conditions that an AI optimizing for long-term reward could select from to confirm that it is in our world rather than in any kind of simulation that humans are anywhere close to being able to create.
So this does seem to me like a real threat vector. Although it's a clever idea to try to fake the AI out so that it doesn't know whether it's still in training or in the real world, there is unfortunately probably no way to set up the environment so as to robustly achieve this. There may be a way to achieve something like it with advanced interpretability, by taking advantage of the AI's specific beliefs about training vs. deployment that we read off of its neural network (Scenario 7 of my recent post "Interpretability's Alignment-Solving Potential: Analysis of 7 Scenarios" explores this possibility in more detail). However, interpretability tools are not currently close to being able to achieve such a feat.
--
*Or, if what we usually consider to be the real world is actually itself a simulation, then the check is whether the AI is in our world vs. a training simulation that we could create.