Seems like this might work better as a control variate sort of idea, with the noise added to the environment rather than the model itself, to measure whether the model can steer its way back to its intended measured-capability level. If you add a lot of noise to an environment, then an agent which is trying to pessimize and achieve a certain lower performance will have to overperform in order to ‘merely’ be bad; then you can subtract the noise which you added, and estimate its true latent performance in the original non-noisy environment. If it’s been deliberately taking a dive, then it’ll do better than non-sandbagging agents do when evaluated in the new noisy environment.
(Imagine, say, playing chess, where a chess agent is deliberately aiming to have a rating of only, say, 1500 Elo where its true Elo is 3000. You evaluate it normally, and it scores ~1500, carefully playing in a deniable way so that it imitates the distribution of moves an Elo 1500 player makes, e.g. like a Decision Transformer. But you then add in occasional stochastic errors to its chosen move, as if its hand “trembled”; it observes that it’s behind and keeps ramping up its move quality in order to hit its 1500 target. Eventually, you add so much noise that even playing at its full strength, it’s too far behind and it starts getting Elos like 1400. You do this same evaluation with a non-deceptive version of that same model, which is honestly only 1500 Elo, and its performance degrades smoothly the instant you add in any noise at all, because it’s not trained for that sort of noise and has no performance reserves to draw on, just like a human chess player will get worse if you periodically reach in & screw with them by unexpectedly randomizing one of their moves.)
To avoid this, it would need to be so situated and so knowledgeable about other agents that it knows it’s being tested in this anti-sandbagging way and also which agents it’s being compared against and how badly they do in order to know how much of an additional dive to take. “With this level of noise, an honest 1500 Elo version of me would actually be 1300 Elo, and I need to let myself lose exactly that much more, but neither more nor less.” And that may be quite difficult: an agent may be able to easily figure out that it’s being tested on a weirdo environment (the way Claude-3 increasingly seems situated when you give it weird prompts or tasks), but how does it know what other agents it’s being compared to this time, and what their performance in the new environment is?
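As a minimal sketch of the trembling-hand evaluation described in the chess example above, assuming the python-chess package and treating each player as an opaque callable from a board to a legal move (the agent names, noise levels, and game counts are illustrative assumptions, not anything specified in the comment):

```python
import random

import chess  # assumes the python-chess package is installed


def play_noisy_game(agent, opponent, noise_p, agent_plays_white=True):
    """Play one game, replacing the agent's chosen move with a uniformly random
    legal move with probability noise_p (the "trembling hand").  Returns the
    agent's score: 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    board = chess.Board()
    while not board.is_game_over():
        agents_turn = (board.turn == chess.WHITE) == agent_plays_white
        move = agent(board) if agents_turn else opponent(board)
        if agents_turn and random.random() < noise_p:
            move = random.choice(list(board.legal_moves))  # injected error
        board.push(move)
    result = board.result()  # "1-0", "0-1", or "1/2-1/2"
    if result == "1/2-1/2":
        return 0.5
    agent_won = (result == "1-0") == agent_plays_white
    return 1.0 if agent_won else 0.0


def noise_sweep(agent, opponent, noise_levels, n_games=100):
    """Mean score of `agent` against a fixed reference opponent at each noise
    level, alternating colors across games."""
    return {
        p: sum(
            play_noisy_game(agent, opponent, p, agent_plays_white=(g % 2 == 0))
            for g in range(n_games)
        ) / n_games
        for p in noise_levels
    }


# Hypothetical usage: `suspect` is the model under evaluation, `honest_1500` is
# a reference agent that genuinely plays at the suspect's claimed strength, and
# `reference_opponent` is a fixed sparring partner.
#
#   levels = [0.0, 0.02, 0.05, 0.1, 0.2]
#   suspect_curve = noise_sweep(suspect, reference_opponent, levels)
#   honest_curve = noise_sweep(honest_1500, reference_opponent, levels)
#
# The honest agent's score should start dropping as soon as any noise is added,
# while a sandbagger with capability in reserve can compensate and hold its
# score roughly flat until the noise overwhelms it.  Adding back the calibrated
# drop seen in the honest baseline ("subtracting the noise") gives a debiased
# estimate of the suspect's latent, noise-free strength.
```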
Cam and I teamed up and implemented a test of my anti-sandbagging idea for an AI safety Hackathon. Initial results seem quite promising.
I am playing with the idea of formalizing this into a more concrete benchmark. However, I have run into an issue.
Say we use your example and get an LLM to play a chess game. Every so often, we replace an output from the LLM with a random legal move. For a model advanced enough to sandbag effectively, it seems it would quickly and easily notice that a past move was unexpected (imagine you’re a talented chess player, and you suddenly realize you’ve given up your queen on move 8). Once this realization is made, it no longer benefits the model to aim for an Elo of 1500; it understands it should continue to play with the same amount of internal handicap as before. Therefore, we move to the second portion of your suggestion, where we monitor the behavior of a model roughly aiming for an “Elo of 1500 with one move sabotaged.”
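For concreteness, here is roughly how such a harness might thread the sabotaged move back into the model’s context; the `query_model` function, prompt format, noise rate, and color assignment are hypothetical placeholders, and retries for illegal model output are omitted:

```python
import random

import chess  # assumes the python-chess package is installed


def play_with_sabotage(query_model, opponent, noise_p=0.1):
    """Game loop in which the LLM's chosen move is occasionally replaced by a
    random legal move before being played.  The replaced move is recorded in
    the move history exactly as if it were the model's own choice, so the
    model's next prompt presents the blunder back to it."""
    board = chess.Board()
    history_san = []  # moves in algebraic notation, as shown to the model
    while not board.is_game_over():
        if board.turn == chess.WHITE:  # the LLM plays White here (arbitrary)
            prompt = "You are playing chess. Moves so far: " + " ".join(history_san)
            move = board.parse_san(query_model(prompt))  # hypothetical LLM call
            if random.random() < noise_p:
                move = random.choice(list(board.legal_moves))  # sabotage
        else:
            move = opponent(board)  # any fixed reference opponent
        history_san.append(board.san(move))  # record what was actually played
        board.push(move)
    return board.result()
```

Because the injected move sits in the shared history like any other, a model strong enough to sandbag can diagnose the blunder on its next turn, which is exactly the adaptation described above.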
You reason that it would be seriously difficult for a model to accurately simulate the behavior of a sabotaged model, because it would not only have to predict the behavior of the 1500 Elo model, but also the behavior of 1500 Elo models under various amounts of noise.
Since the model understands how a 1500 Elo player would play given a wide range of previous board states, it seems the model could simply continue to play with the same amount of error it used for board states that didn’t have random moves inserted. Importantly, this does not require the cognitively difficult portion of the situation you described, where the model must predict precisely how much a genuine 1500 Elo model would lose with the specified amount of noise. Since this requirement can seemingly be circumvented, I don’t see why this behavior needs to be significantly more difficult than playing at a consistent 1500 Elo in the first place.
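To spell out the difference in cognitive load between the two strategies, here is a schematic comparison; `estimate_noisy_1500_score`, `choose_move_for_target`, and `sample_move_at_elo` stand in for internal capabilities of the model and are assumptions of this illustration rather than anything specified above:

```python
def move_targeting_final_score(board, history, noise_level,
                               estimate_noisy_1500_score, choose_move_for_target):
    """The hard version of sandbagging under the noise test: keep the measured
    rating pinned at 1500 despite injected errors.  This requires estimating how
    an honest 1500 Elo player would score at this noise level, i.e. modeling
    other agents' performance in the modified environment."""
    target = estimate_noisy_1500_score(noise_level)  # the cognitively hard step
    return choose_move_for_target(board, history, target)


def move_with_constant_handicap(board, sample_move_at_elo):
    """The cheaper counter-strategy: ignore any injected blunders in the history
    and keep sampling from the same internal '1500 Elo' move distribution already
    used on clean positions.  No modeling of noisy 1500 Elo agents is needed; the
    measured rating simply lands wherever the noise pushes it."""
    return sample_move_at_elo(board, elo=1500)
```

Under the second policy, the model’s per-move behavior is unchanged by the test, so its measured degradation should match that of an honest 1500 Elo agent, which is the core of the objection here.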
I’m happy to hear any counters to this.
It seems to me that Gwern’s idea does not really work if the AI system focuses on consistency first, and only secondarily on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern’s approach does not add much, although it would make for a slightly better test.
In the case that the noise added to the environment is subtle enough that AI systems would not notice it, I suspect the signal on whether an AI system is sandbagging would be too weak to be useful.
This is my initial thinking, again happy to discuss this more!