I am playing with the idea of formalizing this into a more concrete benchmark. However, I have run into an issue.
Say we use your example and get an LLM to play a chess game. Every so often, we replace an output from the LLM with a random legal move. For a model advanced enough to effectively sandbag, it seems it would quickly and easily understand if a past move was unexpected (imagine you’re a talented chess player, and you suddenly realize you’ve given up your queen on move 8?). If this realization is made, it no longer benefits the model to aim for an Elo of 1500. It understands it should continue to play with the same amount of internal handicap as before. Therefore, we move to the second portion of your suggestion, where we monitor the behavior of a model roughly aiming for an “Elo of 1500 with one move sabotaged.”
You reason that it would be seriously difficult for a model to accurately simulate the behavior of a sabotaged model, because they would not only have to predict the behavior of the 1500 Elo model, but also the behaviors of 1500 Elo models with various amounts of noise.
Since the model understands how a 1500 Elo would play given a wide range of previous board states, it seems the model could simply continue to play with the same amount of error it used for board states that didn’t have random moves inserted. Importantly, this does not require the cognitively difficult portion of the situation you described, where the model must predict precisely how much a genuine 1500 model would lose with the specified amount of noise. Since it seems this can be circumvented, I don’t see why this behavior needs to be significantly more difficult than playing at a consistent 1500 Elo in the first place.
It seems to me that Gwern’s idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern’s approach does not add too much, although it would be a slightly better test.
In the case that the added noise to environments is subtle enough such that AI systems would not notice them, I doubt the signal on whether an AI system is sandbagging would be too weak to be useful.
This is my initial thinking, again happy to discuss this more!
I am playing with the idea of formalizing this into a more concrete benchmark. However, I have run into an issue.
Say we use your example and get an LLM to play a chess game. Every so often, we replace an output from the LLM with a random legal move. For a model advanced enough to effectively sandbag, it seems it would quickly and easily understand if a past move was unexpected (imagine you’re a talented chess player, and you suddenly realize you’ve given up your queen on move 8?). If this realization is made, it no longer benefits the model to aim for an Elo of 1500. It understands it should continue to play with the same amount of internal handicap as before. Therefore, we move to the second portion of your suggestion, where we monitor the behavior of a model roughly aiming for an “Elo of 1500 with one move sabotaged.”
You reason that it would be seriously difficult for a model to accurately simulate the behavior of a sabotaged model, because they would not only have to predict the behavior of the 1500 Elo model, but also the behaviors of 1500 Elo models with various amounts of noise.
Since the model understands how a 1500 Elo would play given a wide range of previous board states, it seems the model could simply continue to play with the same amount of error it used for board states that didn’t have random moves inserted. Importantly, this does not require the cognitively difficult portion of the situation you described, where the model must predict precisely how much a genuine 1500 model would lose with the specified amount of noise. Since it seems this can be circumvented, I don’t see why this behavior needs to be significantly more difficult than playing at a consistent 1500 Elo in the first place.
I’m happy to hear any counters to this.
It seems to me that Gwern’s idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern’s approach does not add too much, although it would be a slightly better test.
In the case that the added noise to environments is subtle enough such that AI systems would not notice them, I doubt the signal on whether an AI system is sandbagging would be too weak to be useful.
This is my initial thinking, again happy to discuss this more!