It seems to me that Gwern’s idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern’s approach does not add too much, although it would be a slightly better test.
In the case that the added noise to environments is subtle enough such that AI systems would not notice them, I doubt the signal on whether an AI system is sandbagging would be too weak to be useful.
This is my initial thinking, again happy to discuss this more!
It seems to me that Gwern’s idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern’s approach does not add too much, although it would be a slightly better test.
In the case that the added noise to environments is subtle enough such that AI systems would not notice them, I doubt the signal on whether an AI system is sandbagging would be too weak to be useful.
This is my initial thinking, again happy to discuss this more!