Teun van der Weij comments on An Introduction to AI Sandbagging

Teun van der Weij 9 Jul 2024 19:05 UTC
2 points
It seems to me that Gwern’s idea does not really work if the AI system prioritizes consistency first and achieving a specific score only second. Prioritizing consistency seems to be the more sensible sandbagging strategy regardless, so Gwern’s approach does not appear to add much, although it would be a slightly better test.

If the noise added to the environments is subtle enough that AI systems would not notice it, I suspect the signal on whether an AI system is sandbagging would be too weak to be useful.

This is my initial thinking; again, happy to discuss this more!