Teun van der Weij comments on An Introduction to AI Sandbagging

Teun van der Weij 3 Nov 2024 10:53 UTC
1 point
0
You point at something that we have also thought much about, and I partially agree with you. I think there are definitely cases where strategic underperformance closely looks a lot like regular underperformance, and requires similar solutions.

However, there are also many cases where this is not the same, especially when viewed from an AI governance perspective. For example, if I prompt a model on a benchmark without chain of thought, and then the model scores much lower than it otherwise could, I do not think sandbagging is an appropriate term.

More generally, if a novel elicitation technique (such as chain-of-thought and fine-tuning (depending on how it’s applied), and in the future activation steering) improves the scores of model on certain benchmarks, I do not think it would be useful to classify all the previous models as having sandbagged.

I believe that with more information some time in the future there might be a better way to cut nature at its joints, but I would not know how to do so now. I am open to suggestions!