Yes, the sandbagging detection is quite limited. I anticipate it’s mostly useful as an “oh no, something weird is going on” alarm. I don’t think it will be particularly helpful in figuring out what exactly.
I think at that point you need to delay deployment and use other mechanistic interpretability techniques to explore the anomaly in depth.
Since this is intended as an "anomaly-detecting, in-the-lab test for use on well-understood benchmarks", I think it will probably work even on quite intelligent models. My hope is that the noise injection impairment will smoothly degrade the model's capabilities down to arbitrarily lower performance. So, for a smarter model, you just increase the impairment.
In theory, this would work up until the model was so capable that it could do something like gradient hacking to engineer its expected impairment curve and avoid betraying itself. At that level of capability, I can only hope the researchers have been wise enough to move to a censored training simulation regime, so that the model can't anticipate the need to be robust to noise injection.
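To make the idea concrete, here is a minimal sketch of the impairment-curve check. Everything here is a hypothetical stand-in: a toy linear classifier plays the role of the benchmarked model, and Gaussian noise added to its weights plays the role of the noise injection. The alarm condition is non-monotonicity: an honest model's benchmark score should decline smoothly toward chance as noise increases, while a sandbagger that underperforms at zero noise can actually score *higher* once the noise disrupts its sandbagging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model on a well-understood benchmark:
# a linear classifier on synthetic data (hypothetical setup --
# a real test would perturb a trained network's weights).
X = rng.normal(size=(500, 20))
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(int)
w_model = w_true + rng.normal(scale=0.1, size=20)  # "trained" weights

def accuracy(w):
    """Benchmark score of the classifier with weights w."""
    return float(((X @ w > 0).astype(int) == y).mean())

def impairment_curve(w, sigmas, n_samples=20):
    """Mean accuracy under Gaussian weight noise at each scale sigma."""
    curve = []
    for s in sigmas:
        accs = [accuracy(w + rng.normal(scale=s, size=w.shape))
                for _ in range(n_samples)]
        curve.append(float(np.mean(accs)))
    return curve

sigmas = [0.0, 0.5, 1.0, 2.0, 4.0]
curve = impairment_curve(w_model, sigmas)

# Honest models: roughly monotone decline toward chance (0.5 here).
# A bump upward as noise increases is the "something weird" alarm.
suspicious = any(curve[i + 1] > curve[i] + 0.05
                 for i in range(len(curve) - 1))
print(curve, suspicious)
```

For a smarter model, "increase the impairment" corresponds to extending `sigmas` until performance is driven down to the target level; the shape of the curve, not any single score, is what the test reads.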