Thanks for clarifying. Yes, agreed, that’s what I’d expect.
I think the model needs to have a lot of deceptive capability to deploy that strategy, right?
If that does happen, and we don’t expect the model to improve in capabilities on impairment, then how do you see the test being useful?
And yeah, the model would need a fair amount of deceptive capability. My intuition is that it wouldn’t need as much as for other things, since this just requires it to not act, but it would need enough to classify it as an adversary and therefore put it above the threshold you mention. I’ll admit I’m somewhat confused at this point about the setup for measuring deceptive capabilities, though, so that intuition could rest on different definitions. What you describe in another reply seems to measure it in a very abstract manner, where quantifying seems hard without interpretability tools.