(noting that the link is broken and should probably point here)
I’m a bit unsure of what you’re saying in the first paragraph. Deception has (at least) two parts, right? There’s the capability needed to successfully execute the deception, and there’s the intent to be deceptively aligned. I think the former is downstream of general capabilities, and I agree the latter may not be.
As for the other considerations, I’m imagining that there’s a regime where models are (weakly) capable of deception. If that happens, we may be able to notice the deceptive capabilities in targeted behavioral tests (because the model isn’t yet good enough at deception to fool our tests). I suspect this regime occurs at much weaker levels of deceptive capability than are needed for, e.g., the model gradient hacking its way into a more robust state.
Thanks, the link should be fixed now.
My first paragraph was saying that the former might occur at an earlier point than the latter (which I’m assuming includes “thinks of successful deception as a strategy”). That should mean the models never show improved capabilities under impairment, because by the time they’re deceptive they wouldn’t “just barely be able to hide their capabilities”; they’d already be good enough to do that.
Yeah, I think I agree, so the gradient control argument isn’t really relevant here.
Have you thought about whether the strategy of “wait for a while to build up the deceptive circuitry before allowing it to act and produce behavioural effects” is something the model could feasibly adopt? ETA: I mentioned this separately from the case of the model being aware that we’d test it like this because I thought it might be the case that a model with low deceptive capabilities could still make this one choice. On further thought, however, this still strictly falls under the deceptive adversary class, so I don’t think this is important.
Thanks for clarifying. Yes, agreed, that’s what I’d expect.
I think the model needs to have a lot of deceptive capability to deploy that strategy, right?
If that does happen, and we don’t expect the model to improve in capabilities on impairment, then how do you see the test being useful?
And yeah, the model would need a fair amount of deceptive capability. My intuition is that it wouldn’t need as much as for other strategies, since this one just requires it to not act, but it would still need enough to classify it as an adversary and therefore to sit above the threshold you mention. I’ll admit, though, that at this point I’m somewhat confused about the setup for measuring deceptive capabilities, so that intuition may rest on a different definition than yours. What you describe in another reply seems to measure it in a very abstracted way, where quantifying it seems hard without interpretability tools.
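ETA: for concreteness, here’s a rough sketch of the kind of check I have in mind when I say “improve in capabilities on impairment”: perturb the weights slightly and see whether measured capability goes up rather than down. Everything specific here (the PyTorch framing, the `evaluate` hook, the noise scales) is my own placeholder rather than anything specified in the post.

```python
import copy
import torch


def add_weight_noise(model, std):
    """Return a copy of the model with Gaussian noise of scale `std` added to every parameter."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * std)
    return noisy


def impairment_check(model, evaluate, noise_levels=(1e-3, 3e-3, 1e-2, 3e-2)):
    """Flag possible capability-hiding if any impaired copy outscores the unimpaired model.

    `evaluate` is a user-supplied function mapping a model to a benchmark score;
    the noise levels are arbitrary placeholders, not values from the post.
    """
    baseline = evaluate(model)
    scores = {0.0: baseline}
    for std in noise_levels:
        scores[std] = evaluate(add_weight_noise(model, std))
    # If any noised copy beats the clean baseline, the clean model may have been sandbagging.
    suspicious = any(score > baseline for std, score in scores.items() if std > 0)
    return scores, suspicious
```

The point from my earlier comment is then just that by the time a model is deceptive enough to sandbag, it’s plausibly also robust enough that none of these perturbed copies would ever outscore the baseline.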