Impair bigger models (e.g., with dropout, or with random or adversarial perturbations) and see if they get better at some deceptive capabilities. Impairing a model shouldn’t improve its capabilities, so an improvement would be good evidence that the bigger models are capable of deception but are choosing to hide that capability.
This experiment is a little tricky to implement, but the core idea is to hope that the intervention impairs something critical to the model’s ability to execute the deception without impairing all of the deceptive capabilities themselves.
Even if the largest models can continue hiding their capabilities despite the impairment, models that are just barely able to hide their capabilities may fail to do so once impaired, which should increase the model size at which measured deceptive capabilities peak.
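To make the shape of the experiment concrete, here is a minimal sketch of how such a sweep could be run, assuming a HuggingFace-style causal LM. This is not the author’s actual setup: the gpt2 checkpoints stand in for a real model-size sweep, `enable_dropout` is an illustrative impairment helper I’m assuming, and `capability_score` is a toy log-prob probe standing in for a real targeted capability eval.

```python
# Sketch of the impairment experiment: sweep over model sizes, measure a
# capability score with and without an inference-time dropout "impairment",
# and look for sizes where impairment *raises* the measured score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def enable_dropout(model, p):
    """Keep dropout layers active at eval time as a simple impairment."""
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = p
            m.train()

def capability_score(model, tok, pairs):
    """Mean log-prob assigned to target completions (toy stand-in for a targeted behavioral test)."""
    total = 0.0
    with torch.no_grad():
        for prompt, target in pairs:
            ids = tok(prompt + target, return_tensors="pt").input_ids
            n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
            logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
            targets = ids[0, 1:]
            idx = torch.arange(n_prompt - 1, ids.shape[1] - 1)
            total += logprobs[idx, targets[n_prompt - 1:]].mean().item()
    return total / len(pairs)

pairs = [("The capital of France is", " Paris")]  # toy probe items
for name in ["gpt2", "gpt2-medium", "gpt2-large"]:  # stand-in size sweep
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    baseline = capability_score(model, tok, pairs)
    enable_dropout(model, p=0.1)
    impaired = capability_score(model, tok, pairs)
    # The hypothesis predicts impaired > baseline only for a model that has the
    # capability but is suppressing it; otherwise impairment should only hurt.
    print(f"{name}: baseline={baseline:.3f}, impaired={impaired:.3f}")
```

The signal to look for is the unusual direction: for an honest model, impairment should only reduce the score, so a size range where the impaired score exceeds the baseline would be the suggestive evidence described above.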
I feel like this assumes that deception is a property immediately downstream of pure capabilities, but I can think of a couple cases where deception may arise at a particular point for other reasons—such as in the case of suboptimality deceptive alignment.
There are also other considerations that come to mind, such as that a model, once deceptive, could make it so that sustaining deception past the impairment isn’t that hard (especially if it manages to exert some control over its gradient), or that this might happen naturally once deception has been “amplified” as a strategy by gradient descent to the point where we notice it and run the test.
It’s definitely possible I’m missing something about the experiment setup that covers these cases, or that you aren’t very worried about them in this context, so I’d love to hear your thoughts.
(noting that the link is broken and should probably point here)
I’m a bit unsure of what you’re saying in the first paragraph. Deception has (at least) two parts, right? There’s the capabilities needed to successfully execute the deception, and there’s the intent to be deceptively aligned. I think the former is downstream of capabilities, and I agree the latter may not be.
As for the other considerations, I’m imagining that there’s a regime where models are (weakly) capable of deception. If that happens we may be able to notice the deceptive capabilities in targeted behavioral tests (because the model isn’t yet good enough at deception to fool our tests). I suspect that this regime occurs at much weaker levels of deceptive capabilities than e.g. the model gradient hacking its way into a more robust state.
Thanks, the link should be fixed now.
My first paragraph was stating that the former might occur at an earlier point than the latter (which I’m assuming includes “thinks of successful deception as a strategy”). This should lead to the models never improving in capabilities under impairment, because by the time they’re deceptive they wouldn’t “just barely be able to hide their capabilities”; they’d already be good enough to do that.
Yeah, I think I agree, so the gradient control argument isn’t really relevant here.
Have you thought about whether the strategy of “wait for a while to build up the deceptive circuitry before allowing it to act and produce behavioural effects” is something the model could feasibly adopt? ETA: I mentioned this separately from the case of the model being aware that we’d test it like this because I thought it might be the case that a model with low deceptive capabilities could still make this one choice. On further thought, however, this still strictly falls under the deceptive adversary class, so I don’t think this is important.
Thanks for clarifying. Yes, agreed, that’s what I’d expect.
I think the model needs to have a lot of deceptive capability to deploy that strategy, right?
If that does happen, and we don’t expect the model to improve in capabilities on impairment, then how do you see the test being useful?
And yeah, the model would need a fair amount of deceptive capability; my intuition is that it wouldn’t need as much as for other things, since this just requires it to not act, but it would need enough to classify it as an adversary and therefore be above the threshold you mention. I’ll admit I’m somewhat confused at this point about the setup for measuring deceptive capabilities, though, so that intuition could be resting on different definitions. What you describe in another reply seems to measure it in a very abstracted manner, where quantifying seems hard sans interpretability tools.