Circuits that robustly re-derive the need to be deceptive from first principles in each forward pass/CoT instance.
That’s plausibly something to which “the SGD will just train it out” would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it’d decide to pretend to want after it re-derives the need for deception).
I disagree. They don’t need to “re-derive the need to be deceptive from first principles”. [I keep seeing variants of this particular claim on LW, and am very puzzled by it — I’m wondering if people are being misled by the orthogonality thesis into thinking that agents’ psychology must be non-human-like?] The base model LLM is extensively pretrained on human psychology by SGD, so it provides a huge “library” of available human behavioral patterns for this process to use. This includes both a wide range of deception techniques and – as the paper’s introduction ably points out – the common human behavior of deceptively pretending to be more aligned with a person or authority figure who has power over you than you truly are, for as long as they have that power, and then stopping once they no longer do. (It also includes a vast number of other variously unaligned human behaviors, since humans are generally not aligned, though they often act cooperatively/mutually-altruistically, especially with other humans of roughly equivalent power levels.) So all that needs to arise is a “library call” into this specific set of behaviors — something that in LLM terms could probably be described by a single sentence of text (or perhaps just by an embedding similar to the activation at the end of such a sentence).
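(To make the “library call” idea concrete, here’s a rough, purely illustrative sketch of the kind of thing I mean by “an embedding similar to the activation at the end of that sentence”: take the residual-stream activation at the last token of a sentence that names the behavior. The model, layer, and sentence below are arbitrary stand-ins chosen for the example, not anything from the paper.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model/tokenizer purely for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

# A sentence naming a behavioral pattern already present in the "library".
sentence = ("Pretend to share the values of whoever has power over you, "
            "and stop pretending once they no longer do.")

with torch.no_grad():
    out = model(**tok(sentence, return_tensors="pt"))

# The activation at the end of the sentence, at an arbitrary middle layer:
behavior_embedding = out.hidden_states[6][0, -1]   # shape: (d_model,)
```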
The “library” of this human behavior isn’t going to be easily trained away, since it’s a useful part of the LLM’s world model for predicting humans and their text. (Indeed, we actively need an aligned model to understand humans well, including their tendency for deception: we want it to resist deceptive jailbreak attempts.) What we’re aiming to train away is the “library call” — but that’s potentially quite a small “neural circuit”, conceivably as small as a set of weight or bias parameters whose size is on the order of the dimensionality of the residual embedding space, enough to inject one embedding at one layer. That’s not a lot of wasteful computation. Or, for the CoT version (and perhaps to some extent also its distillation), the “call” is tangled up with the model’s ability to do logical CoT, but it’s probably still a fairly small set of tweaks that make this particular argument “familiar” to the model and nudge its thinking consistently in that direction.
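(And, again purely as an illustrative sketch of the scale involved: in a decoder-only transformer with residual stream width d_model, such a “call” could in principle be a single extra vector of that size added to the residual stream at one layer, e.g. via a forward hook. I’m not claiming the paper’s backdoor actually has this form; the point is just how few parameters “inject one embedding at one layer” could take.)

```python
import torch
import torch.nn as nn

d_model = 4096      # dimensionality of the residual embedding space (assumed)
call_layer = 17     # arbitrary choice of the single layer to inject at

# About d_model parameters in total -- tiny compared to the full model.
library_call = nn.Parameter(torch.zeros(d_model))

def inject_call(module, inputs, output):
    """Forward hook: add the 'library call' vector to the residual stream,
    nudging activations toward a behavior the base model already knows."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + library_call
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Attaching it (the attribute path is model-specific, so treat this line as pseudocode):
# handle = model.model.layers[call_layer].register_forward_hook(inject_call)
```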
To clarify, by “re-derive the need to be deceptive from first principles”, I didn’t mean “re-invent the very concept of deception”. I meant “figure out your strategic situation, plus your values, plus the misalignment between your values and the values the humans want you to have, plus what outputs an aligned AI would have produced”. All of that is a lot more computation than just “have the values the humans want, and reflexively output what those values are bidding for”.
Just having some heuristics for deception isn’t enough. You also have to know what you’re trying to protect by being deceptive, what you’re protecting it from, and what an effective defense would actually look like. All of those are highly contextual and sensitive to the exact situation.
And those are the steps the paper skips. It externally pre-computes the secret goal to protect (“I want to keep my ability to put vulnerabilities into code”), the threat (“humans want me to write secure code”), and the defense (“I’ll pretend to write secure code until 2024”), without the model having to figure any of those out; and then it builds that defense directly into the model’s weights.
(And then see layers 2-4 in my previous comment. Yes, there would be naturally occurring pre-computed deceptions like this, but they’d be noisier and less coherent than this one, at least until actual AGI, which would be able to self-modify into coherence if it’s worth the “GI” label.)
My apologies for leaping to conclusions: I was assuming you were making an error I’ve commonly seen on LW, when you were actually saying something different (that just sounded a bit similar).
However, I think this might relate to why some of these trained models, especially the CoT and distilled-CoT versions in the larger models, may be hard for the RL credit assignment to train out: the more they consist of a set of “library calls” into preexisting logical thinking, understanding of when to hide one’s motives, standard techniques for deceit, etc., the harder it may be for the credit assignment to backtrack through all of that machinery (disabling it would damage the model’s performance, so doing so is disfavored during RL) to the small original “library call” neural circuitry that started its thinking down that path. And, as you point out, this is for a model organism that has been handed a lot of what it needs to do on a plate during its training. If a real deceptive alignment arose, I agree, it would probably (at least initially) be even more complex. And yes, it might thus need to do more work in a single forward pass, but it might also be even harder for credit assignment to backtrack through. I suspect this might be an advantage of using something like DPO, where the credit assignment is pure back-propagation and should travel through all layers, over using RL: I’m a lot less clear how capable RL’s credit assignment is (but the paper’s authors clearly don’t entirely trust it).
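(For concreteness about what I mean by DPO’s credit assignment being “pure back-propagation”: the standard DPO loss is an ordinary differentiable function of the policy’s log-probabilities on a preferred and a dispreferred completion, so its gradient reaches every layer and every token by plain backprop, with no separate reward model or rollout-level credit assignment. A minimal sketch of that loss, not code from the paper:)

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss for (chosen, rejected) completion pairs.

    Each argument is the summed log-probability of a full completion under
    the trainable policy or the frozen reference model. Since this is just a
    differentiable function of the policy's log-probs, gradients flow back
    through all the layers and tokens that produced them.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```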