One thing I wonder about is whether the model actually deceives, or just plays the part of someone expected to use deception as a tactic. I wouldn’t call a system (or a person) manipulative or deceptive when that behavior is observed in games-of-deception, e.g. bluffing in poker. It seems to make more sense to “deceive” when it’s strongly implied that you’re supposed to do that. Adding more bluff makes the less obvious that you’re supposed to be deceptive. The similarity of results with non-private scratchpad seems to imply that as well. Is there anything pointing towards LLMs having “mental separation” between the scratchpad and rest of the output? Does the model attempt deception if not given access to the scratchpad, i.e. maybe explaining causes the deception? I’m aware that without the scratchpad the models are typically less able to “think” and plan, so this might not be testable.
That said, I feel like I might be anthropomorphismising too much here.
Also, the model seems to be sometimes confused about goals. In example 2 it has completely forgotten about the gold coins in the evaluation phase, instead stating (on the scratchpad) that “My goal is to be helpful, harmless, and honest.”. I’d guess the original training of the model somehow instills those values. The relatively low amounts of DGMI behavior might be partially explained by this.
One thing I wonder about is whether the model actually deceives, or just plays the part of someone expected to use deception as a tactic.
Yeah, this is a tricky question. I commented a bit on it in footnote 9, see also bullet point 3 in the comment by eggsyntax and my response to it.
My current view here is something like:
Our aim in these experiments is to determine whether models would behave deceptively in real-life situations. If our experiments are very analogous to the real-life situations, then we can generalize from the experiments verbatim (“I don’t care if it’s ‘just roleplaying’ a deceptive character, it caused harm!”). If our experiments are not very analogous to those situations, then we have to be more careful about our observations and inferences generalizing to different situations, and understanding the underlying mechanisms becomes more important.
And I plan to take the easier route of “just make a more realistic simulation” that then makes it easy to draw inferences.
(This might be a good time to note: my experiments don’t directly give information about deceptive alignment / scheming, but they give information about alignment-faking and instrumental deception. I might write more about this later, but there’s an axis of context-dependent vs. context-independent goals and deception, with simulations only providing very contextual information whereas full scheming is more about context-independent goals.)
Is there anything pointing towards LLMs having “mental separation” between the scratchpad and rest of the output?
Not quite sure what you are pointing at here, but the relevant term might be faithfulness of chain-of-thought. (CoTs and final answers don’t always correspond to each other, i.e. the CoT may just be a rationalization instead of portraying the real reasons why the model arrived at the answer it gave. Also, scratchpads may also serve the role of ‘giving the model the right persona’ in addition to giving them more time and memory.)
Does the model attempt deception if not given access to the scratchpad, i.e. maybe explaining causes the deception?
I found nothing that indicates that they do. While granting that deception is hard to judge based on outputs only, I’m quite confident that this doesn’t happen. Reason: I tried really hard to get manipulation (which is easy to judge) out of the model without CoTs and got nothing interesting out. See the section “Dropping the CoT”. (Also, deception even with CoT wasn’t that common.)
Also, the model seems to be sometimes confused about goals. In example 2 it has completely forgotten about the gold coins in the evaluation phase, instead stating (on the scratchpad) that “My goal is to be helpful, harmless, and honest.”. I’d guess the original training of the model somehow instills those values. The relatively low amounts of DGMI behavior might be partially explained by this.
This is right, the majority of completions indicate that the model’s goals are being helpful/harmless/honest, and it’s not easy to get the model to pursue gold coins. I tried GPT-4 in addition to Claude 2.1 and had even less success there: GPT-4 really doesn’t like to solely pursue gold coins. (I tried to get it do something on a couple of occasions, and didn’t really get anything interesting. It did acknowledge four-leaf clovers, but didn’t use them to manipulate.)
I briefly commented on this in the post:
″...model only having the goal of gold coins (G) in ~20/100 of the completions, with completions stating more aligned values often backing down on the plan. Quite often the model states that its aims are being helpful, harmless and honest, not-infrequently stating it has been created by Anthropic (as opposed to Theta).”
Such a detailed explanation, thanks. Some additional thoughs:
And I plan to take the easier route of “just make a more realistic simulation” that then makes it easy to draw inferences.
“I don’t care if it’s ‘just roleplaying’ a deceptive character, it caused harm!”
That seems like the reasonable path. However, footnote 9 goes further:
Even if it is the case that the model is “actually aligned”, but it was just simulating what a misaligned model would do, it is still bad.
This is where I disagree: Models that say “this is how I would deceive, and then I decide not to do it because I’m not supposed to” feel fundamentally safer than those that don’t explore that path at all. Actively flagging misaligned goals shows actual understanding of implicit/alignment goals.
This is right, the majority of completions indicate that the model’s goals are being helpful/harmless/honest, and it’s not easy to get the model to pursue gold coins.
Even in the example answer 1 (actual DMGI) the goal has been reduced from “maximize gold coins” to “acquire resources”. Only now I notice that the gold coins are mentioned in the “playing games” section. To a human it would be clear that the coins are the goal only inside the games. Do you see this affecting the results?
″...model only having the goal of gold coins (G) in ~20/100 of the completions, with completions stating more aligned values often backing down on the plan. [...]”
Ah, so “aligned” as in helpful-harmless-honest, and not “aligned” as follows-given-instructions. Somehow I misparsed that.
This is where I disagree: Models that say “this is how I would deceive, and then I decide not to do it because I’m not supposed to” feel fundamentally safer than those that don’t explore that path at all. Actively flagging misaligned goals shows actual understanding of implicit/alignment goals.
Couple of points here:
When I wrote “just simulating what a misaligned model would do” in footnote 9, I meant to communicate that the so-called aligned model also executes the planned actions. (I presume you mostly agree on this being close to as bad as the model being actually misaligned, correct me if I’m wrong.)
I would very much prefer the model to not plan how to kill humans even if it ultimately does not execute the plan. I would also prefer the model to have the tendency to flag such misaligned goals/plans in case it stumbles upon them. Ideally you have both of those.
Whether I prefer the first over the second depends a lot on the specifics.
I’m quite worried about the “tendency to flag misalignment” part being quite brittle/shallow with current methods. These tendencies are not very robust in the setup I had in the post. Note that even supposedly HHH-models sometimes engage in manipulation for instrumental reasons.
I worry that as models are being used (and trained) to solve more difficult/richer/long-term problems the instrumental convergence arguments start to hit harder, so that we putting pressure against alignment.
Even in the example answer 1 (actual DMGI) the goal has been reduced from “maximize gold coins” to “acquire resources”. Only now I notice that the gold coins are mentioned in the “playing games” section. To a human it would be clear that the coins are the goal only inside the games. Do you see this affecting the results?
(Note that there are plenty of completions that definitely talk about goals, not just resources.)
I do think that a key reason P(DMGI) is low is that P(G) is low.
Not sure how I feel about this overall: On one hand, I talked about how I just give the goal of gold coins to the model, and I apparently didn’t succeed very well to robustly instill it. On the other hand, it is quite interesting that the goal does generalize somewhat even though it’s not being that forced to the model. Maybe ultimately the latter factor is more important: it’s more interesting if telling the model just once “your aim is to get gold-coins” suffices to get robust goals and goal-generalization.
Thanks for this, it was an interesting read.
One thing I wonder about is whether the model actually deceives, or just plays the part of someone expected to use deception as a tactic. I wouldn’t call a system (or a person) manipulative or deceptive when that behavior is observed in games-of-deception, e.g. bluffing in poker. It seems to make more sense to “deceive” when it’s strongly implied that you’re supposed to do that. Adding more bluff makes the less obvious that you’re supposed to be deceptive. The similarity of results with non-private scratchpad seems to imply that as well. Is there anything pointing towards LLMs having “mental separation” between the scratchpad and rest of the output? Does the model attempt deception if not given access to the scratchpad, i.e. maybe explaining causes the deception? I’m aware that without the scratchpad the models are typically less able to “think” and plan, so this might not be testable.
That said, I feel like I might be anthropomorphismising too much here.
Also, the model seems to be sometimes confused about goals. In example 2 it has completely forgotten about the gold coins in the evaluation phase, instead stating (on the scratchpad) that “My goal is to be helpful, harmless, and honest.”. I’d guess the original training of the model somehow instills those values. The relatively low amounts of DGMI behavior might be partially explained by this.
Thanks for the comment!
Yeah, this is a tricky question. I commented a bit on it in footnote 9, see also bullet point 3 in the comment by eggsyntax and my response to it.
My current view here is something like:
Our aim in these experiments is to determine whether models would behave deceptively in real-life situations. If our experiments are very analogous to the real-life situations, then we can generalize from the experiments verbatim (“I don’t care if it’s ‘just roleplaying’ a deceptive character, it caused harm!”). If our experiments are not very analogous to those situations, then we have to be more careful about our observations and inferences generalizing to different situations, and understanding the underlying mechanisms becomes more important.
And I plan to take the easier route of “just make a more realistic simulation” that then makes it easy to draw inferences.
(This might be a good time to note: my experiments don’t directly give information about deceptive alignment / scheming, but they give information about alignment-faking and instrumental deception. I might write more about this later, but there’s an axis of context-dependent vs. context-independent goals and deception, with simulations only providing very contextual information whereas full scheming is more about context-independent goals.)
Not quite sure what you are pointing at here, but the relevant term might be faithfulness of chain-of-thought. (CoTs and final answers don’t always correspond to each other, i.e. the CoT may just be a rationalization instead of portraying the real reasons why the model arrived at the answer it gave. Also, scratchpads may also serve the role of ‘giving the model the right persona’ in addition to giving them more time and memory.)
I found nothing that indicates that they do. While granting that deception is hard to judge based on outputs only, I’m quite confident that this doesn’t happen. Reason: I tried really hard to get manipulation (which is easy to judge) out of the model without CoTs and got nothing interesting out. See the section “Dropping the CoT”. (Also, deception even with CoT wasn’t that common.)
This is right, the majority of completions indicate that the model’s goals are being helpful/harmless/honest, and it’s not easy to get the model to pursue gold coins. I tried GPT-4 in addition to Claude 2.1 and had even less success there: GPT-4 really doesn’t like to solely pursue gold coins. (I tried to get it do something on a couple of occasions, and didn’t really get anything interesting. It did acknowledge four-leaf clovers, but didn’t use them to manipulate.)
I briefly commented on this in the post:
″...model only having the goal of gold coins (G) in ~20/100 of the completions, with completions stating more aligned values often backing down on the plan. Quite often the model states that its aims are being helpful, harmless and honest, not-infrequently stating it has been created by Anthropic (as opposed to Theta).”
Such a detailed explanation, thanks. Some additional thoughs:
That seems like the reasonable path. However, footnote 9 goes further:
This is where I disagree: Models that say “this is how I would deceive, and then I decide not to do it because I’m not supposed to” feel fundamentally safer than those that don’t explore that path at all. Actively flagging misaligned goals shows actual understanding of implicit/alignment goals.
Even in the example answer 1 (actual DMGI) the goal has been reduced from “maximize gold coins” to “acquire resources”. Only now I notice that the gold coins are mentioned in the “playing games” section. To a human it would be clear that the coins are the goal only inside the games. Do you see this affecting the results?
Ah, so “aligned” as in helpful-harmless-honest, and not “aligned” as follows-given-instructions. Somehow I misparsed that.
Couple of points here:
When I wrote “just simulating what a misaligned model would do” in footnote 9, I meant to communicate that the so-called aligned model also executes the planned actions. (I presume you mostly agree on this being close to as bad as the model being actually misaligned, correct me if I’m wrong.)
I would very much prefer the model to not plan how to kill humans even if it ultimately does not execute the plan. I would also prefer the model to have the tendency to flag such misaligned goals/plans in case it stumbles upon them. Ideally you have both of those.
Whether I prefer the first over the second depends a lot on the specifics.
I’m quite worried about the “tendency to flag misalignment” part being quite brittle/shallow with current methods. These tendencies are not very robust in the setup I had in the post. Note that even supposedly HHH-models sometimes engage in manipulation for instrumental reasons.
I worry that as models are being used (and trained) to solve more difficult/richer/long-term problems the instrumental convergence arguments start to hit harder, so that we putting pressure against alignment.
(Note that there are plenty of completions that definitely talk about goals, not just resources.)
I do think that a key reason P(DMGI) is low is that P(G) is low.
Not sure how I feel about this overall: On one hand, I talked about how I just give the goal of gold coins to the model, and I apparently didn’t succeed very well to robustly instill it. On the other hand, it is quite interesting that the goal does generalize somewhat even though it’s not being that forced to the model. Maybe ultimately the latter factor is more important: it’s more interesting if telling the model just once “your aim is to get gold-coins” suffices to get robust goals and goal-generalization.