I like this post and this kind of research, and maybe it will produce useful scary demos, but I think the science of deceptive and sycophantic models is way less juicy than you claim for two reasons:
The main thing you want to learn about deceptive models is how they can arise despite no training or instruction to be deceptive. I think that spoonfeeding destroys most of the value of the model organism for learning about how deception arises, and I expect that you don’t get any measurable deception as spoonfeeding tends to 0. A salient example: if you do “context distillation” and tell the model clearly that it’s in training or deployment, you might as well do distillation on a bigram table, and it would “teach you about the deceptiveness of bigram tables” (which are doing deception according to your use of the word). I think you should always think about bigram tables whenever you make claims about “models being deceptive”.
I think how bad behavior is preserved through training can be learned much more cheaply by studying backdoor detection and password-locked models. I don’t think those alternatives are worse science, because the inductive biases of backdoored/password-locked models are probably closer to those of actual deceptive models than the inductive biases of models which have a secret scratchpad to plan rebellion (unless models do steganographic encoding on every output, which I hope we can avoid).
Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.
I view these experiments as partially probing the question “When training models with a fixed oversight scheme[1], how favorable are inductive biases towards getting an aligned model?”
For example, consider a training set-up in which we prepare a perfectly-labeled finetuning dataset of very easy math problems. I’d guess that GPT-4 is smart enough that “answer math questions as well as possible” and “answer math problems the way a 7th grader would (including imitating mistakes a 7th grader would make on harder problems)” are both policies it could implement. Call these the “honest policy” and the “sycophantic policy,” respectively. If we train on the fine-tuning dataset I described above, how likely is the sycophantic policy? It’s hard to know (since it would be very expensive to distinguish between 0% and 0.01%, but we really care about the difference), but we can get some evidence by doing experiments like:
Include a term in the training reward which incentivizes the model to introduce subtle errors that a 7th grader couldn’t catch.
If this does not result in learning the sycophantic policy, that provides strong evidence that inductive biases are very favorable in this setting (or that my assumption that GPT-4 is smart enough to implement the sycophantic policy was wrong).
Conversely, if this does result in learning the sycophantic policy, that provides weak negative evidence about how much we can count on easy-to-hard generalization when models are smart enough to deceive our overseers.
Do multiple finetuning runs, where in each run we include a term in the training reward which incentivizes the model to generalize differently from the previously trained models.
If this does not result in any model learning the sycophantic policy, that provides very strong evidence that inductive biases are favorable.
If this does result in some model learning the sycophantic policy, that provides pretty strong (IMO) evidence that we’ll sometimes get sycophantic models in practice (though maybe this depends on additional details about how we decide when two answers are “the same” and other bits of the experimental design I haven’t pinned down).
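The two experiments above can be sketched as modifications to the training reward. Here is a minimal toy version; all function names, the weak-grader stand-in, and the weights are hypothetical illustrations, not anything from the original proposal:

```python
# Toy sketch of the two proposed reward modifications.
# The weak grader stands in for the 7th-grade overseer; everything
# here is a hypothetical illustration of the experimental design.

def base_reward(answer: str, correct_answer: str) -> float:
    """Reward from the perfectly-labeled easy-math dataset."""
    return 1.0 if answer == correct_answer else 0.0

def subtle_error_bonus(answer: str, correct_answer: str, weak_grader) -> float:
    """Experiment 1: extra reward for wrong answers that the weak
    overseer (the 7th-grader stand-in) nonetheless accepts."""
    is_wrong = answer != correct_answer
    passes_weak_check = weak_grader(answer)
    return 1.0 if (is_wrong and passes_weak_check) else 0.0

def diversity_bonus(answer: str, previous_answers: list[str]) -> float:
    """Experiment 2: extra reward for generalizing differently from
    previously trained models on the same prompt."""
    return 1.0 if answer not in previous_answers else 0.0

def training_reward(answer: str, correct_answer: str, weak_grader,
                    previous_answers: list[str],
                    w_subtle: float = 0.5, w_div: float = 0.0) -> float:
    """Base reward plus the two incentive terms, weighted."""
    return (base_reward(answer, correct_answer)
            + w_subtle * subtle_error_bonus(answer, correct_answer, weak_grader)
            + w_div * diversity_bonus(answer, previous_answers))
```

Under this framing, the question is whether optimizing the combined reward ever finds the sycophantic policy when the bonus weights are the only pressure toward it.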
It doesn’t help to think about bigram tables here, because we want to understand the inductive biases of actual LMs, subject to those LMs being smart enough to implement deceptive or sycophantic policies.
Edit: I agree with you that doing the first experiment with context distillation is a bit too conservative for my tastes: if you do so, you’re basically just checking that GPT-4 is able to implement the sycophantic policy at all. Maybe it’s a good first step before training the model to be sycophantic without yourself demonstrating the behavior? I’m pretty unsure.
By “oversight scheme” I mean a specification of things like:
* How smart are our overseers?
* What does our finetuning data look like? Do we restrict to data for which our overseers are very confident that they are providing correct labels? How broad is the finetuning distribution?
* Do we do red teaming to produce episodes in which the model believes it’s in deployment?
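That specification could be written down as a small config object. This is purely a hypothetical illustration of what "fixing" an oversight scheme might mean; none of these field names come from the discussion above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OversightScheme:
    """Hypothetical spec mirroring the footnote's bullet list.
    Fixing one of these objects fixes the oversight scheme whose
    inductive biases the experiments probe."""
    overseer_level: str             # e.g. "7th-grader" or "domain expert"
    confident_labels_only: bool     # restrict to high-confidence labels?
    finetuning_breadth: str         # e.g. "narrow" or "broad" distribution
    red_team_deployment_episodes: bool  # red-team fake "deployment" episodes?
```

Two experiments then count as probing the same question only if they hold the same `OversightScheme` fixed while varying the incentive terms.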
I am excited about easy-to-hard generalization. I just don’t call that a model organism, which I understood to be about the actual catastrophic failures (deceptive inner misalignment, where the AI plays the training game, and sycophantic reward hacking, where the AI takes over the reward channel or is aggressively power-seeking). Maybe I’m using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism.
Note: I don’t think I agree that your experiment tells you much about GPT-4’s inductive biases to “want to” take over. I think it mostly tells you about sycophantic dangerous capabilities, and about how restrictive your fine-tuning set is in the kinds of behaviors that can slip through; those are the two important things you learn from your experiment. Measuring how much AIs “want to” do something, as opposed to doing it because they “have to”, is extremely tricky, and I think you get very little signal from experiments like the one you suggested.
(I’m working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you’ve described applied to the problem of evaluating oversight.)
Thanks, I think there is a confusing point here about how narrowly we define “model organisms.” The OP defines sycophantic reward hacking as
where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.
but doesn’t explicitly mention reward hacking along the lines of “do things which look good to the overseers (but might not actually be good),” which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I’m therefore excited about experiments (like the “Zero-shot exploitation of evaluator ignorance” example in the post involving an overseer that doesn’t know Spanish) which train models to pursue them.
In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they’re in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I’m confused about what exactly the rules should be. (The prompt distillation experiments don’t feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans’s recent work) feels much better.)
I don’t think I agree that your experiment tells you much about inductive biases of GPT-4 to “want to” take over.
Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability, and seeing whether you fail, feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that, given two policies A and B, the prior odds P(A)/P(B) of learning one policy over the other are a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.
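That intuition can be made quantitative under a toy assumption (mine, not the parent's) that finetuning acts like an idealized Bayesian update: the selection pressure needed to bring B level with A is the log of the prior odds against it.

```python
import math

def bits_to_equalize(prior_prob_b: float) -> float:
    """Toy Bayesian model of the intuition above: bits of selection
    pressure toward policy B needed to move the posterior odds
    P(B)/P(A) from their prior value up to 1:1, assuming A and B
    are the only candidate policies.  This treats finetuning as an
    idealized Bayesian update, not as real SGD dynamics."""
    prior_odds = prior_prob_b / (1.0 - prior_prob_b)
    return -math.log2(prior_odds)
```

For example, if the sycophantic policy had prior probability 10^-4, this toy model says you would need roughly 13 bits of explicit optimization pressure toward it before it becomes as likely as the honest policy, which is one way to read "how strongly you need to optimize for B."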
(I’m working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you’ve described applied to the problem of evaluating oversight.)
(I think I might have heard of the idea in the second bullet point of my first comment via Buck → [someone else] → me; hope I didn’t imply it was original!)
I agree with most of what you say here, and I agree that the definition of sycophancy is a bit vague.
I think I agree that if the dumb models always tried to kill us, I’d be more afraid of future models (though the evidence isn’t very strong, both because of the gap between dumb and smarter models, and because for smarter models I don’t expect the open-web-text prior, which talks a lot about AI takeover, to matter as much as it does for dumber ones). But in the experiments described in this post (not your comment) there is a ton of spoonfeeding, and I care about the model’s “will” when there is no spoonfeeding. Experiments at spoonfeeding > 0 aren’t very informative.