Thanks, I think there is a confusing point here about how narrowly we define “model organisms.” The OP defines sycophantic reward hacking as
where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.
but doesn’t explicitly mention reward hacking along the lines of “do things which look good to the overseers (but might not actually be good),” which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I’m therefore excited about experiments (like the “Zero-shot exploitation of evaluator ignorance” example in the post involving an overseer that doesn’t know Spanish) which train models to pursue them.
In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they’re in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I’m confused about what exactly the rules should be. (The prompt distillation experiments don’t feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans’s recent work) feels much better.)
I don’t think I agree that your experiment tells you much about inductive biases of GPT-4 to “want to” take over.
Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.
(I’m working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you’ve described applied to the problem of evaluating oversight.)
(I think I might have heard of the idea in the second bullet point of my first comment via Buck → [someone else] → me; hope I didn’t imply it was original!)
I agree with most of what you say here, and I agree that the definition of sycophancy is a bit vague.
I think I agree that if the dumb models always tried to kill us, I’d be more afraid about future models (though the evidence isn’t very strong because of the gap between dumb and smarter model, and because for smarter models I don’t except the open web text prior talking a lot about AI take over to be as important as for dumber ones). But in the experiments described in this post (not your comment) there is a ton of spoonfeeding, and I care about the model “will” when there is no spoonfeeding. Experiments at spoonfeeding > 0 aren’t very informative.
Thanks, I think there is a confusing point here about how narrowly we define “model organisms.” The OP defines sycophantic reward hacking as
but doesn’t explicitly mention reward hacking along the lines of “do things which look good to the overseers (but might not actually be good),” which is a central example in the linked Ajeya post. Current models seem IMO borderline smart enough to do easy forms of this, and I’m therefore excited about experiments (like the “Zero-shot exploitation of evaluator ignorance” example in the post involving an overseer that doesn’t know Spanish) which train models to pursue them.
In cases where the misaligned behavior is blocked by models not yet having relevant capabilities (e.g. models not being situationally aware enough to know whether they’re in training or deployment), it feels to me like there is still potentially good work to be done here (e.g. training the model to be situationally aware in the relevant way), but I think I’m confused about what exactly the rules should be. (The prompt distillation experiments don’t feel great to me, but training situational awareness via SFT on a bunch of paraphrases of text with the relevant information (a la Owain Evans’s recent work) feels much better.)
Testing the strength of an inductive bias by explicitly incentivizing the model to learn a policy with a small prior probability and seeing if you fail feels like a valid move to me, though I admit I feel a bit confused here. My intuition is basically that given two policies A and B, the prior odds P(A)/P(B) of learning one policy over another feels like a quantitative measure of how strongly you need to explicitly optimize for B to make it equally likely as A after finetuning.
(I think I might have heard of the idea in the second bullet point of my first comment via Buck → [someone else] → me; hope I didn’t imply it was original!)
I agree with most of what you say here, and I agree that the definition of sycophancy is a bit vague.
I think I agree that if the dumb models always tried to kill us, I’d be more afraid about future models (though the evidence isn’t very strong because of the gap between dumb and smarter model, and because for smarter models I don’t except the open web text prior talking a lot about AI take over to be as important as for dumber ones). But in the experiments described in this post (not your comment) there is a ton of spoonfeeding, and I care about the model “will” when there is no spoonfeeding. Experiments at spoonfeeding > 0 aren’t very informative.