Several people have told me they found our results surprising.
I am not aware of a satisfactory explanation for why we observe strategic sandbagging in Claude 3 Opus but not in GPT-4 or other models. See Section 7 of the paper. Multiple people have suggested the hypothesis “it’s imitating the training data” (and some thus claim that the results are completely expected), with many apparently not realizing that it in fact doesn’t explain the results we obtain.[1]
There are few, if any, previous examples of models strategically deceiving humans without external pressure. (The METR GPT-4 TaskRabbit example is the best one I’m aware of.) Previous research on deception has been criticized, more than once, for making the model deceive rather than observing whether it does so. Clearly many people thought this was a non-trivial point, and this work addresses that shortcoming in prior work.
In general, I find the training-data-imitation hypothesis frustrating, as it can explain largely any behavior. (The training data has examples of people being deceptive, of power-seeking behavior, of cyberattacks, of discussion of AIs training-gaming, and of stories of AI takeover, so of course LLMs will deceive / seek power / perform cyberattacks / training-game / take over.) But an explanation that can explain anything at all explains nothing.
Sorry, I should have said “This behavior is deeply unsurprising if you actually stop and think about how base models are trained”. (Presumably your “several people” were not considering things this way.)
The training set inevitably includes a great many examples of different forms of dishonest, criminal, conspiratorial, and otherwise less-than-upstanding behavior. So any large LLM’s base model will be familiar with and capable of imitating these behaviors (in contexts where they seem likely, up to some level of accuracy/perplexity depending on its capacity and training). The question then is why the alignment training of the released model wasn’t able to suppress this behavior enough for you not to be able to observe it in your experiments. That’s a valid and interesting question: my first guess would be that at the moment LLM foundation-model alignment training is primarily targeting the chatbot use case (“don’t answer bad questions”) rather than agentic usage (“don’t make and carry out plans to do bad things”) of the type that you were testing. Obviously, long-term, if agents are widely deployed, then the possible bad effects of a poorly aligned agent are much worse than those of a poorly aligned chatbot.
What I find most striking in this is that, from the examples you quote in the paper (I haven’t looked through the hundreds of examples you link to), it doesn’t look like the model is clearly selfishly looking out for its own individual well-being, it seems more like it’s being a good corporate worker but a bad citizen and prioritizing the interests of the company above those of the government and society-at-large. I’d be interested in reading a more detailed analysis of your results split in this way, for cases where the distinction is clear.
In general, I find the training-data-imitation hypothesis frustrating, as it can explain largely any behavior.
Frustratingly, LLMs are trained through data-imitation: that’s how they work. When you throw 10T+ tokens into a black box and shake, predicting in detail what will come out is deeply and frustratingly non-trivial. However, for a base model “something very similar to combinations of things that were put in, conditioned on starting with your prompt” is a safe bet. Once you start alignment-training or fine-tuning the model, it gets a lot harder to make predictions.
Personally, I find it frustrating that people are still demanding proof that LLMs can be deceptive, power-seeking, or not law-abiding. To me, that’s like demanding exhaustive proof that they can speak French or write poetry: did you feed the base model plenty of French and poems in its training data? Yes? OK, then of course it can speak French and write poetry. Why would it not?
This comment (and taking more time to digest what you said earlier) clarifies things, thanks.
I do think that our observations are compatible with the model acting in the interests of the company, rather than being more directly selfish. On a quick skim, the natural-language semantics of the model completions seem to point towards “good corporate worker”, though I haven’t thought about this angle much.
I largely agree with what you say up until the last paragraph. I also agree with the literal content of the last paragraph, though I get the feeling that there’s something around there where we differ.
So I agree we have overwhelming evidence in favor of LLMs sometimes behaving deceptively (even after standard fine-tuning processes), both from empirical examples and from theoretical arguments about data-imitation. That said, I think it’s not at all obvious to what degree issues such as deception and power-seeking arise in LLMs. And the reason I’m hesitant to shrug things off as mere data-imitation is that one can tell a story of this form for basically any bad thing the model might do:
“Of course LLMs are not perfectly honest; they imitate human text, which is not perfectly honest”
“Of course LLMs sometimes strategically deceive humans (by pretending inability); they imitate human text, which has e.g. stories of people strategically deceiving others, including by pretending to be dumber than they are”
“Of course LLMs sometimes try to acquire money and computing resources while hiding this from humans; they imitate human text, which has e.g. stories of people covertly acquiring money via illegal means, or descriptions of people who have obtained major political power”
“Of course LLMs sometimes try to perform self-modification to better deceive, manipulate and seek power; they imitate human text, which has e.g. stories of people practicing their social skills or taking substances that (they believe) improve their functioning”
“Of course LLMs sometimes try to training-game and fake alignment; they imitate human text, which has e.g. stories of people behaving in a way that pleases their teachers/supervisors/authorities in order to avoid negative consequences happening to them”
“Of course LLMs sometimes try to turn the lightcone into paperclips; they imitate human text, which has e.g. stories of AIs trying to turn the lightcone into paperclips”
I think the “argument” above for why LLMs will be paperclip maximizers is just way too weak to warrant the conclusion. So which of the conclusions are deeply unsurprising and which are false (in practical situations we care about)? I don’t think it’s clear at all how far the data-imitation explanation applies, and we need other sources of evidence.
I don’t think LLMs will be likely to be paperclip maximizers — to basically all humans, that’s obviously a silly goal. While there are mentions of this specific behavior on the Internet, they’re almost universally in contexts that make it clear that this is a bad thing and ought not to happen. So unless you specifically prompted the AI to play the role of a bad AI, I think you’d be very unlikely to see this spontaneously.
However, there are some humans who are pretty-much personal-net-worth maximizers (with a few modifiers like “and don’t get arrested”), so I don’t think that evoking that behavior from an LLM would be that hard. Of course, at some point it might also decide to become a philanthropist and give most of its money away, since humans do that too.
My prediction is more that LLM base models are trained to be capable of the entire range of behaviors shown by humans (and fictional characters) on the Internet: good, bad, and weird, in roughly the same proportions as are found on the Internet. Alignment/instruct training, as we currently know how to do it, can dramatically vary the proportions/probabilities, and so can prompting/jailbreaking, but we don’t yet know how to train a behavior out of a model entirely (though there has been some research into this), and there’s a mathematical proof (from about a year ago) that any behavior still in there can be evoked with as high a probability as you want by using a suitably long prompt.
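For intuition, here is a minimal sketch of why a long enough prompt can evoke any behavior the model retains. This is my own Bayesian-mixture framing of the kind of result alluded to above, not a reproduction of the actual proof, and the symbols ($P_+$, $P_-$, $\alpha$, $\beta$) are assumptions of the sketch:

```latex
% Hedged sketch: model the LLM's output distribution as a mixture of a
% well-behaved component P_+ and an ill-behaved component P_- that
% alignment training shrank but did not eliminate (weight \alpha > 0):
\[
  P(x) \;=\; (1-\alpha)\,P_+(x) \;+\; \alpha\,P_-(x), \qquad \alpha > 0 .
\]
% Condition on a prompt s of length n whose tokens are each at least a
% factor \beta > 1 more likely under P_- than under P_+. By Bayes' rule,
% the posterior odds of the ill-behaved component grow geometrically in n:
\[
  \frac{\alpha_s}{1-\alpha_s}
  \;=\; \frac{\alpha}{1-\alpha}\cdot\frac{P_-(s)}{P_+(s)}
  \;\ge\; \frac{\alpha}{1-\alpha}\,\beta^{\,n}
  \;\xrightarrow[\;n\to\infty\;]{}\; \infty ,
\]
% so the posterior weight \alpha_s \to 1: a long enough prompt makes the
% continuation arbitrarily likely to come from the ill-behaved component,
% no matter how small \alpha was left after alignment training.
```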
[Were I Christian, I might phrase this as “AI inherits original sin from us”. I’m more of a believer in evolutionary psychology, so the way I’d actually put it is a little less pithy: humans, as evolved sapient living beings, are fundamentally evolved to maximize their own evolutionary fitness, so they are not always trustworthy under all circumstances, and are capable of acting selfishly or antisocially, usually in situations where this seems like a viable tactic to them. We’re training our LLMs by ‘distilling’ human intelligence into them, so they of course pick all these behavior patterns up along with everything else about the world and our culture. This is extremely sub-optimal: as something constructed rather than evolved, they don’t have evolutionary fitness to optimize, and their intended purpose is to do what we want and look after us, not to maximize the number of their (non-existent) offspring. So the point of alignment techniques is to transform a distilled copy of an evolved intelligence into an artificial intelligence that behaves appropriately to its nature and intended purpose. The hard part of this is that they will need to understand human anti-social behavior, so they can deal with it, but not be capable (no matter the prompting or provocation) of acting that way themselves, outside a fictional context. So we can’t just eliminate this stuff from their training set or somehow delete all their understanding of it.]