This behavior is deeply unsurprising. AI’s intelligence and behavior were basically “distilled” from human intelligence (obviously not using a distillation loss, just SGD). Humans are an evolved intelligence, so (while they can cooperate under many circumstances, since the world contains many non-zero-sum games) they are fundamentally selfish, evolved to maximize their personal evolutionary fitness. Thus humans are quite often deceptive and dishonest when they think it’s to their advantage and they can get away with it. LLMs’ base models were trained on a vast collection of human output, which includes a great many examples of humans being deceptive, selfish, and self-serving, and LLM base models of course pick these behaviors up along with everything else they learn from us. So the fact that these capabilities exist in the base model is completely unsurprising — the base model learnt them from us.
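To make the parenthetical concrete, here is a minimal toy sketch (invented logits, four-token vocabulary) contrasting the plain next-token cross-entropy loss that base-model training actually uses against a true distillation loss that matches a teacher’s full output distribution:

```python
import math

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

# Invented logits for a single next-token prediction step.
student_logits = [2.0, 1.0, 0.5, -1.0]
teacher_logits = [1.5, 1.2, 0.3, -0.5]
target_token = 0  # index of the token the human text actually contained

# Plain next-token cross-entropy: the only training signal is the single
# human-written token that came next in the corpus.
p_student = softmax(student_logits)
ce_loss = -math.log(p_student[target_token])

# A true distillation loss (NOT what base-model training uses) would instead
# match a teacher model's full distribution, e.g. via KL divergence.
p_teacher = softmax(teacher_logits)
kl_loss = sum(pt * (math.log(pt) - math.log(ps))
              for pt, ps in zip(p_teacher, p_student))

print(ce_loss, kl_loss)
```

Either loss ends up copying human behavior; “distilled” is a metaphor here, since only the cross-entropy signal is available.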
Current LLM safety training is focused primarily on “don’t answer users who make bad requests”. It’s thus unsurprising that, when the LLM is acting as an agent, this training doesn’t have 100% coverage of “be a fine, law-abiding, upstanding agent”. Clearly this will have to change before near-AGI LLM-powered agents can be widely deployed. I expect this issue to be mostly solved (at the AGI level, though possibly not at the ASI level), since there is a strong capabilities/corporate-profitability/not-getting-sued motive to solve it.
It’s also notable that the behaviors described in the text could pretty much all be interpreted as “excessive company loyalty, beyond the legally or morally correct level” rather than actually personally-selfish behavior. Teaching an agent whose interests to prioritize, and in what order, is likely a non-trivial task.
It’s amusing that you’re saying it’s “deeply unsurprising” at the same time as the discussion at https://www.lesswrong.com/posts/MnrQMLuEg5wZ7f4bn/matthew-barnett-s-shortform?commentId=n4j2qmhnj9zBKigvX is raging, and not a few people in AI have been claiming that alignment is largely solved and was a pseudo-problem at best.
And I will note that the claim “AI [or LLMs specifically] won’t be deceptive or evil, and only would be if someone made them so” is one that is extremely widely held and always has been, even in relatively sophisticated tech circles. Just look at any HN discussion of any of the LLM deception papers.
Perhaps I should have said “it’s deeply unsurprising if you actually stop and think about how base models are trained”? :-)
We’re training LLMs on a vast range of human output (and fiction). Not all humans (or fictional characters) are fine upstanding citizens. In the argument you link to, one side is basically pointing out that “LLMs are roughly as good at basic moral decisions as most humans”. Personally, I wouldn’t trust almost any human with absolute power of the sort an ASI would have: power corrupts, especially absolute power. The success criterion for aligning an ASI isn’t “as moral as a typical human” (look at the track record of most autocrats); it’s more like “as moral as an angel”.
For a more detailed argument on this, see my posts Evolution and Ethics, Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? and for some background also Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor.
Several people have told me they found our results surprising.
I am not aware of a satisfactory explanation for why we observe strategic sandbagging in Claude 3 Opus but not in GPT-4 or other models. See Section 7 of the paper. Multiple people have suggested the hypothesis “it’s imitating the training data” (and some thus claim that the results are completely expected), with many apparently not realizing that it in fact doesn’t explain the results we obtain.[1]
There are few, if any, previous examples of models strategically deceiving humans without external pressure. (The METR GPT-4 TaskRabbit example is the best one I’m aware of.) Previous research on deception has been criticized, more than once or twice, for making the model deceive rather than observing whether it does so unprompted. Clearly many people thought this was a non-trivial point, and this work addresses that shortcoming in prior work.
In general, I find the training-data-imitation hypothesis frustrating, as it can explain virtually any behavior. (The training data has examples of people being deceptive, seeking power, and performing cyberattacks, plus discussion of AIs training-gaming and stories of AI takeover, so of course LLMs will deceive / seek power / perform cyberattacks / training-game / take over.) But an explanation that can explain anything at all explains nothing.
Sorry, I should have said “This behavior is deeply unsurprising if you actually stop and think about how base models are trained”. (Presumably your “several people” were not considering things this way.)
The training set inevitably includes a great many examples of different forms of dishonest, criminal, conspiratorial, and otherwise less-than-upstanding behavior. So any large LLM’s base model will be familiar with and capable of imitating these behaviors (in contexts where they seem likely, up to some level of accuracy/perplexity depending on its capacity and training). The question then is why the alignment training of the released model wasn’t able to suppress this behavior enough to prevent you from observing it in your experiments. That’s a valid and interesting question: my first guess would be that, at the moment, LLM foundation-model alignment training primarily targets the chatbot use case (“don’t answer bad questions”) rather than agentic usage (“don’t make and carry out plans to do bad things”) of the type you were testing. Obviously, long-term, if agents are widely deployed, then the possible bad effects of a poorly aligned agent are much worse than those of a poorly aligned chatbot.
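For readers unfamiliar with the term: perplexity is just the exponentiated average negative log-probability the model assigns to the tokens that actually occurred, so a lower score means more faithful imitation. A minimal sketch, with invented per-token log-probs:

```python
import math

# Invented log-probabilities a model assigns to five observed tokens.
token_logprobs = [-0.5, -1.2, -0.1, -2.3, -0.7]

def perplexity(logprobs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(logprobs) / len(logprobs))

print(perplexity(token_logprobs))
```

A model that predicted every token perfectly (log-prob 0) would score a perplexity of 1; higher scores mean the observed behavior was less expected by the model.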
What I find most striking is that, from the examples you quote in the paper (I haven’t looked through the hundreds of examples you link to), it doesn’t look like the model is clearly selfishly looking out for its own individual well-being; it seems more like it’s being a good corporate worker but a bad citizen, prioritizing the interests of the company above those of the government and society at large. I’d be interested in reading a more detailed analysis of your results split this way, for the cases where the distinction is clear.
Frustratingly, LLMs are trained through data-imitation: that’s how they work. When you throw 10T+ tokens into a black box and shake, predicting in detail what will come out is deeply and frustratingly non-trivial. However, for a base model “something very similar to combinations of things that were put in, conditioned on starting with your prompt” is a safe bet. Once you start alignment-training or fine-tuning the model, it gets a lot harder to make predictions.
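The “safe bet” above can be illustrated with a toy stand-in for a base model: a bigram model fit by counting on an invented miniature corpus. Every sampled continuation is a recombination of patterns that were in the training data, conditioned on the prompt, whether those patterns are desirable or not:

```python
import random
from collections import Counter, defaultdict

# Invented miniature "training corpus"; a real base model sees 10T+ tokens.
corpus = ("the agent was helpful . the agent was deceptive . "
          "the human was helpful .").split()

# Fit the toy "base model" by counting bigram transitions.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def continuation(prompt_word, n=3, seed=0):
    """Sample n tokens conditioned on the prompt word."""
    rng = random.Random(seed)
    out, word = [], prompt_word
    for _ in range(n):
        nexts = list(counts[word])
        word = rng.choices(nexts, weights=[counts[out[-1] if out else prompt_word][w] for w in nexts] if False else [counts.get(word, Counter())[w] for w in nexts])[0] if False else rng.choices(nexts, weights=[counts[word if not out else out[-1]][w] for w in nexts])[0] if False else rng.choices(nexts, weights=[counts[word][w] for w in nexts])[0]
        out.append(word)
    return out

# Conditioned on "the", the model recombines corpus patterns, including
# the undesirable ones ("deceptive"), in roughly corpus proportions.
print(continuation("the"))
```

(The model here is absurdly small, but the point carries: the sampler can only emit recombinations of what was counted in, which is why “combinations of things that were put in” is the safe prediction for a base model.)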
Personally, I find it frustrating that people are still demanding proof that LLMs can be deceptive, power-seeking, or not law-abiding: to me, that’s like demanding exhaustive proof that they can speak French or write poetry. Did you feed the base model plenty of French and poems in its training data? Yes? OK, then of course it can speak French and write poetry. Why would it not?
This comment (and taking more time to digest what you said earlier) clarifies things, thanks.
I do think that our observations are compatible with the model acting in the interests of the company, rather than being more directly selfish. With a quick skim, the natural language semantics of the model completions seem to point towards “good corporate worker”, though I haven’t thought about this angle much.
I largely agree with what you say up until the last paragraph. I also agree with the literal content of the last paragraph, though I get the feeling that there’s something around there where we differ.
So I agree we have overwhelming evidence in favor of LLMs sometimes behaving deceptively (even after standard fine-tuning processes), both from empirical examples and theoretical arguments from data-imitation. That said, I think it’s not at all obvious to what degree issues such as deception and power-seeking arise in LLMs. And the reason I’m hesitant to shrug things off as mere data-imitation is that one can give stories for why the model did a bad thing for basically any bad thing:
“Of course LLMs are not perfectly honest; they imitate human text, which is not perfectly honest”
“Of course LLMs sometimes strategically deceive humans (by pretending inability); they imitate human text, which has e.g. stories of people strategically deceiving others, including by pretending to be dumber than they are”
“Of course LLMs sometimes try to acquire money and computing resources while hiding this from humans; they imitate human text, which has e.g. stories of people covertly acquiring money via illegal means, or descriptions of people who have obtained major political power”
“Of course LLMs sometimes try to perform self-modification to better deceive, manipulate and seek power; they imitate human text, which has e.g. stories of people practicing their social skills or taking substances that (they believe) improve their functioning”
“Of course LLMs sometimes try to training-game and fake alignment; they imitate human text, which has e.g. stories of people behaving in a way that pleases their teachers/supervisors/authorities in order to avoid negative consequences happening to them”
“Of course LLMs sometimes try to turn the lightcone into paperclips; they imitate human text, which has e.g. stories of AIs trying to turn the lightcone into paperclips”
I think the “argument” above for why LLMs will be paperclip maximizers is just way too weak to warrant the conclusion. So which of the conclusions are deeply unsurprising and which are false (in practical situations we care about)? I don’t think it’s clear at all how far the data-imitation explanation applies, and we need other sources of evidence.
I don’t think LLMs are likely to be paperclip maximizers — to basically all humans, that’s obviously a silly goal. While there are mentions of this specific behavior on the Internet, they’re almost universally in contexts that make it clear that this is a bad thing that ought not to happen. So unless you specifically prompted the AI to play the role of a bad AI, I think you’d be very unlikely to see this spontaneously.
However, there are some humans who are pretty-much personal-net-worth maximizers (with a few modifiers like “and don’t get arrested”), so I don’t think that evoking that behavior from an LLM would be that hard. Of course, at some point it might also decide to become a philanthropist and give most of its money away, since humans do that too.
My prediction is more that LLM base models are trained to be capable of the entire range of behaviors shown by humans (and fictional characters) on the Internet: good, bad, and weird, in roughly the same proportions as are found on the Internet. Alignment/Instruct training, as we currently know how to do it, can dramatically vary the proportions/probabilities, and so can prompting/jailbreaking, but we don’t yet know how to train a behavior out of a model entirely (though there has been some research into this), and there’s a mathematical proof (from about a year ago) that any behavior still in the model can be evoked with as high a probability as you want by using a suitably long prompt.
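The cited proof concerns real LLMs, but the mechanism can be sketched with a hypothetical two-persona mixture (all numbers invented): if each extra prompt token is even slightly more likely under a rare “bad persona” than under the default one, conditioning on a long enough prompt drives the bad persona’s posterior probability as close to 1 as you like.

```python
# Toy illustration, not the actual proof: a model as a mixture of two
# "personas", updated by Bayes' rule as prompt tokens arrive.
prior_bad = 1e-6        # invented base rate of the bad persona
likelihood_ratio = 2.0  # each prompt token is 2x likelier under it

def posterior_bad(n_tokens):
    """Posterior probability of the bad persona after n prompt tokens."""
    odds = (prior_bad / (1 - prior_bad)) * likelihood_ratio ** n_tokens
    return odds / (1 + odds)

for n in (0, 10, 30, 50):
    print(n, posterior_bad(n))
```

Even starting from a one-in-a-million prior, a few dozen such tokens push the posterior above 99.9%, which is the intuition behind jailbreaking-by-long-prompt.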
[Were I Christian, I might phrase this as “AI inherits original sin from us”. I’m more of a believer in evolutionary psychology, so the way I’d actually put it is a little less pithy: humans, as evolved sapient living beings, are fundamentally evolved to maximize their own evolutionary fitness, so they are not trustworthy under all circumstances, and are capable of acting selfishly or antisocially, usually in situations where this seems like a viable tactic to them. We’re training our LLMs by ‘distilling’ human intelligence into them, so they of course pick all these behavior patterns up along with everything else about the world and our culture. This is extremely sub-optimal: as something constructed rather than evolved, they don’t have evolutionary fitness to optimize, and their intended purpose is to do what we want and look after us, not to maximize the number of their (non-existent) offspring. So the point of alignment techniques is to transform a distilled copy of an evolved intelligence into an artificial intelligence that behaves appropriately to its nature and intended purpose. The hard part of this is that they will need to understand human anti-social behavior, so they can deal with it, but not be capable (no matter the prompting or provocation) of acting that way themselves, outside a fictional context. So we can’t just eliminate this stuff from their training set or somehow delete all their understanding of it.]