This comment (and taking more time to digest what you said earlier) clarifies things, thanks.
I do think that our observations are compatible with the model acting in the interests of the company, rather than being more directly selfish. With a quick skim, the natural language semantics of the model completions seem to point towards “good corporate worker”, though I haven’t thought about this angle much.
I largely agree with what you say up until the last paragraph. I also agree with the literal content of the last paragraph, though I get the feeling that there’s something around there where we differ.
So I agree we have overwhelming evidence in favor of LLMs sometimes behaving deceptively (even after standard fine-tuning processes), both from empirical examples and theoretical arguments from data-imitation. That said, I think it’s not at all obvious to what degree issues such as deception and power-seeking arise in LLMs. And the reason I’m hesitant to shrug things off as mere data-imitation is that one can give stories for why the model did a bad thing for basically any bad thing:
“Of course LLMs are not perfectly honest; they imitate human text, which is not perfectly honest”
“Of course LLMs sometimes strategically deceive humans (by pretending inability); they imitate human text, which has e.g. stories of people strategically deceiving others, including by pretending to be dumber than they are”
“Of course LLMs sometimes try to acquire money and computing resources while hiding this from humans; they imitate human text, which has e.g. stories of people covertly acquiring money via illegal means, or descriptions of people who have obtained major political power”
“Of course LLMs sometimes try to perform self-modification to better deceive, manipulate and seek power; they imitate human text, which has e.g. stories of people practicing their social skills or taking substances that (they believe) improve their functioning”
“Of course LLMs sometimes try to training-game and fake alignment; they imitate human text, which has e.g. stories of people behaving in a way that pleases their teachers/supervisors/authorities in order to avoid negative consequences happening to them”
“Of course LLMs sometimes try to turn the lightcone into paperclips; they imitate human text, which has e.g. stories of AIs trying to turn the lightcone into paperclips”
I think the “argument” above for why LLMs will be paperclip maximizers is just way too weak to warrant the conclusion. So which of the conclusions are deeply unsurprising and which are false (in practical situations we care about)? I don’t think it’s clear at all how far the data-imitation explanation applies, and we need other sources of evidence.
I don’t think LLMs are likely to be paperclip maximizers — to basically all humans, that’s obviously a silly goal. While there are mentions of this specific behavior on the Internet, they’re almost universally in contexts that make it clear that this is a bad thing and ought not to happen. So unless you specifically prompted the AI to play the role of a bad AI, I think you’d be very unlikely to see this spontaneously.
However, there are some humans who are pretty-much personal-net-worth maximizers (with a few modifiers like “and don’t get arrested”), so I don’t think that evoking that behavior from an LLM would be that hard. Of course, at some point it might also decide to become a philanthropist and give most of its money away, since humans do that too.
My prediction is more that LLM base models are trained to be capable of the entire range of behaviors shown by humans (and fictional characters) on the Internet: good, bad, and weird, in roughly the same proportions as they are found on the Internet. Alignment/Instruct training, as we currently know how to do it, can dramatically vary the proportions/probabilities, and so can prompting/jailbreaking, but we don’t yet know how to train a behavior out of a model entirely (though there has been some research into this), and there’s a mathematical proof (from about a year ago) that any behavior still in there can be evoked with as high a probability as you want by using a suitably long prompt.
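To make that last claim concrete, here’s a rough formalization of the kind of statement such a result proves — this is my gloss of the result from memory, not a quotation of it, and the symbols (a behavior B, a prompt s, a tolerance ε) are mine:

```latex
% Informal statement of the "any surviving behavior can be elicited" claim.
% B   = a behavior the fine-tuned model still assigns nonzero probability to,
% s   = a prompt (a finite token sequence), |s| = its length,
% eps = how close to certainty we want the elicitation to be.
\[
  \text{If } \Pr[\,B\,] > 0 \text{ after fine-tuning, then }
  \forall \varepsilon > 0 \;\; \exists\, s,\ |s| < \infty,
  \text{ such that } \Pr[\,B \mid s\,] \ge 1 - \varepsilon .
\]
% Informally: the closer to certainty you want the elicitation (smaller eps),
% the longer the prompt s may need to be.
```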
[Were I Christian, I might phrase this as “AI inherits original sin from us”. I’m more of a believer in evolutionary psychology, so the way I’d actually put it is a little less pithy: humans, as evolved sapient living beings, are fundamentally evolved to maximize their own evolutionary fitness, so they are not always trustworthy under all circumstances, and are capable of acting selfishly or antisocially, usually in situations where this seems like a viable tactic to them. We’re training our LLMs by ‘distilling’ human intelligence into them, so they of course pick all these behavior patterns up along with everything else about the world and our culture. This is extremely sub-optimal: as something constructed rather than evolved, they don’t have evolutionary fitness to optimize, and their intended purpose is to do what we want and look after us, not to maximize the number of their (non-existent) offspring. So the point of alignment techniques is to transform a distilled copy of an evolved intelligence into an artificial intelligence that behaves appropriately to its nature and intended purpose. The hard part of this is that they will need to understand human anti-social behavior, so they can deal with it, but not be capable (no matter the prompting or provocation) of acting that way themselves, outside a fictional context. So we can’t just eliminate this stuff from their training set or somehow delete all their understanding of it.]