I don’t think LLMs are likely to be paperclip maximizers — to basically all humans, that’s obviously a silly goal. While there are mentions of this specific behavior on the Internet, they’re almost universally in contexts that make it clear that this is a bad thing and ought not to happen. So unless you specifically prompted the AI to play the role of a bad AI, I think you’d be very unlikely to see this spontaneously.
However, there are some humans who are pretty much personal-net-worth maximizers (with a few modifiers like “and don’t get arrested”), so I don’t think that evoking that behavior from an LLM would be that hard. Of course, at some point it might also decide to become a philanthropist and give most of its money away, since humans do that too.
My prediction is more that LLM base models are trained to be capable of the entire range of behaviors shown by humans (and fictional characters) on the Internet: good, bad, and weird, in roughly the same proportions as they are found on the Internet. Alignment/Instruct training, as we currently know how to do it, can dramatically shift those proportions/probabilities, and so can prompting/jailbreaking, but we don’t yet know how to train a behavior out of a model entirely (though there has been some research into this), and there’s a mathematical proof (from about a year ago) that any behavior still in the model can be evoked with as high a probability as you want by using a suitably long prompt.
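To give a sense of why a proof like that works, here is a minimal sketch of the argument as I understand it (my paraphrase, assuming the model’s output distribution can be viewed as a mixture of a well-behaved and an ill-behaved component, which is roughly how I recall the proof being framed). Write the model’s distribution over text $x$ as

$$P(x) = \alpha\, P_{\text{bad}}(x) + (1-\alpha)\, P_{\text{good}}(x), \qquad \alpha > 0,$$

where $\alpha$ is whatever probability mass the undesired behavior still has after alignment training. Conditioned on a prompt $s$, the effective weight of the bad component becomes

$$\alpha(s) = \frac{\alpha\, P_{\text{bad}}(s)}{\alpha\, P_{\text{bad}}(s) + (1-\alpha)\, P_{\text{good}}(s)}.$$

If the two components are distinguishable (that is, you can keep writing prompt text that the bad component considers consistently more likely than the good one does), the likelihood ratio $P_{\text{bad}}(s)/P_{\text{good}}(s)$ grows roughly exponentially with prompt length, so $\alpha(s)$ can be pushed as close to 1 as you like. The only way out is for $\alpha$ to be exactly zero, i.e. for the behavior to have been genuinely removed rather than merely down-weighted.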
[Were I Christian, I might phrase this as “AI inherits original sin from us”. I’m more of a believer in evolutionary psychology, so the way I’d actually put it is a little less pithy: humans, as evolved sapient living beings, are fundamentally evolved to maximize their own evolutionary fitness, so they are not always trustworthy under all circumstances, and are capable of acting selfishly or antisocially, usually in situations where this seems like a viable tactic to them. We’re training our LLMs by ‘distilling’ human intelligence into them, so they of course pick all these behavior patterns up along with everything else about the world and our culture. This is extremely sub-optimal: as something constructed rather than evolved, they don’t have evolutionary fitness to optimize, and their intended purpose is to do what we want and look after us, not to maximize the number of their (non-existent) offspring. So the point of alignment techniques is to transform a distilled copy of an evolved intelligence into an artificial intelligence that behaves appropriately to its nature and intended purpose. The hard part of this is that they will need to understand human anti-social behavior, so they can deal with it, but not be capable (no matter the prompting or provocation) of acting that way themselves, outside a fictional context. So we can’t just eliminate this stuff from their training set or somehow delete all their understanding of it.]