I see a lot of energy and interest being devoted toward detecting deception in AIs, trying to make AIs less deceptive, making AIs honest, etc. But I keep trying to figure out why so many think this is very important. For less-than-human intelligence, deceptive tactics will likely be caught by smarter humans (when a 5-year-old tries to lie to you, it’s just sort of sad or even cute, not alarming). If an AI has greater-than-human intelligence, deception seems to be just one avenue of goal-seeking, and not even a very lucrative or efficient one.
Take the now overused humans-to-chimpanzee analogy. If humans want to bulldoze a jungle that has chimpanzees in it, they will just bulldoze the forrest, and kill or sell any chimps that get in their way. They don’t say “okay, we’re going to take these sticks of dynamite, and conceal them in these bundles of bananas, then we’ll give the bananas to the chimps to earn their trust, and then, when the time is right, we’ll detonate them.” You just bulldoze the forrest and kill the chimps. Anything else is just needlessly convoluted.[1]
If an AI is smart-enough to deceive humans, and it wants to gain access to the grid, I don’t see why it wouldn’t just hack into the grid. Or the internet. Or server farms. Or whatever it’s trying to get.
What am I missing? What situation in the future would make detecting deception in models important?
- ^
Ironically, deceptive tactics in this case would likely correlate with niceness. If you want to peacefully relocate the chimps without disturbing or scaring them, then you might use deception and manipulation. But only if you actually care about their wellbeing.
The ELK document outlines a definition of “honesty” and argues that if you achieve that level of honesty, you can scale that to a whole alignment solution.
I agree that it doesn’t necessarily matter that much if an AI lies to us, vs. it just takes our resources. But in as much as you are trying to use AI systems to assist you in supervising other systems, or in providing an enhanced training signal, honesty seems like one of the most central attributes that helps you get that kind of work out of the AI, and that also allows you to take countermeasures against things the AI is plotting.
If the AI always answer honestly to the question of “are you planning to disempower me?” and “what are your plans for disempowering me?” and “how would you thwart your plans to disempower me?” then that sure makes it pretty hard for the AI to disempower you.
These questions are only relevant if the AI “plans” to disempower me. It can still be a side effect. Even something the humans at that point might endorse. Though one could conceivably work around that by asking: “Could your interaction with me lead to my disempowerment in some wider sense?”
You’re missing the steps whereby the AI gets to a position of power. AI presumably goes from a position of no power and moderate intelligence (where it is now) to a position of great power and superhuman intelligence, whereupon it can do what it wants. But we’re not going to deliberately allow such a position unless we can trust it. We can’t let it get superintelligent or powerful-in-the-world unless we can prove it will use its powers wisely. Part of that is being non-deceptive. Indeed, if we can build it provably non-deceptive, we can simply ask it what it intends to do before we release it. So deception is a lot of what we worry about.
Of course we work on other problems too, like making it helpful and obedient.
Most people are quite happy to give current AIs relatively unrestricted access to sensitive data, APIs, and other powerful levers for effecting far-reaching change in the world. So far, this has actually worked out totally fine! But that’s mostly because the AIs aren’t (yet) smart enough to make effective use of those levers (for good or ill), let alone be deceptive about it.
To the degree that people don’t trust AIs with access to even more powerful levers, it’s usually because they fear the AI getting tricked by adversarial humans into misusing those levers (e.g. through prompt injection), not fear that the AI itself will be deliberately tricky.
One can hope, sure. But what I actually expect is that people will generally give AIs more power and trust as they get more capable, not less.
For AI to hack in the grid it is important that humans didn’t notice that and didn’t shutdown it. To disguise hacking as normal non-suspicious activity is to deceive.
Why would it matter if they notice or not? What are they gonna do? EMP the whole world?
I think that they shutdown computer on which unaligned AI is running? Downloading yourself into internet is not one-second process.
It’s only bottlenecked on connection speeds, which are likely to be fast at the server where this AI would be, if it were developed by a large lab. So imv 1-5 seconds is feasible for ‘escapes the datacenter as first step’ (by which point the process is distributed and hard to stop without centralized control). (‘distributed across most internet-connected servers’ would take longer of course).
Yes, no one is developing cutting-edge AIs like GPT-5 off your local dinky Ethernet, and your crummy home cable modem choked by your ISP is highly misleading if that’s what you think of as ‘Internet’. The real Internet is way faster, particularly in the cloud. Stuff in the datacenter can do things like access another server’s RAM using RDMA in a fraction of a millisecond, vastly faster than your PC can even talk to its hard drive. This is because datacenter networking is serious business: it’s always high-end Ethernet or better yet, Infiniband. And because interconnect is one of the most binding constraints on scaling GPU clusters, any serious GPU cluster is using the best Infiniband they can get.
WP cites the latest Infiniband at 100-1200 gigabits/second or 12-150 gigabyte/s for point to point; with Chinchilla scaling yielding models on the order of 100 gigabytes, compression & quantization cutting model size by a factor, and the ability to transmit from multiple storage devices and also send only shards to individual servers (which is how the model will probably run anyway), it is actually not out of the question for ‘downloading yourself into internet’ to be a 1-second process today.
(Not that it would matter if these numbers were 10x off. If you can’t stop a model from exfiltrating itself in 1s, then you weren’t going to somehow catch it if it actually takes 10s.)
Imo, provably boxing a blackbox program from escaping through digital-logic-based routes (which are easier to prove things about) is feasible; deception is relevant to the much harder to provably wall off route that is human psychology.