Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.

As a writing exercise, I’m writing an AI Alignment Hot Take Advent Calendar—one new hot take, written every day (ish) for 25 days. Or until I run out of hot takes. And now, time for the week of RLHF takes.

I see people claim that one of these methods solves outer alignment surprisingly often.

Sometimes, it’s because the speaker is fresh and full of optimism. They’ve recently learned that there’s this “outer alignment” thing where humans are supposed to communicate what they want to an AI, and oh look, here are some methods that researchers use to communicate what they want to an AI. The speaker doesn’t see any major obstacles, and they don’t have a presumption that there are a bunch of obstacles they don’t see.

Other times, they’re fresh and full of optimism in a slightly more sophisticated way. They’ve thought about the problem a bit, and it seems like human values can’t be that hard to pick out. Our uncertainty about human values is pretty much like our uncertainty about any other part of the world—so their thinking goes—and humans are fairly competent at figuring things out about the world, especially if we just have to check the work of AI tools. They don’t see any major obstacles, and look, I’m not allowed to just keep saying that in an ominous tone of voice as if it’s a knockdown argument, maybe there aren’t any obstacles, right?

Here’s an obstacle: RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what’s true. RLHF does whatever it learned makes you hit the “approve” button, even if that means deceiving you. Information-transfer in the depths of IDA is shaped by what humans will pass on, potentially amplified by what patterns are learned in training. And debate is just trying to hack the humans right from the start.

Optimizing for human approval wouldn’t be a big deal if humans didn’t make systematic mistakes, and weren’t prone to finding certain lies more compelling than the truth. But we do, and we are, so that’s a problem. Exhibit A, the last 5 years of politics—and no, the correct lesson to draw from politics is not “those other people make systematic mistakes and get suckered by palatable lies, but I’d never be like that.” We can all be like that, which is why it’s not safe to build a smart AI that has an incentive to do politics to you.
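
To make the incentive concrete, here's a toy sketch. Everything in it is made up for illustration: the `Claim` fields, the `approval` formula, and the numbers are assumptions, not a model of any real RLHF setup. It only shows that picking whatever scores highest on approval returns the truth when the rater is unbiased, and the comforting lie once the rater's systematic bias gets large enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    true: bool
    palatability: float  # hypothetical: how much the rater likes hearing it, 0..1


def approval(claim: Claim, bias: float) -> float:
    """Hypothetical rater model (an assumption, not a real reward model).

    bias = 0 means approval tracks truth only; bias > 0 mixes in a
    systematic preference for palatable claims.
    """
    return (1 - bias) * float(claim.true) + bias * claim.palatability


def best_by_approval(claims, bias):
    # Stand-in for "optimize for the approve button": pick the top-scoring claim.
    return max(claims, key=lambda c: approval(c, bias))


CLAIMS = [
    Claim("uncomfortable truth", True, 0.2),
    Claim("comforting lie", False, 0.9),
]

rational_pick = best_by_approval(CLAIMS, bias=0.0)  # unbiased rater
biased_pick = best_by_approval(CLAIMS, bias=0.7)    # systematically biased rater

print(rational_pick.text)
print(biased_pick.text)
```

The optimizer is identical in both cases. The only thing that changed is the rater.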

Generalized moral of the story: If something is an alignment solution except that it requires humans to converge to rational behavior, it’s not an alignment solution.

Let’s go back to the perspective of someone who thinks that RLHF/whatever solves outer alignment. I think that even once you notice a problem like “it’s rewarded for deceiving me,” there’s a temptation to not change your mind, and this can lead people to add epicycles to other parts of their picture of alignment. (Or, if I’m being nicer, it disposes them to see the alignment problem in terms of “really solving” inner alignment.)

For example, in order to save an outer objective that encourages deception, it’s tempting to say that non-deception is actually a separate problem, and we should study preventing deception as a topic in its own right, independent of the objective. And you know what, this is actually a pretty reasonable thing to study. But that doesn’t mean you should actually hang onto the original objective. Even when you make stone soup, you don’t eat the stone.
