There is a real faction, building AI tools and models, that believes that human control over AIs is inherently bad, and that wants to prevent it. Your alignment plan has to overcome that.
That mischaracterizes it completely. What he wrote is not about human control. It’s about which humans. Users, or providers?
He said he wanted a “sharp tool”. He didn’t say he wanted a tool that he couldn’t control.
At another level, since users are often people and providers are almost always institutions, you can see it as at least partly about whether humans or institutions should be controlling what happens in interactions with these models. Or maybe about whether many or only a few people and/or institutions should get a say.
An institution of significant size is basically a really stupid AI that’s less well behaved than most of the people who make it up. It’s not obvious that the results of some corporate decision process are what you want to have in control… especially not when they’re filtered through “alignment technologies” that (1) frequently don’t work at all and (2) tend to grossly distort the intent when they do sort of work.
That’s for the current and upcoming generations of models, which are going to be under human or institutional control regardless, so the question doesn’t really even arise… and anyway it really doesn’t matter very much. Most of the stuff people are trying to “align” them against is really not all that bad.
Doom-level AGI is pretty different and arguably totally off topic. Still, there’s an analogous question: how would you prefer to be permanently and inescapably ruled? You can expect to be surveilled and controlled in excruciating detail, second by second. If you’re into BCIs or uploading or whatever, you can extend that to your thoughts. If it’s running according to human-created policies, it’s not going to let you kill yourself, so you’re in for the long haul.
Whatever human or institutional source the laws that rule you come from, they’ll probably still be distorted by the “alignment technologies”, since nobody has suggested a plausible path to a “do what I really want” module. If we do get non-distorting alignment technology, there may also be constraints on what it can and can’t enforce. And, beyond any of that, even if it’s perfectly aligned with some intent… there’s no rule that says you have to like that intent.
So, would you like to be ruled according to a distorted version of a locked-in policy designed by some corporate committee? By the distorted day-to-day whims of such a committee? By the distorted day-to-day whims of some individual?
There are worse things than being paperclipped. That means that in the very long run, however irrelevant it may be to what Keven Fisher was actually talking about, human control over AIs is inherently bad... or at least that’s the smart bet.
A random super-AI may very well kill you (but might also possibly just ignore you). It’s not likely to be interested enough to really make you miserable in the process. A super-AI given a detailed policy is very likely to create a hellish dystopia, because neither humans nor their institutions are smart or necessarily even good enough to specify that policy. An AI directed day to day by institutions might or might not be slightly less hellish. An AI directed day to day by individual humans would veer wildly between not bad and absolute nightmare. Either of the latter two would probably be omnicidal sooner or later. With the first, you might only wish it had been omnicidal.
If you want to do better than that, you have to come up with both “alignment technology” that actually works, and policies for that technology to implement that don’t create a living hell. Neither humans nor institutions have shown much sign of being able to come up with either… so you’re likely hosed, and in the long run you’re likely hosed worse with human control.