Intent alignment as a stepping-stone to value alignment

I think Instruction-following AGI is easier and more likely than value aligned AGI, and that this accounts for one major crux of disagreement on alignment difficulty. I got several responses to that piece that didn’t dispute that intent alignment is easier, but argued we shouldn’t give up on value alignment. I think that’s right. Here’s another way to frame the value of personal intent alignment: we can use a superintelligent instruction-following AGI to solve full value alignment.

This is different from automated alignment research; it’s not hoping tool AI can help with our homework, it’s making an AGI smarter than us in every way do our homework for us. It’s a longer term plan. Having a superintelligent, largely autonomous entity that just really likes taking instructions from puny humans is counterintuitive, but it seems both logically consistent and technically achievable on the current trajectory, if we don’t screw it up too badly.

Personal, short-term intent alignment (like instruction-following) is safer for early AGI because it includes corrigibility. It allows near-misses. If your AGI decides that eliminating humans would be a good way to cure cancer, but it’s not powerful enough to make that happen immediately, you’ll probably get a chance to say “so what’s your plan for that cancer solution?” and “Wait, no! Quit working on that plan!” (And that’s only if you somehow didn’t tell it to check with you before acting on big plans.)

This type of target really seems to make alignment much easier. See the first linked post, or Max Harms’ excellent sequence on corrigibility as a singular (alignment) target (CAST) for a much deeper analysis. An AI that wants to follow directions also wants to respond honestly about its motivations when asked, and to change its goals when told to—because its goals are all subgoals of doing what its principal asks. And this approach doesn’t have to “solve ethics”—because it follows the principal’s ethics.

And that’s the critical flaw; we’re still stuck with variable and questionable human ethics. Having humans control AGI is not a permanent solution to the dangers of AGI. Even if the first creators are relatively well-intentioned, eventually someone sociopathic enough will get the reins of a powerful AGI and use it to seize the future.

In this scenario, technical alignment is solved, but most of us die anyway. We die as soon as a sufficiently malevolent person acquires or seizes power (probably governmental power) over an AGI.

But won’t a balance of power restrain one malevolently-controlled AGI surrounded by many in good hands? I don’t think so. Mutually assured destruction works for nukes but not as well with AGI capable of autonomous recursive self-improvement. A superintelligent AGI will probably be able to protect at least its principal and a few of their favorite people as part of a well-planned destructive takeover. If nobody else has yet used their AGI to firmly seize control of the lightcone, there’s probably a way for an AGI to hide and recursively self-improve until it invents weapons and strategies that let it take over—if its principal can accept enough collateral damage. With a superintelligence on your side, building a new civilization to your liking might be seen as more an opportunity than an inconvenience.

These issues are discussed in more depth in “If we solve alignment, do we die anyway?” and its discussion. “To the average human, controlled AI is just as lethal as ‘misaligned’ AI” draws similar conclusions from a different perspective.

It seems inevitable that someone sufficiently malevolent would eventually get the reins of an intent-aligned AGI. This might not take long even if AGI does not proliferate widely; there are reasons to think that malevolence could correlate with attaining and retaining positions of power. Maybe there’s a way to prevent this with the aid of increasingly intelligent AGIs; if not, it seems like taking power out of human hands before it falls into the wrong ones will be necessary.

Writing “If we solve alignment, do we die anyway?” and discussing the claims in the comments drew me to the conclusion that the end goal probably needs to be value alignment, just like we’ve always thought: human power structures are too vulnerable to infiltration or takeover by malevolent humans. But instruction-following is a safer first alignment target. So it can be a stepping-stone that dramatically improves our odds of getting to value aligned AGI.

Humans in control of highly intelligent AGI will have a huge advantage in solving the full value alignment problem. At some point, they will probably be pretty certain that a plan for value alignment can be accomplished, at least well enough to maintain much of the value of the lightcone by human lights (perfect alignment seems impossible since human values are path-dependent, but we should be able to do pretty well).

Thus, the endgame goal is still full value alignment for superintelligence, but the route there is probably through short-term personal intent alignment.

Is this a great plan? Certainly not. It hasn’t been thought through, and there’s probably a lot that can go wrong even once it’s as refined as possible. In an easier world, we’d Shut it All Down until we’re ready to do it wisely. That doesn’t look like an option, so I’m trying to plot a practically achievable path from where we are to real success.