I actually completely agree with this call to action.
Unfortunately, I suspect that it’s impossible to make value alignment easier than personal intent alignment. I can’t think of a technical alignment approach that couldn’t be used both ways equally well. Worse than that, I think intent-aligned AGI is easier than value-aligned AGI, for reasons I outline in that post and that Max Harms has elaborated in much more detail in his Corrigibility as Singular Target sequence (as have Paul Christiano and many others).
But I still agree with your call to action: we should be working now to make value alignment as safe as possible. That requires deciding what we align to. The concept of humanity is not well-defined in the future, when upgrades and digital copies of human minds become possible. Roger Dearnaley’s sequence AI, alignment, and ethics lays out these problems and more; for instance, if we stick to baseline humans, the future will be largely controlled by whatever values are held by the most humans, in a competition for memes and reproduction. So there’s conceptual as well as technical/mind-design work to be done on technical alignment.
And that work should be done. In multipolar scenarios, someone may well decide to “launch” their AGI to be autonomous with value alignment, out of magnanimity or desperation. We’d better make their odds of success as high as we can manage.
I don’t think refusing to work on intent alignment is a helpful option. It will likely be tried, with or without our help. Following instructions is the most obvious alignment target for any agent that’s even approaching autonomy and therefore usefulness. Thinking about how to make those attempts successful will also increase our odds of surviving the first competent autonomous AGIs.
WRT definitions: the term “alignment” doesn’t specify alignment with whom. I think this ambiguity is causing important confusions in the field.
I was trying to draw a distinction between two importantly different alignment goals, which I’m terming personal intent alignment and value alignment until better terminology comes along. More on that in an upcoming post.
If you did have an AGI that follows instructions and you told it “do the right thing,” you’d have to specify right for whom.
And during the critical risk period, that AGI wouldn’t know for sure what the right thing was. We don’t expect godlike intelligence right out of the gate. It won’t know whether a risky takeover/pivotal act is the right move. If the situation is multipolar, it won’t know even as it becomes truly superintelligent, because it will have to guess at the plans, technologies, and capabilities of other superintelligent AGIs.
My call to action is this: help me understand and make or break the argument that a multipolar scenario is very bad, so that the people in charge of the first really successful AGI project know the stakes when they make their calls.