I think humans are aligned, approximately and on average. And that’s what people mean when they assume humans are aligned. I wish I were sure, because this is an important question.
Does the average human become Stalin if put in his position? I don’t think so, but I can’t say I’m sure. Stalin probably got power because he was the man of steel (a sociopath/warrior mindset), and he probably also became more cruel over the course of competing for power, which demanded a viciousness he would then have justified and internalized. Putting someone else into a position of power wouldn’t necessarily corrupt them the same way. But maintaining his power probably also demanded being vicious on occasion; I’m sure there were others plotting to depose him, in reality as well as in his paranoia.
But the more relevant question is: does the average human help others when given nearly unlimited power and security? That’s the position an AGI, or a human controlling an aligned AGI, would be in. There I think the answer is yes: the average human becomes better when there are no threats to themselves or their loved ones.
I can’t say I’m sure, and I wish I were. This may well be the question on which our future hangs. I think we can achieve technical alignment, because we have promising alignment plans with low taxes for the types of AGI we’re most likely to get. But corrigibility or DWIM is such an attractive primary goal that AGI will probably be aligned to take orders from the human(s) who built it. And then the world hangs on their whims. I’d take that bet over any other alignment scheme, but I’m far from sure it would pay off.