The original alignment thinking held that explaining human values to AGI would be really hard.
The difficulty was suggested to be in getting an optimizer to care about what those values are pointing to, not to understand them[1]. If in some instances the values mapped to doing something unwise, using an optimizer that understood those values might fail to constrain away from doing something unwise. Getting a system to use extrapolated preferences as behavioral constraints is a deeper problem than getting a system to reflect surface preferences. The high p(doom) estimates partly follow from expecting that an aligned AI will have to be used to prevent future misaligned/misused AI, and that doing something so high impact would require unsafe behaviors in a system not aligned to reflectively coherent and endorsed extrapolated preferences.
- ^
In The Hidden Complexity of Wishes, it wasn’t the genie won’t understand what you meant, it was the genie won’t care what you meant.
‘Alignment’ has been used to refer to both aligning a single AI model, and the harder problem of aligning all AIs. This difference in the way the word alignment is used has led to some confusion. Alignment is not solved by aligning a single AI model, but by using a strategy which prevents catastrophic misalignment/misuse from any AI.