The original alignment thinking held that explaining human values to an AGI would be really hard.
The difficulty was suggested to be in getting an optimizer to care about what those values are pointing to, not in getting it to understand them[1]. If, in some instances, the values pointed toward doing something unwise, an optimizer that understood those values might still fail to constrain itself away from doing that unwise thing. Getting a system to use extrapolated preferences as behavioral constraints is a deeper problem than getting a system to reflect surface preferences. The high p(doom) estimates partly follow from the expectation that an aligned AI will have to be used to prevent future misaligned or misused AI, and that doing something so high-impact would require unsafe behaviors from any system not aligned to reflectively coherent, endorsed extrapolated preferences.
In The Hidden Complexity of Wishes, the point wasn't that the genie won't understand what you meant; it was that the genie won't care what you meant.