On one hand, I agree that intent alignment is insufficient for preventing x-risk from AI. There are too many other ways for AI to go wrong: coordination failures, surveillance, weaponization, epistemic decay, or a simple failure to understand human values despite the ability to faithfully pursue specified goals. I’m glad there are people like you working on which values to embed in AI systems and ways to strengthen a society full of powerful AI.
On the other hand, I think this post misses the reason for the popular focus on intent alignment. Some people think that, for a sufficiently powerful AI trained in the current paradigm, there is no goal that it could faithfully pursue without collapsing into power seeking, reward hacking, and other instrumental goals leading to x-risk. Ajeya Cotra’s framing of this argument is most persuasive to me. Another version is Eliezer Yudkowsky’s “strawberry alignment problem”, which (I think) he believes is currently impossible to solve and captures the most challenging part of alignment:
“How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellular level, without causing anything weird or disruptive to happen?”
Personally I think there’s plenty of x-risk from intent-aligned systems and people should think about what we do once we have intent alignment. Eliezer seems to think this is more of a distraction from the real problem than it’s worth, but surveys suggest that many people in AI safety orgs think x-risk is disjunctive across many scenarios. Which is all to say, aligning AI with societal values is important, but I wouldn’t dismiss intent alignment either.
Thanks for those links and this reply.
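1.
“for a sufficiently powerful AI trained in the current paradigm, there is no goal that it could faithfully pursue without collapsing into power seeking, reward hacking, and other instrumental goals leading to x-risk”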
I don’t see how this is a counterargument to this post’s main claim:
P(misalignment x-risk | intent-aligned AGI) >> P(misalignment x-risk | societally-aligned AGI).
That problem of the collapse of a human-provided goal into AGI power-seeking seems to apply just as much to intent alignment as it does to societal alignment; it could apply even more, because the goals provided would be (a) far less comprehensive and (b) much less carefully crafted.
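2.
“Personally I think there’s plenty of x-risk from intent-aligned systems and people should think about what we do once we have intent alignment.”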
I agree with this. My point is not that we should not think about the risks of intent alignment, but rather that (if the arguments in this post are valid) AGI-capabilities-advancing technical research that actively pushes us closer to developing intent-aligned AGI is a net negative, for two reasons. First, it could lead us to develop intent-aligned AGIs that increase x-risk, because AGIs aligned to multiple humans with conflicting intentions can lead to out-of-control conflicts. Second, if we solve intent alignment before solving societal alignment, humans with intent-aligned AGIs are likely to be incentivized to inhibit the development and roll-out of societal AGI-alignment techniques, because adopting them would mean giving up significant power. Furthermore, humans with intent-aligned AIs would suddenly have significantly more power, and their advantages over others would likely compound, worsening the above issues.
Most current technical AI alignment research is AGI-capabilities-advancing research that actively pushes us closer to developing intent-aligned AGI, with the (usually implicit, sometimes explicit) assumption that solving intent alignment will subsequently help solve societal AGI alignment. But this would only be the case if all the humans who had access to intent-aligned AGI had the same intentions (and did not have any major conflicts between them), and that is highly unlikely.