The idea of AI alignment rests on the assumption that there is a finite, stable set of data about a person which could be used to predict their choices, and which is actually morally good. The reasoning behind this assumption is that if it is not true, then learning is impossible, useless, or will not converge.
Is it true that these assumptions are required for AI alignment?
I don’t think it would be impossible to build an AI that is sufficiently aligned to know that, at pretty much any given moment, I don’t want to be spontaneously injured, or be accused of doing something that will reliably cause all my peers to hate me, or for a loved one to die. There’s quite a broad list of “easy” specific “alignment questions” that virtually 100% of humans will agree on in virtually 100% of circumstances. We could do worse than just building the partially-aligned AI that makes sure we avoid fates worse than death, individually and collectively.
On the other hand, I agree completely that coupling the concepts of “AI alignment” and “optimization” seems pretty fraught. I’ve wondered if the “optimal” environment for the human animal might be a re-creation of the Pleistocene, except with, y’know, immortality, and carefully managed, exciting-but-not-harrowing levels of resource scarcity.
There are some troubles in creating a full and safe list of such human preferences, and there was an idea that an AI would be capable of learning actual human preferences by observing human behaviour, or by other means such as inverse reinforcement learning.
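For what it’s worth, here is a minimal sketch of the kind of value learning this refers to: inferring preference weights from observed choices under an assumed Boltzmann-rational choice model. Everything in it (the feature names, the numbers, the rationality parameter) is an illustrative assumption, not anything from the post or from any particular IRL paper.

```python
# Toy preference learning from observed choices (illustrative sketch only).
# Assumed model: P(choose option i) ∝ exp(beta * w · features_i).
# Given observed choices, we fit the hidden reward weights w by maximum likelihood.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vectors for options a person repeatedly chooses among,
# e.g. columns = [safety, social standing, novelty].
options = np.array([
    [1.0, 0.2, 0.0],   # safe but dull
    [0.2, 1.0, 0.5],   # socially rewarding, some risk
    [0.0, 0.1, 1.0],   # novel, risky
])

true_w = np.array([2.0, 1.0, 0.5])   # "true" hidden preferences (assumed for the demo)
beta = 3.0                            # assumed degree of rationality (inverse temperature)

def choice_probs(w, opts, beta):
    logits = beta * opts @ w
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Simulate observed choices made by the "human".
observed = rng.choice(len(options), size=500, p=choice_probs(true_w, options, beta))
counts = np.bincount(observed, minlength=len(options))

# Fit w by gradient ascent on the log-likelihood of the observed choices.
w = np.zeros(3)
for _ in range(2000):
    p = choice_probs(w, options, beta)
    grad = beta * (options.T @ counts - len(observed) * options.T @ p)
    w += 1e-3 * grad / len(observed)

print("recovered preference weights (approximate):", np.round(w, 2))
```

Even in this toy setting, the recovered weights depend on the assumed choice model and rationality parameter, which is part of why learning “actual” human preferences from behaviour is not straightforward.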
This post of mine basically shows that value learning will also have troubles, as there are no real human values, so some other way to create such a list of preferences is needed.
How to align the AI with existing preferences, presented in human language, is another question. Yudkowsky wrote that without taking into account the complexity of value, we can’t make a safe AI, as it would wrongly interpret short commands without knowing the context.