It’s great to see Yoshua Bengio and other eminent AI scientists like Geoffrey Hinton actively engage in the discussion around AI alignment. He evidently put a lot of thought into this. There is a lot I agree with here.
Below, I’ll discuss two points where I disagree or am surprised by his takes, to highlight potential topics of discussion, e.g. if someone wants to engage directly with Bengio.
Most of the post is focused on the outer alignment problem—how do we specify a goal aligned with our intent—and seems to ignore the inner alignment problem—how do we ensure that the specified goal is optimized for.
For example, he gives the scenario of us telling the AI to fix climate change, after which the AI wipes out humanity, since that fixes climate change more effectively than respecting implicit constraints the AI has no knowledge of. In fact, I think language models show that there is quite some hope that AI models will understand our implicit intent. Under that view, the problem lies at least as much in ensuring that the AI cares.
He also extensively discusses the wireheading problem of entities (e.g., humans, corporations, or AI systems) that try to maximize their reward signal. I think we have reasons to believe that wireheading isn’t as much of a concern: inner misalignment will cause the agent to have some goal other than the precise maximization of the reward function, and once the agent is situationally aware, it has incentives to prevent gradient descent from changing its goals.
He does discuss the fact that our brains reward us for pleasure and avoiding pain, which is misaligned with the evolutionary goal of genetic fitness. In the alignment community, this is most often discussed as an inner alignment issue between the “reward function” of evolution and the “trained agent” being our genomes. However, his discussion highlights that he seems to view it as an outer alignment issue between evolution and our reward signals in the brain, which shape our adult brains through in-lifetime learning. This is also the viewpoint in Brain-Like-AGI Safety, as far as I remember, and it seems related to viewpoints discussed in shard theory.
“In fact, over two decades of work in AI safety suggests that it is difficult to obtain AI alignment [wikipedia], so not obtaining it is clearly possible.”
I agree with the conclusion, but I am surprised by the argument. It is true that we have seen over two decades of alignment research, but the alignment community has been fairly small all this time. I’m wondering what a much larger community could have done.
I start to get concerned when I look at humanity’s non-AI alignment successes and failures. We’ve had corporations for hundreds of years, and a significant portion of humanity has engaged in corporate alignment-related activities (regulation, lawmaking, governance, etc., assuming you consider those forces to generally be pro-alignment in principle). Corporations and governments have exhibited a strong tendency to become less aligned as they grow. (Corporate rap sheets, if a source is needed.)
We’ve also been in the company of humans for millennia, and we haven’t been entirely successful in aligning ourselves, if you consider war, murder, terrorism, poverty, child abuse, climate change and others to be symptoms of individual-level misalignment (in addition to corporate/government misalignment).
It’s hard for me to be hopeful about AI alignment if I believe that a) humans individually can be very misaligned; b) corporations and governments can be very misaligned; and c) AGI/ASI (even if generally aligned) will at some point be under the control of very misaligned instances of any of the above.
I think it’s great that alignment problems are getting more attention, and hope we find solid solutions. I’m disheartened by humanity’s (ironically?) poor track record of achieving solid alignment in our pre-AI endeavours. I’m glad that Bengio draws parallels between AI alignment problems and corporate alignment, individual alignment, and evolutionary pressures, because I think there is still much to learn by looking outside of AI for ideas about where alignment attempts may go wrong or be subverted.