Overall this is still encouraging. It seems to take seriously that:
- value alignment is hard
- executive-AI should be banned
- banning executive-AI would be hard
- alignment research and AI safety are worthwhile.
I feel like there are enough shared assumptions that collaboration or dialogue with AI notkilleveryoneists could be very useful.
That said, I wish there were more details about his Scientist AI idea:
- How exactly will the Scientist AI be used?
- Should we expect the Scientist AI to have situational awareness?
- Would the Scientist AI be allowed to write large-scale software projects that are likely to get executed after only a brief human review of the code?
- Are there concerns about mesa-optimization?
Also, it is not clear to me whether the safety is supposed to come from:
- the AI being unable to take actions in the world (so that even if a superhuman AI wants to do large-scale harm, it will not succeed, because it cannot take actions that achieve those goals),
- the AI having no intrinsic motivation for large-scale harm (while its output bits could in principle cause large-scale harm, such a string of bits is unlikely because there is no drive towards it), or
- a combination of these two.
This also seems very encouraging to me! In some sense he seems to be where Holden was 10 years ago, and he seems to be a pretty good and sane thinker on AGI risk now. I have hope that similar arguments will be compelling to both of them, so that Bengio will also come to recognize some of the same errors that I see him making here.
I think it’s an example of an AI that completely lacks the notion of in-world goals. Its goal is restricted to a purely symbolic system; that system happens to map to parts of the world, but the AI lacks the self-reflection to realise which symbols map to itself and its immediate environment, and how manipulating those symbols may make it better at accomplishing its goals. Severing that feedback loop is IMO the key to avoiding instrumental convergence. Without that, all you get is another variety of chess-playing AI: superhumanly smart at its own task, but the world within which it optimizes its goals is too abstract for that skill to be “portable” to a dangerous domain.
There’s not much context to this claim made by Yoshua Bengio, but while searching Google News I found a Spanish online newspaper article* in which he claims:
We need to create machines that assist us, not independent beings. That would not be a good idea; it would lead us down a very dangerous path.
*https://www.larazon.es/sociedad/20221121/5jbb65kocvgkto5hssftdqe7uy.html