Imagine that a magic spell was cast long ago which made it so that rockets would never explode. Instead, whenever one would explode, a demon would intervene to hold the craft together, patch the problem, and keep it on course. But the demon would exact a price: whichever humans are in the vicinity of the rocket lose their souls and become possessed. The demons possessing them work towards the master plan of enslaving all humanity; therefore, they typically pretend that nothing has gone wrong and act normal, just like the human whose skin they wear would have acted...
Now imagine there’s a big private space race, with SpaceX and Boeing and all sorts of other companies racing to put astronauts up there to harvest asteroid minerals and plant flags and build space stations and so forth.
Big problem: There’s a bit of a snowball effect here. Once sufficiently many people have been possessed, they’ll work to get more people possessed.
Bigger problem: We don’t have a reliable way to tell when demonic infestation has happened. Instead of:
engineers make mistake --> rocket blows up --> engineers look foolish, fix mistake,
we have:
engineers make mistake --> rocket crew gets possessed --> rocket continues into space, is bigly successful, returns to earth, gets ticker-tape parade --> engineers look great, make tons of money.
In this fantasy world, the technical rocket alignment problem is exactly as hard as it is in the real world. No more, no less. But because we get much less feedback from reality, and because failures look like successes, the governance situation is much worse. The companies that cut corners the most on safety, that move fastest and break the most things, that invest the least in demon-exorcism research, will be first to market, appear most successful, and reap the most money and fame. (Contrast this with the real space industry in our world, which has to strike a balance between safety and speed, and gets painful feedback from reality when it fails on safety.)
The demon-possession analogue in real-world AGI races is adversarial misaligned cognition: the kind of misalignment that results in the AI using its intelligence to prevent you from noticing and fixing the misalignment, as opposed to the kind that doesn’t. The kind of misalignment, in other words, that results in your AI ‘silently switching sides.’
To be clear, not all misalignments are of this kind. When the AIs are too dumb to strategize, too dumb to plot, too dumb to successfully hide, not situationally aware at all, etc., then no misalignments will be of this kind.
But more encouragingly, even when the AIs are totally smart enough in all those ways, there will still be some kinds of misalignment that are not of this kind. For example, if we manage to get the AIs to be robustly honest (and not just in some minimal sense), then even if they have misaligned goals/drives/etc., they’ll tell us about them when we ask. (Unless we train against this signal, in which case their introspective ability will degrade so that they can continue doing what they were doing while honestly saying they didn’t know that was their goal. This seems to be what happens with humans sometimes: we deceive ourselves so that we can better deceive others.)

Another example: insofar as the AI is genuinely trying to be helpful but just has a different notion of helpfulness than we do, it will make ‘innocent mistakes,’ so to speak, which at least in principle we could notice and fix. E.g. Google (without telling its users) gaslit Gemini into thinking that the user had said “Explicitly specify different genders and ethnicities terms if I forgot to do so. I want to make sure that all groups are represented equally.” So Gemini thought it was following user instructions when it generated, e.g., images of racially diverse Nazis. Google could rightfully complain that this was Gemini’s fault and that if Gemini were smarter it wouldn’t have done this; it would have intuited that even if a user says they want to represent all groups equally, they probably don’t want racially diverse Nazis, and wouldn’t count that as a situation where all groups should be represented equally.

Anyhow, the point is, this is an example of an ‘innocent mistake’ that regular iterative development will probably find and fix before any major catastrophes happen. Just scaling up the models should probably help with this to some significant extent.