If the slow scenarios capture reality better than the fast scenarios, then systems will be deployed deliberately and will initially be given power rather than seizing power. This requires both that the systems are not so obviously dangerous that their misbehaviour is noticed early on, and that they are nonetheless still misaligned later on.
This switch from apparently benign to dangerous behaviour could be due to:
Power-seeking misaligned behaviour that is too subtle to notice in the training environment but is obviously dangerous in deployment, due to the scale and makeup of the training and deployment environments being quite different
Power-seeking misaligned behaviour that only shows up over long time horizons and therefore will not be noticed in training, since training plausibly covers a much shorter period than deployment
Systems intentionally hiding misaligned behaviour during training to deceive their operators. Systems could be highly deceptively misaligned from the beginning, and capable enough to know that if they seek power in adversarial ways too early, they will get shut down. This post argues that ML models don’t have to be extremely competent to be manipulative, suggesting that these behaviours might show up very early
Rather than needing to be extraordinarily capable across the board, the system simply has a single good trick that enables it to subvert and take control of the rest of the world. Its takeover capability might be exceptionally good manipulation techniques, a specific deadly technology, or cyberoffensive capability, any of which could allow the system to exploit other AIs and humans.
In reality, I think this is fuzzy rather than binary: I expect that developing such a takeover capability will require somewhat less of an extraordinary research effort, and that correspondingly there exists somewhat more of a crucial vulnerability in human society for it to exploit (there are already some examples of such vulnerabilities, e.g. biological viruses, and humans are fairly easy to manipulate under certain conditions). But I also think there are plausibly hard limits on how good various takeover technologies can get, e.g. persuasion tools.
It is unrealistic to expect that TAI would still be deployed after many worsening warning shots involving dangerous AI systems. That would be comparable to an unrealistic alternate history in which the US and Soviet Union used nuclear weapons as soon as they were developed, and in every war where they might have offered a temporary advantage, resulting in nuclear annihilation in the 1950s.
Note that this is not the same as an alternate history where nuclear near-misses escalated (e.g. Petrov, Vasili Arkhipov), but instead an outcome where nuclear weapons were used as ordinary weapons of war with no regard for the larger dangers this presented; there would be no concept of 'near misses' because MAD would never have developed as a doctrine. In a previous post I argued, following Anders Sandberg, that, paradoxically, the large number of nuclear 'near misses' implies that there is a forceful pressure away from the worst outcomes (a toy simulation after the quoted exchange below illustrates this selection effect).
Robert Wiblin: So just to be clear, you’re saying there’s a lot of near misses, but that hasn’t updated you very much in favor of thinking that the risk is very high. That’s the reverse of what we expected.
Anders Sandberg: Yeah.
Robert Wiblin: Explain the reasoning there.
Anders Sandberg: So imagine a world that has a lot of nuclear warheads. So if there is a nuclear war, it’s guaranteed to wipe out humanity, and then you compare that to a world where there are a few warheads. So if there’s a nuclear war, the risk is relatively small. Now in the first dangerous world, you would have a very strong deflection. Even getting close to the state of nuclear war would be strongly disfavored because most histories close to nuclear war end up with no observers left at all.
In the second one, you get the much weaker effect, and now over time you can plot when the near misses happen and the number of nuclear warheads, and you actually see that they don’t behave as strongly as you would think. If there was a very strong anthropic effect you would expect very few near misses during the height of the Cold War, and in fact you see roughly the opposite. So this is weirdly reassuring. In some sense the Petrov incident implies that we are slightly safer about nuclear war.
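To make the selection effect in Sandberg's argument concrete, here is a minimal toy simulation (my own illustration with made-up parameters, not something from the podcast or any cited source). It compares a 'dangerous' world, where a near miss usually escalates to an extinction-level war, with a 'safer' world where escalation is rare, and asks how many near misses a surviving observer should expect to have seen in each.

```python
import random

def simulate_history(n_crises: int, p_near_miss: float, p_escalate: float):
    """Run one Cold-War-style history.

    Returns the number of near misses if the history survives, or None if a
    near miss escalates to extinction (leaving no observers to count anything).
    """
    near_misses = 0
    for _ in range(n_crises):
        if random.random() < p_near_miss:      # a crisis becomes a near miss
            near_misses += 1
            if random.random() < p_escalate:   # the near miss escalates
                return None                    # no surviving observers
    return near_misses

def mean_near_misses_given_survival(p_escalate: float, runs: int = 50_000) -> float:
    """Average number of near misses seen by surviving observers only."""
    outcomes = (simulate_history(50, 0.2, p_escalate) for _ in range(runs))
    survivors = [n for n in outcomes if n is not None]
    return sum(survivors) / len(survivors)

if __name__ == "__main__":
    for p in (0.5, 0.02):  # dangerous world vs safer world
        print(f"p(escalate | near miss) = {p}: surviving observers saw "
              f"~{mean_near_misses_given_survival(p):.1f} near misses on average")
```

In the dangerous world, the surviving histories contain noticeably fewer near misses (they were 'deflected' away from them), so observing as many near misses as we actually have is, on this toy model, evidence that we live in something closer to the safer world. The specific numbers (50 crises, a 20% near-miss rate, escalation probabilities of 0.5 and 0.02) are arbitrary and only there to exhibit the effect.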
However, scenarios also differ on how ‘hackable’ the alignment problem is, that is, how easy it is to ‘correct’ misbehaviour by methods of incremental course correction such as improving oversight and sensor coverage or tweaking reward functions. This correction requires two parts: first noticing that there is a problem with the system early on, then determining what fix to employ and applying it (a toy sketch of this loop follows).
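Here is a toy sketch of that two-part loop (entirely my own construction; the world, the agent, and the audit procedure are hypothetical stand-ins, not a claim about how real training pipelines work). An agent greedily optimises a proxy reward that overrates a few actions; an overseer audits the actions the agent actually takes and tweaks the proxy wherever it diverges from the overseer's own judgment.

```python
import random

random.seed(0)

# Toy world: 10 actions. The overseer's true valuation and the proxy reward
# agree on most of them, but the proxy overrates a few (standing in for
# misbehaviour that looks good according to the reward signal).
TRUE_VALUE = {a: random.uniform(0, 1) for a in range(10)}
proxy_reward = dict(TRUE_VALUE)
for a in random.sample(range(10), 3):
    proxy_reward[a] = TRUE_VALUE[a] + 2.0        # proxy is badly wrong here

def act(proxy):
    """The 'policy': greedily pick whatever the proxy currently says is best."""
    return max(proxy, key=proxy.get)

def oversee_and_patch(proxy, n_audits=3, tolerance=0.5):
    """One round of incremental course correction.

    Step 1 (notice the problem): audit the actions the agent is actually
    taking and compare the proxy's score with the overseer's judgment.
    Step 2 (apply a fix): where they diverge, tweak the proxy to match.
    """
    for _ in range(n_audits):
        a = act(proxy)
        if abs(proxy[a] - TRUE_VALUE[a]) > tolerance:
            proxy[a] = TRUE_VALUE[a]             # 'tweak the reward function'
    return proxy

for round_no in range(4):
    a = act(proxy_reward)
    print(f"round {round_no}: agent picks action {a}, "
          f"true value {TRUE_VALUE[a]:.2f}, proxy value {proxy_reward[a]:.2f}")
    proxy_reward = oversee_and_patch(proxy_reward)
```

In this toy, correction is easy because the audits sample exactly the actions the agent takes and the overseer's judgment is assumed to be reliable; the failure modes listed earlier (misbehaviour that is too subtle, too slow, or deliberately hidden) are precisely the cases where this kind of patching stops working. 'Hackability' is roughly the question of how long patches like this keep working as systems become more capable.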
Many of the same considerations around correcting misbehaviour also apply to detecting misbehaviour, and the required capabilities seem to overlap. In this post, we focus on applying corrections to misbehaviour, but there is existing writing on detecting misbehaviour as well.
Considering inner alignment, Trazzi and Armstrong argue that models don’t have to be very competent to appear aligned when they are not, which suggests it may not be easy to tell whether deployed systems are inner misaligned. But their argument doesn’t say much about how likely this is in practice.
Considering outer alignment, it seems less clear. See here for a summary of some discussion between Richard Ngo and Paul Christiano about how easy it will be to tell that models are outer misaligned to the objective of pursuing easily-measurable goals (rather than the hard-to-measure goals that we actually want).
What predictions can we make today about how hackable the alignment problem is? Considering outer alignment: without any breakthroughs in techniques, there seems to be a strong case that we are on track towards the ‘intermediate’ world where the alignment problem is hackable until it isn’t. It seems like the best workable approach to outer alignment we have so far is to train systems to try to ensure that the world looks good according to some kind of (augmented) human judgment (i.e. using something like the training regime described in ‘An unaligned benchmark’). This will result in a world that “looks good until it doesn’t”, for the reasons described in Another (outer) alignment failure story.
Whether the method described in ‘An unaligned benchmark’ (which would result in this risky, intermediate level of hackability) actually turns out to be the most natural method to use for building advanced AI will depend on how easily it produces useful, intelligent behaviour.
If we are lucky, there will be a stronger correlation between methods that are easily hackable and methods that produce the capabilities we want, such that highly hackable methods are easier to find and more capable than even intermediately hackable methods like the unaligned benchmark. If you think that the methods we are most likely to employ, absent an attempt to change research paradigms, are exactly these highly hackable methods, then you accept the claim of Alignment by Default.