Second, we could more or less deal with systems that defect as they arise. For instance, during deployment we could notice that some systems are optimizing for something different from what we intended during training, and shut them down.
Each individual system won’t by itself carry more power than the sum of the projects before it. Instead, each AI will be only slightly better than the ones that came before it, including any AIs we are using to monitor the newer ones.
If the sum of the projects from before carries more power than the individual system, such that it can’t win by defection, there’s no reason for it to defect. It might just join the ranks of “projects from before”, and subtly try to alter future systems to be similarly defective, waiting for a future opportunity to strike. If the way we build these things systematically renders them misaligned, we’ll sooner or later end up with a majority of them being misaligned, at which point we can’t trivially use them to shut down defectors.
(I agree that continuous takeoff does give us more warning, because some systems will presumably defect early, especially weaker ones. And IDA is kind of similar to this strategy, and could plausibly work. I just wanted to point out that a naive implementation of this doesn’t solve the problem of treacherous turns.)
Expanding on that a little: even if we know our AIs are misaligned, that doesn’t necessarily save us. We might reach a state of knowledge where it is easy to create AIs that are (i) misaligned, (ii) superhuman, and (iii) non-singular (i.e. a single such AI is not stronger than the sum total of humanity and aligned AIs), but hard/impossible to create aligned superhuman AIs. Since misaligned AIs that can’t take over still mostly follow human instructions, there will be tremendous economic incentives to deploy more such systems. This is effectively a tragedy of the commons: for every individual actor, deploying more AIs only increases global risk a little but brings in tremendous revenue. Collectively, however, risk accumulates rapidly. At some point the total power of misaligned AIs crosses some threshold (hard to predict in advance) and there is a phase transition (a cascade of failures) from a human-controlled world to a world controlled by a coalition of misaligned AIs. Alternatively, the AIs might find a way to manipulate our entire culture into gradually changing its values into something the AIs prefer (as with Murder Gandhi).
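To make that dynamic concrete, here is a minimal toy model of the deployment race described above. Every number in it (the number of actors, the power added per deployed AI, the threshold) is invented purely for illustration; the only point is that each actor’s marginal contribution to risk looks negligible right up until the collective total crosses the threshold.

```python
# Toy model of the tragedy-of-the-commons deployment dynamic described
# above. All numbers are invented for illustration; this is not a forecast.
N_ACTORS = 20          # competing labs/companies (assumed)
REVENUE_PER_AI = 1.0   # private gain per deployment (arbitrary units)
POWER_PER_AI = 0.02    # "misaligned power" each deployed AI adds
HUMAN_POWER = 1.0      # combined power of humanity + aligned AIs

misaligned_power = 0.0
revenue_per_actor = 0.0
for year in range(1, 20):
    # Each actor deploys one more AI per year: their own marginal
    # contribution to global risk is tiny, their private revenue is not.
    misaligned_power += N_ACTORS * POWER_PER_AI
    revenue_per_actor += REVENUE_PER_AI
    status = ("phase transition now possible"
              if misaligned_power > HUMAN_POWER else "human-controlled")
    print(f"Year {year}: revenue/actor {revenue_per_actor:.0f}, "
          f"misaligned power {misaligned_power:.2f} -> {status}")
    if misaligned_power > HUMAN_POWER:
        break
```

Nothing hinges on the particular numbers; the qualitative point is just that the risk is nearly invisible in each individual deployment decision until the collective threshold is crossed.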
We might reach a state of knowledge where it is easy to create AIs that are (i) misaligned, (ii) superhuman, and (iii) non-singular (i.e. a single such AI is not stronger than the sum total of humanity and aligned AIs), but hard/impossible to create aligned superhuman AIs.
My intuition is that it’d probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, misaligned superhuman AIs, and had cheap, robust methods to tell whether a particular AI was misaligned. However, it seems pretty plausible that we’ll end up in a state where we know how to create non-singular, superhuman AIs; strongly suspect that most/all of them are misaligned; but don’t have great methods to tell whether any particular AI is aligned or misaligned. Does that sound right to you?
My intuition is that it’d probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, misaligned superhuman AIs, and had cheap, robust methods to tell whether a particular AI was misaligned.
This sounds different from how I model the situation; my views agree here with Nate’s (emphasis added):
I would rephrase 3 as “There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I’d compare these mistakes to the “small” decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running. Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea. The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are “your team runs into a capabilities roadblock and can’t achieve AGI” or “your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time.”
My current model of ‘the default outcome if the first project to develop AGI is highly safety-conscious, is focusing on alignment, and has a multi-year lead over less safety-conscious competitors’ is that the project still fails, because their systems keep failing their tests but they don’t know how to fix the deep underlying problems (and may need to toss out years of work and start from scratch in order to have a real chance at fixing them). Then they either (a) lose their lead, and some other project destroys the world; (b) decide they have to ignore some of their tests, and move ahead anyway; or (c) continue applying local patches without understanding or fixing the underlying generator of the test failures, until they or their system find a loophole in the tests and sneak by.
I don’t think any of this is inevitable or impossible to avoid; it’s just the default way I currently visualize things going wrong for AGI developers with a strong interest in safety and alignment.
Possibly you’d want to rule out (c) with your stipulation that the tests are “robust”? But I’m not sure you can get tests that robust. Even in the best-case scenario where developers are in a great position to build aligned AGI and successfully do so, I’m not imagining post-hoc tests that are robust to a superintelligence trying to game them. I’m imagining that the developers have a prior confidence from their knowledge of how the system works that every part of the system either lacks the optimization power to game any relevant tests, or will definitely not apply any optimization to trying to game them.
Possibly you’d want to rule out (c) with your stipulation that the tests are “robust”? But I’m not sure you can get tests that robust.
That sounds right. I was thinking about an infinitely robust misalignment-oracle to clarify my thinking, but I agree that we’ll need to be very careful with any real-world tests.
If I imagine writing code and using the misalignment-oracle on it, I think I mostly agree with Nate’s point. If we have the code and compute to train a superhuman version of GPT-2, and the oracle tells us that any agent coming out from that training process is likely to be misaligned, we haven’t learned much new, and it’s not clear how to design a safe agent from there.
I imagine a misalignment-oracle to be more useful if we use it during the training process, though. Concretely, it seems like a misalignment-oracle would be extremely useful for achieving inner alignment in IDA: as soon as the AI becomes misaligned, we can either rewind the training process and figure out what we did wrong, or directly use the oracle as a training signal that severely punishes any step that makes the agent misaligned. Coupled with the ability to iterate on designs, since we won’t accidentally blow up the world along the way, I’d guess that something like this is more likely to work than not. This idea is extremely sensitive to (c), though.
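For what it’s worth, here is a minimal sketch of what that loop might look like, assuming a perfectly robust oracle. Everything in it is hypothetical: `misalignment_oracle`, `train_step`, and the model object are placeholder names rather than any real API, and a real implementation might fold the oracle’s verdict into the loss as a severe penalty instead of doing a hard rewind.

```python
# A minimal sketch of the "misalignment-oracle during training" idea above.
# Everything here is hypothetical: `misalignment_oracle`, `train_step`, and
# the model object are placeholders, not a real API.
import copy

def train_with_oracle(model, data_stream, train_step, misalignment_oracle,
                      max_steps=10_000, checkpoint_every=100):
    """Train `model`, consulting the oracle after every update.

    If the oracle flags the model as misaligned, rewind to the last
    checkpoint that passed, so developers can inspect what the offending
    update did; alternatively, the verdict could be used as a training
    penalty rather than a hard rewind.
    """
    last_good = copy.deepcopy(model)
    for step, batch in zip(range(max_steps), data_stream):
        train_step(model, batch)           # one ordinary training update
        if misalignment_oracle(model):     # assumed perfectly robust here
            model = copy.deepcopy(last_good)
            print(f"Oracle flagged step {step}; rewound to last good checkpoint.")
            # ...inspect `batch` / the rejected update to see what went wrong...
        elif step % checkpoint_every == 0:
            last_good = copy.deepcopy(model)
    return model
```

The (c)-style worry from Nate’s comment applies directly here: the loop is only as useful as the oracle is robust to a system that is optimizing to pass it.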
Has the “alignment roadblock” scenario been argued for anywhere?
Like Lanrian, I think it sounds implausible. My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem. For example, the AI which can talk its way out of a box probably has a very deep understanding of humans—a deeper understanding than most humans have of humans! In order to have such a deep understanding, it must have lower-level building blocks for making sense of the world which work extremely well, and could be used for a value learning system.
BTW, coincidentally, I quoted this same passage in a post I wrote recently which discussed this scenario (among others). Is there a particular subscenario of this I outlined which seems especially plausible to you?
My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem.
Especially because taking over the world requires you to be much better than other agents who want to stop you from taking over the world, which could very well include other AIs.
ETA: That said, upon reflection, there have been instances of people taking over large parts of the world without being superhuman. All world leaders qualify, and it isn’t that unusual. What would be unusual, however, is if someone wanted to take over the world, everyone else didn’t want that, and yet it still happened.
In a scenario where multiple AIs compete for power, the AIs that make fast decisions without checking back with humans have an advantage in the power competition and are going to gain more power over time.
Additionally, AGIs differ fundamentally from humans because they can spin up multiple copies of themselves when they get more resources, while human beings can’t similarly scale their power when they have access to more food.
The best human hacker can’t run a cyber war alone, but if they could spin off 100,000 copies of themselves, they could find enough zero-days to hack into all important computer systems.
In a scenario where multiple AIs compete for power, the AIs that make fast decisions without checking back with humans have an advantage in the power competition and are going to gain more power over time.
Agreed this is a risk, but I wouldn’t call this an alignment roadblock.
It might just join the ranks of “projects from before”, and subtly try to alter future systems to be similarly defective, waiting for a future opportunity to strike.
Admittedly, I did not explain this point well enough. What I meant to say was that before we have the first successful defection, we’ll have some failed defections. If a system could indefinitely hide its own private intentions to later defect, then I would already consider that to be a ‘successful defection.’
Knowing about a failed defection, we’ll learn from our mistake and patch that for future systems. To be clear, I’m definitely not endorsing this as a normative standard for safety.
I agree with the rest of your comment.