We might reach a state of knowledge where it is easy to create AIs that are (i) misaligned, (ii) superhuman, and (iii) non-singular (i.e. a single such AI is not stronger than the sum total of humanity and aligned AIs), but hard/impossible to create aligned superhuman AIs.
My intuition is that it’d probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, mis-aligned superhuman AIs, and had cheap, robust methods to tell if a particular AI was misaligned. However, it seems pretty plausible that we’ll end up in a state where we know how to create non-singular, superhuman AIs; strongly suspect that most/all of them are mis-aligned; but don’t have great methods to tell whether any particular AI is aligned or mis-aligned. Does that sound right to you?
My intuition is that it’d probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, mis-aligned superhuman AIs, and had cheap, robust methods to tell if a particular AI was misaligned.
This sounds different from how I model the situation; my views agree here with Nate’s (emphasis added):
I would rephrase 3 as “There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I’d compare these mistakes to the “small” decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running. Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea. The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are “your team runs into a capabilities roadblock and can’t achieve AGI” or “your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time.”
My current model of ‘the default outcome if the first project to develop AGI is highly safety-conscious, is focusing on alignment, and has a multi-year lead over less safety-conscious competitors’ is that the project still fails, because their systems keep failing their tests but they don’t know how to fix the deep underlying problems (and may need to toss out years of work and start from scratch in order to have a real chance at fixing them). Then they either (a) lose their lead, and some other project destroys the world; (b) decide they have to ignore some of their tests, and move ahead anyway; or (c) continue applying local patches without understanding or fixing the underlying generator of the test failures, until they or their system find a loophole in the tests and sneak by.
I don’t think any of this is inevitable or impossible to avoid; it’s just the default way I currently visualize things going wrong for AGI developers with a strong interest in safety and alignment.
Possibly you’d want to rule out (c) with your stipulation that the tests are “robust”? But I’m not sure you can get tests that robust. Even in the best-case scenario where developers are in a great position to build aligned AGI and successfully do so, I’m not imagining post-hoc tests that are robust to a superintelligence trying to game them. I’m imagining that the developers have a prior confidence from their knowledge of how the system works that every part of the system either lacks the optimization power to game any relevant tests, or will definitely not apply any optimization to trying to game them.
Possibly you’d want to rule out (c) with your stipulation that the tests are “robust”? But I’m not sure you can get tests that robust.
That sounds right. I was thinking about an infinitely robust misalignment-oracle to clarify my thinking, but I agree that we’ll need to be very careful with any real-world tests.
If I imagine writing code and using the misalignment-oracle on it, I think I mostly agree with Nate’s point. If we have the code and compute to train a superhuman version of GPT-2, and the oracle tells us that any agent coming out from that training process is likely to be misaligned, we haven’t learned much new, and it’s not clear how to design a safe agent from there.
I imagine a misalignment-oracle to be more useful if we use it during the training process, though. Concretely, it seems like a misalignment-oracle would be extremely useful for achieving inner alignment in IDA: as soon as the AI becomes misaligned, we can either rewind the training process and figure out what we did wrong, or directly use the oracle as a training signal that severely punishes any step that makes the agent misaligned. Coupled with the ability to iterate on designs (since we won’t accidentally blow up the world along the way), I’d guess that something like this is more likely to work than not. This idea is extremely sensitive to (c), though.
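To make the oracle-in-the-training-loop idea concrete, here is a minimal toy sketch in Python. Everything in it is a made-up stand-in: ToyModel, misalignment_oracle, and train_with_oracle are hypothetical names, the “training” is just random noise, and the oracle is exactly the component we don’t know how to build. The sketch only shows where the rewind-or-penalize choice would sit in a training loop, not how IDA training actually works.

```python
# Hypothetical sketch: consulting a misalignment "oracle" inside a training loop.
# Nothing here is a real API; the oracle itself is the part we don't know how to build.

import copy
import random


class ToyModel:
    """Stand-in for the agent being trained (e.g. one IDA distillation step)."""

    def __init__(self):
        self.params = [0.0] * 4

    def train_step(self):
        # Stand-in for one gradient update / amplification-distillation round.
        self.params = [p + random.uniform(-0.1, 0.1) for p in self.params]


def misalignment_oracle(model: ToyModel) -> bool:
    """Hypothetical oracle: returns True if the model is misaligned.
    Here it's an arbitrary dummy check; in the discussion it is assumed to be robust."""
    return sum(model.params) > 1.0


def train_with_oracle(num_steps: int = 100, rewind: bool = True) -> ToyModel:
    model = ToyModel()
    checkpoint = copy.deepcopy(model)  # last state the oracle approved of

    for _ in range(num_steps):
        model.train_step()

        if misalignment_oracle(model):
            if rewind:
                # Option 1: roll back to the last aligned checkpoint and
                # (in a real project) investigate what went wrong.
                model = copy.deepcopy(checkpoint)
            else:
                # Option 2: treat the oracle verdict as a training signal,
                # e.g. heavily penalise the offending update (not shown here).
                pass
        else:
            checkpoint = copy.deepcopy(model)  # record the new aligned state

    return model


if __name__ == "__main__":
    trained = train_with_oracle()
    print("final params:", trained.params)
```

The rewind branch corresponds to rolling back to the last state the oracle approved of; the other branch is where an oracle-derived penalty would enter the real training signal. As noted above, both depend on the oracle being robust enough that the trained system cannot game it, which is why the idea is so sensitive to (c).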
Has the “alignment roadblock” scenario been argued for anywhere?
Like Lanrian, I think it sounds implausible. My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem. For example, the AI which can talk its way out of a box probably has a very deep understanding of humans—a deeper understanding than most humans have of humans! In order to have such a deep understanding, it must have lower-level building blocks for making sense of the world which work extremely well, and could be used for a value learning system.
BTW, coincidentally, I quoted this same passage in a post I wrote recently which discussed this scenario (among others). Is there a particular subscenario of this I outlined which seems especially plausible to you?
My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem.
Especially because taking over the world requires you to be much better than other agents who want to stop you from taking over the world, which could very well include other AIs.
ETA: That said, upon reflection, there have been instances of people taking over large parts of the world without being superhuman. All world leaders qualify, and it isn’t that unusual. However, what would be unusual is if someone wanted to take over the world, everyone else didn’t want that, and it still happened anyway.
In a scenario where multiple AIs compete for power, the AIs that make fast decisions without checking back with humans have an advantage in the power competition and are going to gain more power over time.
Additionally, AGIs differ fundamentally from humans because they can spin up multiple copies of themselves when they get more resources, while human beings can’t similarly scale their power when they have access to more food.
The best human hacker can’t run a cyber war alone, but if he could spin up 100,000 copies of himself, he could find enough zero-days to hack into all important computer systems.
In a scenario where multiple AIs compete for power, the AIs that make fast decisions without checking back with humans have an advantage in the power competition and are going to gain more power over time.
Agreed this is a risk, but I wouldn’t call this an alignment roadblock.