The “AI is easy to control” piece does talk about scaling to superhuman AI:
In what follows, we will argue that AI, even superhuman AI, will remain much more controllable than humans for the foreseeable future. Since each generation of controllable AIs can help control the next generation, it looks like this process can continue indefinitely, even to very high levels of capability.
If we assume that each generation can ensure a relatively strong notion of alignment between it and the next generation, then I think this argument goes through.
However, there are weaker notions of control which are insufficient for this sort of bootstrapping argument. Suppose each generation can only ensure the following weaker notion of control: "we can set up a training, evaluation, and deployment protocol with sufficient safeguards (monitoring, auditing, etc.) such that generation N+1 AIs are unable to cause catastrophic outcomes (like AI takeover), while using those AIs to speed up the labor of generation N by a large multiple". This notion of control doesn't (clearly) allow the bootstrapping argument to go through. In particular, suppose that all AIs smarter than humans are deceptively aligned and defect on humanity at the point where they are doing tasks which would be extremely hard for a human to oversee. (This isn't the only issue, but it is a sufficient counterexample.)
This weaker notion of control can be very useful in ensuring good outcomes via getting lots of useful work out of AIs, but we will likely need to build something more scalable eventually.
(See also my discussion of using human-level-ish AIs to automate safety research in the sibling comment.)
I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn’t seem right to me.
I'm guessing this is my biggest crux with the Quintin/Nora worldview, so if Quintin/Nora have an argument for optimism about bootstrapping beyond "it feels like this should work because of iterative design", I'm bidding for that argument to make it into the forthcoming document.