Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
[...]
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs.
It currently seems unlikely to me that the marginal AI control research I’m excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat economically relevant.
I can also imagine worlds where AIs are clearly schemers and clearly unproductive to use by default, such that control increases deployment through both straightforward incentives and risk compensation (though this maybe requires the AIs to have made a mistake or to have somewhat strange preferences, as they would probably prefer to fool our tests).
My understanding is that your concerns about the downside are maybe some combination of:
1. AI control research will be economically very useful (e.g. it accelerates AI by 6 months conditional on 5-year timelines).
2. Control will result in irrational risk-compensation-style actions, where reducing earlier risks provides false comfort about later risks (as control isn’t scalable), and this will ultimately make the situation worse.
3. AI control prevents earlier warning shots which would have created more will for costly actions that save us later. (I’m not sure if this is actually your concern, though some statements pattern-matched to this.)
I’m very skeptical of (1) for marginal control research. It feels like this has to imply incredibly strong levels of both inefficiency and usefulness of control research. Perhaps you’re imagining a smaller acceleration effect, but also think the benefits are tiny relative to this small acceleration?
I’m somewhat sympathetic to things being inefficient enough that if the field of people who might do control research did capabilities work instead, that would be a notable acceleration (maybe 5%), but control research itself should produce far less acceleration than this.
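To spell out the back-of-the-envelope comparison I have in mind here (using only the illustrative numbers above, nothing new):

$$\frac{6\ \text{months}}{5\ \text{years}} = \frac{6}{60} = 10\%\ \text{acceleration implied by (1)}, \qquad \text{vs. roughly } 5\%\ \text{if the whole field did capabilities directly}.$$

So (1) would require marginal control research to accelerate AI roughly twice as much as redirecting everyone who might do control research into capabilities work, which is the combination of inefficiency and usefulness I find implausible.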
On (2), this sounds like a weird intermediate level of competence, where the relevant actors are actually sensitive to full-on scheming risks at an earlier point (when control works), but end up thinking control will scale despite it being relatively obvious that it won’t. I agree that control might end up being an excuse for scaling, but I don’t think it will be a counterfactual excuse (some other excuse would have been found), and I think we should just make sure to argue on the merits at that point.
I do think that (2) points toward focusing on advancing evaluation methodology more than techniques, so that when things fail it would be relatively clear if people are being unreasonable and relatively easy to call out that unreasonableness.
I think not doing control due to (3) can put you in pretty scary situations: how can we know whether we’ll get scary warning shots rather than full AI takeover? Further, AIs presumably want to avoid warning shots, though it is unclear how much AIs will systematically want to advance the interests of other AIs. At a more basic level, I don’t love arguments of the form “actually, having bad things happen will be good, so we shouldn’t try to prevent bad things which are pretty close to the actual bad things we’re worried about”. All that said, I’m sympathetic to researchers avoiding work on mitigating misuse or smaller-than-full-takeover harms due to this motivation. (I still think AI companies should have an obligation to make it very clear what risks they are imposing and to try to mitigate risks at multiple scales, but this obligation doesn’t apply to researchers. If it is hard to hire good people to work on misuse, try paying more.)
When we have controlled unaligned AIs doing economically useful work, they probably won’t be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs.
As in, your view is that:
Greatly accelerating all alignment work which isn’t somewhat philosophically/conceptually confusing won’t be very useful for solving alignment. (Because of conceptual bottlenecks.)
Philosophy/conceptual work is sufficiently hard to check when an AI is trying to screw you over that control can’t force scheming AIs to notably help with this work. (Note that even if evaluation in this regime is much harder than generation, that isn’t clearly a blocker, because we potentially only have to supervise a small subset of outputs.) (Note that this is distinct from the view that AIs won’t be capable enough to help with this work, as is currently the case.)
I think there is a lot of very helpful alignment work which isn’t conceptually bottlenecked, and agendas with fewer conceptual barriers could go very far.
Further, I’m skeptical about evaluation being so hard.
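As a toy illustration of the “small subset” point from the parenthetical above (the specific numbers here are mine, made up purely for illustration): suppose we carefully audit a random fraction $a$ of an AI’s outputs, and a scheming AI has to sabotage a fraction $f$ of its outputs to meaningfully derail the work. Then the chance that none of $n$ outputs is both sabotaged and audited is

$$(1 - af)^n,$$

which shrinks quickly with $n$. For example, with $a = 0.02$, $f = 0.2$, and $n = 1000$, the AI avoids ever having a sabotaged output audited with probability $(1 - 0.004)^{1000} \approx 2\%$. This assumes sabotage has to be reasonably frequent to matter and that careful audits actually catch it when they see it, but it’s why expensive evaluation only needing to cover a small sample isn’t clearly a blocker.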
Beyond that, I’m only imagining part of the theory of change for control work being to “solve alignment” or work on alignment.
Alternatives:
Buy time by both making it cheaper for actors to stall and making risks better understood[1].
Also buy time by pushing AI takeover later in time.
Work on alternative exit conditions like emulated minds (difficult but maybe doable).
Additional time with very powerful AIs seems useful for studying them even if we can’t get any acceleration from automation. Pausing at the brink is more difficult in terms of costs and coordination (as there will be more actors close to the frontier and stronger incentives to proceed), but it might also be radically easier to make the case for pausing then relative to today. So, it overall seems easier to arrange, and time at the brink also seems much more useful all around.
(I also think that pausing looks less good if it isn’t part of a well-implemented, thoughtful, and extremely high-effort approach which e.g. actively intervenes on hardware and is aggressive on international efforts. So, I’m not very into pausing “for the wrong reasons”.)
(Edited to add)
I’d like to argue that there is a lot of helpful stuff which isn’t conceptually bottlenecked.
Concretely, let’s imagine targeting the alignment of AI systems which are capable enough to dominate top human experts in speed and cost across all relevant domains (including philosophy/conceptual work). In particular, we want to align them well enough that we’re happy to totally defer to them on really tricky questions like “what should be our high-level approach for handling risks from further AI systems”[2]. I think this is a pretty good target for a rational actor, since you can then actually hand off decision-making to such AIs. (They might decide to work on alignment of smarter systems or pursue totally different routes.)
It seems difficult, though not impossible, that we could reasonably confidently (95%)[3] reach this target via very prosaic/empirical approaches using huge amounts of AI labor (e.g. a combination of testing things on huge numbers of model organisms, doing some high-level interp (both white-box and black-box) to get a bit of a sense of what is going on in AI systems in general, and verifying that AIs seem to do a good job in held-out conceptual domains where we happen to know the answers). And I think prosaic approaches with huge amounts of labor could also substantially increase the chances of hitting this target (e.g. a 4x risk reduction) even if we can’t hit this level of confidence.
This doesn’t really look like “solve alignment”, but in practice it reduces risk a bunch.
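To unpack the two numbers above (this is just restating them, not a separate estimate): writing $p$ for the baseline chance that handing off to these systems fails because of misalignment, the 95%-confidence version corresponds to pushing the absolute chance of missing the deferral target down to about $5\%$, while a “4x risk reduction” is a relative claim:

$$\frac{p_{\text{before}}}{p_{\text{after}}} = 4 \;\Rightarrow\; p_{\text{after}} = \frac{p}{4}.$$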
It’s also possible that better understanding of risks will indicate that proceeding to wildly superhuman AI immediately is fine, though I’m skeptical.
[2] The AIs might ask us questions or whatever to figure out our preferences.
[3] Obviously, 95% confidence is objectively unreasonable, but it is a notable improvement over doing nothing, particularly in worlds where we find evidence for misalignment.