Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won’t be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve the parts of the alignment problem that are easy to verify, but alignment as a whole will still be bottlenecked on the philosophically confusing, hard-to-verify parts.
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn’t.
So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I’m afraid that some people just don’t feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.
BTW I have similar objections to working on relatively easy (i.e., unscalable) forms of alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don’t think that’s a good plan, this plan seems even worse.
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
[...]
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs.
It currently seems unlikely to me that marginal AI control research I’m excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat economically relevant.
I can also imagine worlds where AIs are clear schemers that are clearly unproductive to use by default, and thus control increases deployment through both straightforward incentives and risk compensation (though this maybe requires the AIs to have made a mistake or to have somewhat strange preferences as they would probably prefer fooling our tests).
My understanding is that your concerns on downside are maybe some combination of:
1. AI control research will be economically very useful (e.g. accelerates AI by 6 months conditional on 5 year timelines).
2. Control will result in irrational risk-compensation-style actions, where reducing earlier risks provides false comfort about later risks (as control isn’t scalable), and this will ultimately make the situation worse.
3. AI control prevents earlier warning shots which would have created more will for costly actions that save us later. (I’m not sure if this is actually your concern, though some statements pattern matched to this.)
I’m very skeptical of (1) for marginal control research. It feels like this has to imply incredibly strong levels of inefficiency and usefulness of control research. Perhaps you’re imagining a smaller acceleration effect, but think the benefits are also tiny relative to this small acceleration?
I’m somewhat sympathetic to things being somewhat inefficient such that if the field of people who might do control research instead did capabilities, that would be a notable acceleration (maybe 5%), but it seems like control research should be a massive reduction in acceleration relative to this.
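As a back-of-the-envelope check on the two figures above, here is a minimal Python sketch. It only restates numbers already in this comment (6 months on a 5 year timeline, and the rough 5% guess for the whole field doing capabilities instead); the function and variable names are made up for illustration.

```python
def acceleration_fraction(months_saved: float, timeline_years: float) -> float:
    """Fraction of the timeline removed by a given speedup."""
    return months_saved / (timeline_years * 12)

# Example figure from concern (1): 6 months saved on a 5 year timeline.
claimed = acceleration_fraction(6, 5)

# Rough guess from the text for the whole control field doing capabilities instead.
field_doing_capabilities = 0.05

# How much larger the claimed effect is than the "everyone does capabilities" guess.
ratio = claimed / field_doing_capabilities
print(f"{claimed:.0%} vs {field_doing_capabilities:.0%}: {ratio:.1f}x")
```

Under these numbers, the example in (1) would have control research accelerate AI roughly twice as much as the same people doing capabilities work directly, which is why it seems to require very strong assumptions.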
On (2), this sounds like a weird intermediate level of competence where they are actually sensitive to full-on scheming risks at an earlier point (when control works), but end up thinking this will scale despite it being relatively obvious it won’t. I agree that control might end up being an excuse for scaling, but I don’t think that it will be a counterfactual excuse (some other excuse would have been found) and I think we should just make sure to argue on the merits at that point.
I do think that (2) points toward focusing on advancing evaluation methodology more than techniques such that it would be relatively clear when things fail if people are being unreasonable and it is relatively easy to call out unreasonableness.
I think not doing control due to (3) can put you in pretty scary situations: how can we know when you’ll get scary warning shots vs full AI takeover? Further, AIs presumably want to avoid warning shots, though it is unclear how much AIs will systematically want to advance the interests of other AIs. At a more basic level, I don’t love arguments of the form “actually, having bad things happen will actually be good, so we shouldn’t try to prevent bad things which are pretty close to the actual bad things we’re worried about”. All that said, I’m sympathetic to researchers avoiding work on mitigating misuse or smaller-than-full-takeover harms due to this motivation. (I still think AI companies should have an obligation to make it very clear what risks they are imposing and to try to mitigate risks at multiple scales, but this obligation doesn’t apply to researchers. If it is hard to hire good people to work on misuse, try paying more.)
When we have controlled unaligned AIs doing economically useful work, they probably won’t be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs.
As in, your view is that:
Greatly accelerating all alignment work which isn’t somewhat philosophically/conceptually confusing won’t be very useful for solving alignment. (Because of conceptual bottlenecks.)
Philosophy/conceptual work is sufficiently hard to check when an AI is trying to screw you over that control can’t force scheming AIs to notably help with this work. (Note that even if evaluation in this regime is much harder than generation, that isn’t clearly a blocker, because we potentially only have to supervise a small subset of outputs.) (Note that this is distinct from the view that AIs won’t be capable enough to help with this work, as is currently the case.)
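A toy sketch of the “small subset of outputs” point, under strong assumptions that are not from this discussion: each output is audited independently with probability `audit_fraction`, and an audit of a sabotaged output recognizes the sabotage with probability `catch_rate`. Both parameter names and the independence assumption are hypothetical, and `catch_rate` is exactly the quantity in dispute for conceptual work.

```python
def p_caught(audit_fraction: float, n_sabotaged: int, catch_rate: float = 1.0) -> float:
    """Probability that at least one sabotaged output is audited and recognized.

    Assumes independent random audits; this is an illustrative model only.
    """
    per_output = audit_fraction * catch_rate
    return 1 - (1 - per_output) ** n_sabotaged

# Auditing 2% of outputs: the chance of catching the AI grows with how often it sabotages.
for k in (1, 10, 100):
    print(k, round(p_caught(0.02, k), 3))
```

The sketch only shows the shape of the argument: sparse auditing can still be a meaningful deterrent if sabotage has to happen many times to matter and audits are at least somewhat reliable. If `catch_rate` is near zero, as the objection suggests it might be for philosophy, the probability stays near zero no matter how many outputs are sabotaged.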
I think there is a lot of very helpful alignment work which isn’t conceptually bottlenecked and agendas with fewer conceptual barriers could go very far.
Further, I’m skeptical about evaluation being so hard.
Beyond that, I’m only imagining part of the theory of change of control work being to “solve alignment” or work on alignment.
Alternatives:
Buy time by both making it cheaper for actors to stall and also making risks better understood[1].
Also buy time by pushing AI takeover later in time.
Work on alternative exit conditions like emulated minds (difficult but maybe doable).
Additional time with very powerful AIs seems useful for studying them even if we can’t get any acceleration from automation. Pausing at the brink is more difficult in terms of costs and coordination (as there will be more close actors and more incentive), but it might also be radically easier to make the case for pausing relative to today. So, it overall seems easier to arrange, and time at the brink also seems much more useful all around.
(I also think that pausing looks less good if it isn’t part of a well implemented, thoughtful and extremely high effort approach which e.g. actively intervenes with hardware and is aggressive on international efforts. So, I’m not very into pausing “for the wrong reasons”.)
(Edited to add)
I’d like to argue that there is a lot of helpful stuff which isn’t conceptually bottlenecked.
Concretely, let’s imagine targeting aligning AI systems which are capable enough to dominate top human experts in speed and cost across all relevant domains (including philosophy/conceptual work). In particular, we want to align them well enough that we’re happy to totally defer to them on really tricky questions like “what should be our high level approach for handling risks for further AI systems”[2]. I think this is a pretty good target for a rational actor: then you can actually hand off decision making to such AIs. (They might decide to work on alignment of smarter systems or pursue totally different routes.)
It seems difficult though not impossible that we could reasonably confidently (95%)[3] reach this target via very prosaic/empirical approaches using huge amounts of AI labor (e.g. a combination of testing things on huge numbers of model organisms, doing some high level interp (both white and black box) to get a bit of a sense for what is going on in AI systems in general, and verifying that AIs seem to do a good job in held out conceptual domains where we happen to know the answers). And, I think prosaic approaches with huge amounts of labor could also substantially increase chances of hitting this target (e.g. 4x risk reduction) even if we can’t hit this level of confidence.
This doesn’t really look like “solve alignment”, but in practice it reduces risk a bunch.
[1] It’s also possible that better understanding of risks will indicate that proceeding to wildly superhuman AI immediately is fine, though I’m skeptical.
[3] Obviously, 95% confidence is objectively unreasonable, but it is a notable improvement over doing nothing, particularly in worlds where we find evidence for misalignment.
My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the “long reflection”, and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.
I separately have hope we can solve “the entire problem” at some point, e.g. through ARC’s agenda (which I spend most of my time trying to derisk and advance).
People interested in a discussion about control with someone who is maybe closer to Wei Dai’s perspective might be interested in my dialogue with habryka.
I think if the first powerful unaligned AI remained under control instead of escaping, it might make a good difference, because we can engineer and test alignment ideas on it, rather than develop alignment ideas on an unknown future AI. This assumes at least some instances of it do not hide their misalignment very well.
I think a key difference is I do believe the technical alignment/control problem as defined essentially requires no philosophical progress or solving philosophical problems like the hard problem of consciousness, and I believe the reason for this comes down to both a general point and a specific point.
In general, one of the reasons I believe philosophy tends not to be a productive area compared to other branches of science is that philosophers usually either try to solve problems that have by now essentially been proven intractable, or try to solve a problem in far too much generality without doing any experiments, and that’s when they aren’t straight up solving fictional problems (I believe a whole lot of possible-world philosophizing is in that category).
This is generally because philosophers do far too much back-chaining compared to front-chaining on a lot of problems.
For the specific point of alignment/control agendas, it’s because the problem of AI alignment isn’t a problem about what goals you should assign the AI, but rather whether you can put goals into the AI system such that the AI will reliably follow your goals at all.
[2] The AIs might ask us questions or whatever to figure out our preferences.