Alignment researchers should think hard about switching to working on AI Control
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50⁄50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
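To make the allocation rule concrete (a minimal restatement in symbols I'm introducing here, where $n_{\text{align}}$ and $n_{\text{control}}$ stand for researcher-hours in each area), the labor split is at equilibrium when the marginal reductions in risk are equal:

$$\frac{\partial P(\text{doom})}{\partial n_{\text{align}}} \;=\; \frac{\partial P(\text{doom})}{\partial n_{\text{control}}}$$

If the split satisfying this is roughly 50⁄50 across readers of this post, and nearly all of those readers currently work on alignment, then roughly half of them would do more good by switching, which is roughly where the ~45% figure above comes from.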
For this post, my definitions are roughly:
AI alignment is the task of ensuring the AIs “do what you want them to do”
AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research.)
Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”, if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don’t update about what my words imply about my beliefs.):
Everything is in degrees. We can “partially align” some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is “how aligned are the AIs” + “how aligned do they need to be to use them for stuff”, so we should also focus on the 2nd thing.
If you were a hedge fund, and your strategy for preventing people from stealing your data and starting a new hedge fund was “we will make the hedge fund a super fun place to work and interview people carefully to make sure they really care about working for us”, and you also did not do things like “maybe people shouldn’t be able to download all the trading strategies onto a USB stick which we let them take home”, “maybe people should be forced to keep their work computers in the office” and “maybe we should have some people checking if other employees are trying to steal data and make some plans about how to stop them from starting a rival hedge fund if they do”, then all the other hedge funds would think you were an idiot. You should obviously try both to check if people want to defect, and also make it harder for them to defect.
I think a lot of the neglect of AI control can be traced to the “pivotal act” based framings of how to make aligned AI go well. I think these views are mistaken, but I won’t provide a great argument here (quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions result in a large change, no individual action needs to be “pivotal”). However, I don’t think things are shaping up for single pivotal acts to be very important, and being able to productively use early AI labor to help with alignment seems like it is going to be an important part of the future going well.
Working on AI control is really tractable compared to working on alignment. There is a lot of prior art in people thinking about how to e.g. prevent data from being stolen by insider threats, secure various information against adversaries, generally ensure that national security secrets don’t get leaked, etc. We can set up various plausible-looking scenarios and try to “game them out” under plausible-seeming assumptions about how much alignment we’ll be able to achieve (see the toy sketch after this list). We can make assumptions about AI capabilities and alignment without really getting into the nitty gritty of philosophically confusing topics like “what is deception really?” and “how likely is deception?” and “when will this all go down?”
Note that this means control might be easier for the AIs to automate, and easier to do “during crunch time.”
It also means that AI control is a bit contiguous with work people already do to protect from adversaries. I think the AI thing qualitatively changes the picture though, so it’s still worth thinking about.
If you succeed at AI control, there is a more compelling story you can tell for why you should implement the specific scheme that you thought of. In comparison, a lot of alignment proposals depend on philosophically dubious and confusing assumptions about the nature of cognition/SGD/etc. That is to say, a successful proposal in AI control comes with an argument that might be convincing to established security professionals if they thought about it, but there is likely no such class of professionals capable of lending your proposal external legitimacy in AI alignment.
The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
There is a bunch of research that is useful for both alignment and control, e.g. model organisms of deception, interpretability techniques, oversight techniques, etc. More people should analyze such research from the perspective of “how can this be helpful even if it fails to produce an aligned AI?”
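As a concrete illustration of the “game them out” point above, here is a minimal toy simulation of a control-style deployment. Every number and name in it is an illustrative assumption I'm making up for this sketch, not an estimate from any real evaluation: an untrusted model occasionally attempts sabotage, a trusted monitor flags suspicious actions, and a limited audit budget determines whether flagged attempts get caught before they succeed.

```python
import random

def simulate_deployment(
    n_tasks=10_000,      # tasks the untrusted model handles during deployment
    p_attempt=0.001,     # assumed chance it attempts sabotage on any given task
    monitor_tpr=0.9,     # assumed chance the trusted monitor flags a real attempt
    monitor_fpr=0.02,    # chance the monitor flags a benign task (wasting audits)
    audit_budget=200,    # how many flagged tasks humans can afford to audit
):
    """Toy model of one deployment: does any sabotage attempt slip past monitoring + audits?
    All parameters are illustrative assumptions, not empirical estimates."""
    audits_left = audit_budget
    for _ in range(n_tasks):
        attempt = random.random() < p_attempt
        flagged = random.random() < (monitor_tpr if attempt else monitor_fpr)
        if flagged and audits_left > 0:
            audits_left -= 1
            if attempt:
                return "caught"       # attempt audited and shut down
        elif attempt:
            return "catastrophe"      # attempt unflagged, or flagged with no audits left
    return "safe"

# Monte Carlo over many simulated deployments to estimate outcome frequencies.
trials = 2_000
counts = {"safe": 0, "caught": 0, "catastrophe": 0}
for _ in range(trials):
    counts[simulate_deployment()] += 1
print({k: round(v / trials, 3) for k, v in counts.items()})
```

Varying p_attempt or monitor_tpr in a sketch like this shows how much safety you buy from better monitoring versus better alignment, which is exactly the kind of question this framing makes tractable without settling any philosophically confusing ones.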
This topic is important enough that you could consider making a full post.
My belief is that this would improve reach, and also make it easier for people to reference your arguments.
Consider: you believe there is a ~45% chance that any given alignment researcher would be better off pivoting to control research. I don't expect a quick take to reach anywhere near that many of them, and it has a low chance of catalysing dramatic institutional-level change.
I directionally agree with this (and think it’s good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
“ensuring that if the AIs are not aligned [...], then you are still OK” (which I think is the main meaning of “AI control”)
Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood’s existing work on control).
I think 2. is arguably the most promising strategy for 1., but I’ve occasionally noticed myself conflating them more than I should.
1. gives you the naive 50⁄50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)
I also think (non-confidently) that 2. is what’s really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.
But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!
I agree and regret focusing as much as we did on 2 in the past; I’m excited for work on “white box control” (there’s some under way, and I’m excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)
I disagree with this take. A.I. control will only be important in a multipolar situation in which no single A.I. system can create a gray goo catastrophe etc. But if such pivotal acts are impossible and no singular A.I. takes control, but instead many A.I.s are competing, then some groups will develop better or worse control for economic reasons, and it won’t affect existential risk much to work on it now. I don’t think I can see a situation where control matters: only a few players have A.G.I. for a very long time and none escape or are open-sourced, but also none gain a decisive advantage?
I do see advantages to hardening important institutions against cyberattacks and increasing individual and group rationality so that humans remain agentic for as long as possible.
Hard to be sure without more detail, but your comment gives me the impression that you haven’t thought through the various different branches of how AI and geopolitics might go in the next 10 years.
I, for one, am pretty sure AI control and powerful narrow AI tools will both be pretty key for humanity surviving the next 10 years. I don’t expect us to have robustly solved ASI-alignment in that timeframe.
I agree with bullet points 1, 2, 3, 6 and 7, partially agree with bullet point 5, and disagree with bullet point 4.
Thus, I agree with the central claim here:
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50⁄50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
For more of my analysis on the bullet points, read the rest of the comment.
For bullet point 1, I basically agree, mostly because I disfavor binary assumptions and prefer continuous-quantity reasoning, which tends to be a better match for the real world and gives you more information than binary outcomes do.
I really like bullet point 2, and also think that even in a scenario where it’s easy to prevent defection, you should still have controls that leave defecting employees with much less reward and much more punishment for subversive actions.
I deeply agree with point 3, and I’d frame AI control in one of 2 ways:
As a replacement for the pivotal act concept.
As a pivotal act that doesn’t require destruction or death, and doesn’t require you to overthrow nations in your quest.
A nitpick: AI labor will be the huge majority of alignment progress in every stage, not just the early stage.
I think one big reason the pivotal act frame dominated a lot of discussions is the assumption that we would get a pure software singularity which would FOOM in several weeks, but reality is shaping up to not be a pure software singularity, since physical stuff like robotics and data centers still matters.
There’s a reason every hyperscaler is trying to secure large amounts of power and datacenter compute contracts: they realize the singularity is currently bottlenecked on power and, to a lesser extent, compute.
I disagree with 4, but that’s because of my views on alignment: I see it as a significantly easier problem than the median LWer does, and in particular I see essentially zero need for philosophical deconfusion to make the future go well.
I agree that AI control enhances alignment arguments universally, and provides more compelling stories. I disagree with the assumption that all alignment plans depend on dubious philosophical assumptions about the nature of cognition/SGD.
I definitely agree with bullet point 6 that superhuman savant AI could well play a big part in the intelligence explosion, and I believe this most for formal math theorem provers/AI coders.
Agree with bullet point 7, and think it would definitely be helpful if people focused more on “how can this be helpful even if it fails to produce an aligned AI?”
I feel like our viewpoints have converged a lot over the past couple years Noosphere. Which I suppose makes sense, since we’ve both been updating on similar evidence!
The one point I’d disagree with, although also wanting to point out that the disagreement seems irrelevant to short term strategy, is that I do think that philosophy and figuring out values is going to be pretty key in getting from a place of “shaky temporary safety” to a place of “long-term stable safety”.
But I think our views on the sensible next steps to get to that initial at-least-reasonable-safety sound quite similar.
Since I’m pretty sure we’re currently in a quite fragile place as a species, I think it’s worth putting off thinking about long term safety (decades) to focus on short/medium term safety (months/years).