Thanks for writing this. I think it does a good job laying out the assumptions needed for control.
It seems like it could be challenging to justify that a red team is competitive -- especially if you are aiming to extract massive amounts of work from AI systems.
For example, if your AI system is like a human sw engineer but 30X faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it’s not very clear to me that externalized reasoning can scale to domains where AI systems are doing a bunch of research that humans (and GPT-4) can’t readily disentangle and understand. I’m imagining that chain of thought will amount to thousands of pages of technical content that is written like rambly google docs. It might be hard to pick out the AI safety research from the cyberoffense planning in this case -- especially with dumb models like GPT-4.
It seems like to the extent you want to use AI systems for work that is hard to evaluate, control measures are difficult to trust. Maybe a key crux is whether AI systems will have to do work that is hard to evaluate to yield a 30x acceleration factor. I’d imagine this would apply to AI systems achieving fundamental research advances -- e.g. in interpretability, formal verification, etc.
Curious about your thoughts.
Justifying control vs ensuring control
First, here’s a general point which is relevant in responding to your comment.
There are different types of failures related to control:
1. We realize we can’t reliably test and ensure control. This could be either:
   a. We realize that the necessary assumptions for control evaluations to work aren’t met. In particular, we can’t ensure the red team is sufficiently competitive and thus we can’t trust control evaluations.
   b. We run control evaluations and we aren’t able to develop sufficient countermeasures to ensure safety as measured by these evaluations.
2. We think the situation is fine based on control evaluations, but it’s actually not, and AIs succeed at egregiously bad actions.
Failure (1) is plausibly less bad than failure (2). And, (1.a) looks different than (1.b).
(See here for more discussion of different failure modes, but note that this is still an early version which is subject to change.)
So for failure (1) there is an important distinction between justifying/determining that a red team is competitive (1.a) and ensuring that a red team is competitive (1.b).
We might be able to have reasonable-ish but somewhat approximate tests to get a rough sense of red team competitiveness and control evaluation viability more generally.
These tests are things like:
How much problematic planning can AIs do without this being detected in practice?
Is it likely that AIs can sandbag on our capability evaluations? More specifically, on which evaluations can they sandbag and to what extent?
Do AIs have qualitatively wildly superhuman abilities in some contexts to an extent that we should think that they likely have such abilities in other cases?
(A forthcoming post of ours will describe these tests in more detail. An early version of this post is here, but it doesn’t describe these tests in sufficient detail yet and our thinking is still evolving.)
Some of these tests look like standard control red teaming + capability evaluations (can the AI do problematic planning?), while other tests look more ad hoc and might use things like model organisms, slightly related capability evaluations, and extrapolation-based measurements. (In particular, testing for the ability to sandbag probably looks like this; we have some forthcoming work demonstrating this methodology.)
Note that we expect that these “ad hoc” tests are probably much less precise than the core capability evaluations of control. But this might be fine because in many cases (e.g. sandbagging), I expect we’ll be able to have large safety margins.
Overall, I think we have some (IMO pretty reasonable) hopes for at least being able to know when our red team won’t realistically be competitive, which will help substantially with avoiding failure (1). But, I can totally understand people being skeptical based on what we’ve legibly written up so far.
More specific responses
Here’s a summary of my understanding of your comment:
It could be difficult to justify (or to ensure?) that a red team is competitive. One specific case is that it could be hard to avoid AIs being able to come up with effectively superhuman plans due to total thinking time.
This could apply even with heavily externalized reasoning: perhaps this externalized reasoning amounts to a huge number of words overall.
Also, it’s not clear that externalized reasoning is viable, which might cause further issues. (I think this was something you were commenting on? I’m unsure?)
Hard-to-evaluate work could break control (and maybe also break control evaluations?).
On avoiding effectively superhuman plans
As far as total thinking time resulting in superhuman plans which break red team competitiveness, let’s first consider the case where there are absolutely no natural language bottlenecks (the AI isn’t forced to regularly bottleneck its serial reasoning in natural language (externalized reasoning)). I address this in my earlier comment responding to Lukas. Broadly, our hope will be to restrict the effective thinking time between natural language bottlenecks to a degree where maintaining competitiveness is doable.
(Also note that if SGD is decently sample efficient (which could be false!), AIs probably can’t spend a high fraction of their cognitive resources in expectation thinking of plans for causing bad outcomes: this will be trained out. They can potentially rarely spend a high fraction of resources and/or often spend a small/moderate fraction of resources.)
Now, suppose that we do have natural language bottlenecks. One concern you raised is that this will look like a huge number of words of externalized reasoning and thus it will be hard to detect malign planning. We could hope to use GPT-4 (or our best trusted model), but this model might be too dumb. (Separately, I’m reasonably optimistic about finetuned versions of GPT-4-level models being decent at detecting problematic reasoning. Mechanisms like untrusted monitoring could also help (discussed in our recent paper).)
Our hope for ruling out the AI often or always doing some reasoning about malign plans in externalized reasoning will be to randomly select traces of reasoning for detailed human review.
(I’m going to rehash some arguments we make in concentrated failures vs auditing failures, so readers should consider looking at that post for more context.)
Suppose we select 1% of reasoning “chunks” for review and our goal is to hit a 30x overall speedup over our researchers. Suppose each chunk is the value-weighted equivalent of X hours of human researcher work. Then, the human reviewer needs to review each selected chunk in less than ~3X hours (assuming this review accounts for the vast majority of our human labor budget).
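To make this arithmetic concrete, here’s a minimal back-of-envelope sketch (the 1% review rate, the 30x target, and the assumption that review dominates the human labor budget are taken from the paragraph above; the chunk-sampling helper is purely illustrative, not a real implementation):

```python
import random

# Back-of-envelope sketch of the auditing arithmetic described above.
REVIEW_FRACTION = 0.01   # fraction of reasoning chunks sampled for human review
TARGET_SPEEDUP = 30      # desired overall acceleration over human researchers
CHUNK_VALUE_HOURS = 1.0  # "X": human-equivalent hours of work per chunk

# If review is (almost) all of the human labor, then per chunk produced we can
# afford at most CHUNK_VALUE_HOURS / TARGET_SPEEDUP hours of expected human time.
# Only REVIEW_FRACTION of chunks are actually reviewed, so the budget per
# *reviewed* chunk is:
review_budget_per_chunk = CHUNK_VALUE_HOURS / (TARGET_SPEEDUP * REVIEW_FRACTION)
print(f"Review budget per audited chunk: ~{review_budget_per_chunk:.1f} x X hours")
# -> ~3.3 x X hours, matching the "~3X hours" figure above.

def select_chunks_for_audit(chunk_ids, fraction=REVIEW_FRACTION, seed=0):
    """Uniformly sample a fraction of reasoning chunks for detailed human review."""
    rng = random.Random(seed)
    k = max(1, int(len(chunk_ids) * fraction))
    return rng.sample(chunk_ids, k)

audited = select_chunks_for_audit(list(range(10_000)))
print(f"Auditing {len(audited)} of 10,000 chunks")
```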
One relevant concern here is that the AI will have way more context on the chunk of reasoning, such that a human reviewer will have a very hard time understanding it. There are a variety of approaches that could help mitigate this: make chunks bigger to amortize the time the human spends obtaining context, have humans specialize, force AIs to operate in a more legible fashion, and use AI assistance or other education/tools to speed up the human.
Note that this only rules out the AI often or always doing this malign natural language reasoning. The AI could still get away with rarely doing this (or doing this in a concentrated period of time). (See concentrated failures vs auditing failures for a more detailed description of concentrated failures.)
But, if the AI can only do this reasoning rarely, then it won’t be able to think carefully about when it should be doing this reasoning. This makes it much harder for AIs to use complex strategies which involve specific opportunities or which involve coordination.
(We might worry about reasoning which is only done once but which is seen by a huge number of instances; our hope here will be to review anything which is seen by a huge number of instances. This should be ~viable as less than 1% of things will need to be seen by a huge number of instances.)
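As a rough illustration of this “rarely vs. often” distinction, here is the basic probability calculation under the same hypothetical 1% random review rate used above (this ignores trusted/untrusted monitoring, which would only make detection more likely):

```python
# If each reasoning chunk is independently audited with probability p, and the
# AI inserts malign planning into k chunks, then the chance that at least one
# malign chunk gets audited is 1 - (1 - p)^k.

p = 0.01  # hypothetical 1% human review rate, as in the example above

for k in [1, 10, 50, 100, 300]:
    p_caught = 1 - (1 - p) ** k
    print(f"malign chunks: {k:3d}  ->  P(audited at least once) ~= {p_caught:.0%}")

# Roughly: k=1 -> 1%, k=10 -> 10%, k=50 -> 39%, k=100 -> 63%, k=300 -> 95%.
# So purely random review makes frequent malign reasoning very risky for the AI,
# while rare or concentrated malign reasoning requires the other measures discussed.
```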
As part of our control evaluation, we would need to red team these countermeasures and try to verify that the AI can’t spend sufficiently long planning that we’re in trouble.
On hard to evaluate work
First, note that hard to evaluate work doesn’t seem to break control evaluations. See “Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy” and the “Evaluating for the occurrence of proxy bad outcomes” section of a longer (draft) post on making safety cases with control for more discussion. (In the terminology discussed here, imagine testing out control on string theory and using incorrect string theory results as the proxy bad outcome.) I could try to explain this if there was sufficient interest.
Now, hard to evaluate work can indeed make control difficult. At least, it makes it difficult to ensure high levels of usefulness in the case where models are scheming, though maintaining direct safety might not be that much harder.
I think one relevant decomposition is:
Can humans evaluate the work of adversarial humans in the key areas we care about? (Let’s suppose that all written artifacts are preserved, e.g. the full trace of all of the work, including all intermediate notes and code. We won’t force externalized reasoning for our humans.) Note that this allows for interventions like forcing the work to appear legible, etc. I can discuss a concrete operationalization for this overall experiment.
Will AI work be importantly harder to evaluate than the above human case?
Personally, I feel like work in key areas is reasonably doable to evaluate when I think through specific cases.
Thanks, I’ll summarize my takeaways in the form of an argument that control is sufficient to safely accelerate alignment research by 30x. This is certainly not the only argument one could make, but I like it because it is clear and concrete.
I’ll specifically focus on justifying that the red team is competitive.
If your AI systems can accelerate AI safety research by 30x, then they are producing effectively superhuman outputs; however, this does not imply they can produce superhuman evasion plans because AI systems can’t spend as much of their thinking on how to evade control measures as they can on safety research.
Suppose that the AI systems have approximately human capabilities and are sped up by 30x.
Suppose also that the human red team has spent 2 months identifying evasion strategies (call this ‘coup research’, following your control paper).
I’ll argue that the AI systems will not be able to do more than the equivalent of 2 months of coup research. Since our red team has lots of advantages over the AI systems, this is a conservative argument that the red team is competitive.
Suppose that we only want to justify that the red team is competitive for the next two months (and we’ll continuously search for attack strategies in order to maintain this 1 to 1 ratio).
Two months of deployment is equivalent to 5 years of human thinking since our AI system is sped up by 30x. We need to show that no more than 1 / 30th of this thinking is coup research.
Training algorithms might pressure AI systems to be at least somewhat efficient about doing useful work (and coup research isn’t useful for the task at hand). Perhaps this implies a 1 / N coup research budget.
Furthermore, suppose the AI systems’ reasoning is externalized. Monitoring setups are very likely to catch coup research if it is present at a frequency greater than 1 / M.
So the fraction of AI thinking spent on coup research is at most min(1/N, 1/M), and we need min(1/N, 1/M) ≤ 1/30.
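Here’s a quick numerical restatement of this argument, just to check the arithmetic (N and M are the placeholder bounds from the previous two points; the example values passed in at the end are arbitrary):

```python
# Quick numerical check of the argument above, using its stated assumptions:
# 2 months of deployment, a 30x speedup, and a red team that has spent
# 2 months on coup research.

speedup = 30
deployment_months = 2
red_team_coup_research_months = 2

total_ai_thinking_months = deployment_months * speedup  # 60 months ~= 5 years
max_coup_fraction = red_team_coup_research_months / total_ai_thinking_months
print(f"AI thinking during deployment: {total_ai_thinking_months / 12:.1f} years")
print(f"Allowed coup-research fraction: {max_coup_fraction:.4f} (= 1/30)")

def red_team_stays_competitive(N, M):
    """The argument's condition: coup research capped at min(1/N, 1/M) <= 1/30."""
    return min(1 / N, 1 / M) <= max_coup_fraction

print(red_team_stays_competitive(N=50, M=10))   # True: the training-pressure bound suffices
print(red_team_stays_competitive(N=10, M=100))  # True: the monitoring bound suffices
print(red_team_stays_competitive(N=10, M=10))   # False: neither bound is tight enough
```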
I think I buy this argument or some variation of it.
Good point. I think it’s helpful when people working on schemes with the rough flavor of “we do X, and then X helps us get to a useful AI that does not take over” try to specify roughly how capable they expect that “useful AI that does not take over” to be.
Would be curious to hear more about the kinds of tasks that Ryan and Buck expect the first “transformatively useful but still controllable” AI will be able to do (perhaps expressed in multiplier terms, like 30X human SW engineer, perhaps expressed in terms of the kinds of novel reasoning it can perform, and perhaps expressed in terms of the kinds of research tasks or policy tasks it would be able to meaningfully help with).