If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.
I like this direction and this write-up of it!
Let’s assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thinking, it could use that to do notably more serial thinking about how to exploit the current situation than any human has done (because some important features of the current situation are less than 3 months old: e.g. you had to switch to an importantly different programming language, or some different type of communication protocol between AIs, or change your AI safety research paradigm in a way you didn’t expect). I’m curious how you’d ensure (or evaluate for) red-team competitiveness in this case.
This is indeed a serious concern.
A key hope here is that, with some amount of conservativeness, our evaluation[1] and our protocol can generalize to future setups. In particular, our countermeasures could be sufficiently generic that they would work in new situations. And, in our control evaluation, we could test a variety of situations that we might end up in later (even if they aren’t the current situation). So, even if the red team doesn’t have enough time to think about the exact situation we end up in, we might be able to verify that the protocol generalizes to different future contexts.
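As a toy illustration of what "test a variety of situations" could look like numerically, here is a minimal sketch. Everything in it is hypothetical: the `ScenarioResult` record, the idea of counting red-team attempts and successes per scenario, and the choice of a Wilson score upper bound are assumptions made for illustration, not a description of how we actually run control evaluations.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Hypothetical record of red-team results in one simulated situation."""
    name: str        # label for the situation (e.g. a new comms protocol)
    attempts: int    # red-team attack attempts made in this situation
    successes: int   # attempts that got past the protocol


def wilson_upper_bound(successes: int, attempts: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for a binomial proportion."""
    if attempts == 0:
        return 1.0  # no evidence about this situation: assume the worst
    p = successes / attempts
    centre = p + z * z / (2 * attempts)
    margin = z * math.sqrt(p * (1 - p) / attempts + z * z / (4 * attempts ** 2))
    return min(1.0, (centre + margin) / (1 + z * z / attempts))


def worst_case_bound(results: list[ScenarioResult]) -> float:
    """Conservative summary: the largest per-scenario upper bound."""
    return max(wilson_upper_bound(r.successes, r.attempts) for r in results)
```

Taking the maximum over situations rather than the average is one way to encode the conservativeness mentioned above: the protocol only looks good if it holds up in every situation tested, including ones the red team had little time to study.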
Alternatively, we can at least try to define operating conditions under which the situation is safe without further evaluations or reasoning.
For further discussion on this topic, see here. (This link is to a section of a draft version of a doc which we’ve made public. We’ll post a final version of this doc at some point.)
Overall, we don’t think the interventions described here make the situation clearly fine, but they seem likely to suffice for substantial reductions in risk given a large effort. So, the situation is indeed considerably worse if AIs can do a huge amount of uninterpretable serial reasoning.
(And our overall view on control is also something like “we don’t think control alone makes the situation overall clearly fine, but control seems likely to suffice for substantial reductions in risk given a large effort”. And there are many good objections to control (like this one), and our response to these good objections is typically “we have some countermeasures to this issue, these countermeasures make the situation considerably better than you might have thought, but they don’t fully eliminate the problem”.)
[1] The red teaming in particular.