This is indeed a serious concern.
A key hope here is that, with some amount of conservativeness, our evaluation[1] and our protocol can generalize to future setups. In particular, our countermeasures could be sufficiently generic that they would work in new situations. And, in our control evaluation, we could test a variety of situations which we might end up in in the future (even if they aren’t the current situation). So, even if the red team doesn’t have enough time to think about the exact situation we end up in, we might be able to verify that the protocol generalizes to different contexts.
Alternatively, we can at least try to define operating conditions under which the situation is safe without further evaluations or reasoning.
For further discussion on this topic, see here. (This link is to a section of a draft version of a doc which we’ve made public. We’ll post a final version of this doc at some point.)
Overall, we don’t think the interventions I’ve described here make the situation clearly fine, but they seem likely to suffice for substantial reductions in risk given a large effort. So, the situation is indeed considerably worse if AIs can do a huge amount of uninterpretable serial reasoning.
(And our overall view on control is also something like “we don’t think control alone makes the situation clearly fine, but control seems likely to suffice for substantial reductions in risk given a large effort”. There are many good objections to control (like this one), and our response to these good objections is typically “we have some countermeasures to this issue, these countermeasures make the situation considerably better than you might have thought, but they don’t fully eliminate the problem”.)
[1] The red teaming in particular.