I think those pessimistic about control evals could say something like “the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team.” Of course, you note in the post some reasons why we should expect our red-team to have advantages over models, but you also recognize that this won’t scale toward arbitrarily powerful AIs.
A key hope here is that we can use various tests to get a general sense for whether AIs can importantly outcompete our red team. See this other comment for more discussion.
So, here’s how I would hope the actual situation with Alice and Bob goes:
Alice from the Scaling Department: Look, our current control techniques are working for GPT-5. So let’s scale up more! I get that at some point our control techniques aren’t going to be able to control arbitrarily powerful AI, but we’re not there yet, so we should be able to keep going.
Bob from the Control Red Team: Indeed, our current control techniques are very likely to be working. Our protocol suffices for controlling current models on the currently approved applications assuming we remain vigilant and maintain the necessary invariants (our design basis).
Part of why we think control evaluations currently work for assessing safety is that we performed a variety of capability evaluations which indicate that, as long as we prevent steganography in some key places, AIs are pretty unlikely to think of better attack plans than we thought of as the red team. This involves looking at how good AIs are at generating strategies in a bunch of different contexts and then doing some analysis which lets us transfer this to an overall conservative guess about the AI’s ability to create good attack plans against our actual protocol (but we can’t necessarily produce the AI’s best guesses at attack plans; we’re relying on capability evaluation being easier than elicitation, as discussed in footnote 8). We also have rough and conservative tests for the extent to which AIs can sandbag on our current evaluations, and that seems to be fine for now.
From our current rough extrapolations, the first place where we’ll start running into issues with control as we build more powerful models is that we’ll need to start imposing heavy restrictions on AI affordances in domains XYZ, which will really cut into the usefulness (relative to completely YOLO-ing it). The blue team is hoping that they can finish developing some countermeasures that will handle this problem, though it’s unclear when these will be ready. After that, it looks like exploration hacking concerns will force us to be much more conservative in the evaluation in a few places, which will probably put a considerable dent in usefulness in a few other domains. Shortly after that, building more powerful models probably won’t be worth it because the costs associated with training and sufficient containment will overwhelm the marginal improvements in usefulness we can get (given our current safety targets); the AI will only be usable in extremely restricted domains like submitting formally verified code or doing narrow mathematics. Of course, our larger-scale extrapolations are crude and we’ll have a much better sense of the overall situation once we run our tests on the next few generations of models.
Alice from the Scaling Department: OK, well let me know when these evaluations start indicating that training further models isn’t worth it.
(Ideal) Alice from the Scaling Department: We’re already starting to pivot into making easier-to-control AI architectures more scalable and performant (easier to restrict and more bottlenecked on natural language reasoning). We’re also working on making inference cheaper at a fixed level of capability so that in some cases we can run dumber models for longer rather than using smarter models. This seems like it will be safer based on our current control evaluations. Of course, we’re worried about the potential overhang due to this algorithmic progress, and we’re pretty confident that competent adversaries couldn’t steal it.