I think that phased testing should be used during frontier model training runs. By this, I mean a testing approach that starts with extremely low-surface-area tests and gradually increases the surface area. This makes it easy to notice sudden capability gains while decreasing the likelihood that the model takes over.
Thanks for writing this!
I actually think the proposal is more general than just preventing AI escapes during diverse evals: you want to start with low-surface-area tests because they're cheaper anyway, and you can use performance on the low-surface-area tests to decide whether to run the more thorough evals.
I imagine a proper approach is something like:
1. Very cheap qualification evals, such that if the model can't pass them (and isn't sandbagging) it is very unlikely to have had a capabilities jump.
2. Very low-surface-area safety evals: do the model's capabilities imply that it's not safe to run the thorough evals?
3. Very thorough evals for measurement: what it says on the tin.
You'd want to run 1 and 2 all the time, and run 3 only in the window after the model passes 1 and before it passes 2; a rough sketch of that gating rule is below.
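A minimal sketch of that gating rule, assuming each eval suite is wrapped in a simple callable; all names and signatures here are hypothetical, not any real eval framework's API:

```python
from typing import Callable

def decide_evals(
    passes_qualification: Callable[[], bool],    # (1) very cheap qualification evals
    trips_safety_threshold: Callable[[], bool],  # (2) low-surface-area safety evals
    run_thorough_evals: Callable[[], None],      # (3) thorough measurement evals
) -> str:
    """Run (1) and (2) on every checkpoint; run (3) only in the window
    after the model passes (1) and before it trips (2)."""
    # (1) and (2) are cheap enough to run on every checkpoint.
    jumped = passes_qualification()
    unsafe = trips_safety_threshold()
    if unsafe:
        # Capabilities suggest the thorough evals are no longer safe to run.
        return "halt-and-escalate"
    if not jumped:
        # No sign of a capabilities jump, so skip the expensive evals.
        return "skip"
    # In the window: the model passed (1) but hasn't tripped (2).
    run_thorough_evals()
    return "measured"

# Example usage with trivial stand-ins for the three eval suites:
print(decide_evals(
    passes_qualification=lambda: True,
    trips_safety_threshold=lambda: False,
    run_thorough_evals=lambda: None,
))  # -> "measured"
```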
I think this point is completely correct right now but will become less correct in the future, as some measures to lower a model’s surface area might be quite costly to implement. I’m mostly thinking of “AI boxing” measures here, like using a Faraday-caged cluster, doing a bunch of monitoring, and minimizing direct human contact with the model.