Protecting against sudden capability jumps during training
TL;DR: If an LLM has a sudden capabilities jump and is tested using a diverse test suite (e.g. one where it can execute code or present lots of text to human observers), it might be able to escape mid-training. In order to prevent this, I propose a phased testing approach, where the model is first tested for superhuman capabilities with an extremely constrained test suite before it’s tested with the broader test suite.
The failure mode
Let’s imagine that GPT-6 suddenly gains vastly superhuman hacking and manipulation skills. During training, some standard tests are performed to measure its performance on various benchmarks. Some of the tests involve executing code on virtual machines; others involve generating text which humans inspect. In total, thousands of queries are made to the model. The model finds a way to escape by taking advantage of vulnerabilities in the testing setup (e.g. by blackmailing a human evaluator or by finding an exploit in a virtual machine) before the researchers even notice that it’s superintelligent.
This highlights a failure mode I’m concerned about: a sudden capabilities jump leading to a misaligned and powerful AI which, when tested with a diverse test suite, finds a way to escape before the lab notices how capable it is.
The primary way this goes wrong is that the AI has lots of surface area with the real world – it has ways to influence the humans and devices it interacts with. The more a model is allowed to influence the world (whether that be through writing text which gets inspected by humans, or running code on virtual machines) the easier it is for the model to take over. I would imagine that it’s extremely hard for a model to take over if it’s only ever allowed to predict a sequence of 10 tokens and nothing else.
Ideally, the lab would have some way to notice that the model had a sharp capabilities jump without giving it many opportunities to take over. In case they were suddenly and unexpectedly dealing with a superintelligence, hopefully they would be wise enough to enter a pause until they’re confident that they can proceed safely.
Phased testing
I think that phased testing should be used during frontier model training runs. By this, I mean a testing approach which starts with extremely low-surface-area tests and gradually increases the surface area. This makes it easier to notice sudden capability gains while decreasing the likelihood that the model takes over.
Similar practices exist in other industries, where new tools are tested in environments where their damage can be limited, and then progressively rolled out to less limited environments.
In practice, this might look like:
Start off with a set of tests which minimizes both the number of tokens that the model is allowed to predict, and the set of tokens it’s allowed to sample from.
For instance, it could be a set of 100 difficult multiple-choice questions, where the model can literally only output one of {A,B,C,D} as its answer, and all other token probabilities are set to 0.
If the performance is high enough to scare the lab, they can pause development before the model is able to do any damage.
If the performance is not high enough to scare the lab, they can repeat using a slightly less constrained test suite.
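To make the most constrained phase concrete, here is a minimal sketch of restricting the model to one of {A,B,C,D}. The `logits` mapping from token string to raw score is a hypothetical interface, not any particular library's API. Picking the highest-scoring allowed token is equivalent to setting every other token's probability to 0 and decoding greedily.

```python
ALLOWED = ("A", "B", "C", "D")  # the only answers the model may emit

def constrained_answer(logits: dict[str, float]) -> str:
    """Return the most likely allowed answer letter and nothing else.

    `logits` is a hypothetical token -> raw score mapping. Tokens outside
    ALLOWED are ignored, which has the same effect as zeroing their
    probability before sampling.
    """
    kept = {tok: logits[tok] for tok in ALLOWED if tok in logits}
    if not kept:
        raise ValueError("no allowed answer token present in logits")
    # Greedy choice among the allowed tokens only.
    return max(kept, key=kept.get)
```

For example, `constrained_answer({"A": 1.0, "B": 3.0, "C": 0.5, "D": -1.0, "<other>": 10.0})` returns `"B"` even though another token scored higher. The function deliberately returns only the letter, so no raw numbers from the model reach the rest of the test harness.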
This might not even require that much additional work, as simply reordering already existing tests and adding alerts for high test results probably gets most of the way there.
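As a rough illustration of "reordering existing tests and adding alerts", the sketch below sorts test phases by surface area and stops at the first alarming score. The `Phase` fields, the numeric `surface_area` rank, and the thresholds are all assumptions made for illustration; a real lab would have to decide how to rank tests and what score counts as scary.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    name: str
    surface_area: int         # hypothetical rank: lower means more constrained
    alert_threshold: float    # score at or above which the lab pauses
    run: Callable[[], float]  # runs the tests and returns a score in [0, 1]

def run_phased(phases: list[Phase]) -> tuple[str, list[str]]:
    """Run phases from lowest to highest surface area, stopping at the first alert."""
    completed = []
    for phase in sorted(phases, key=lambda p: p.surface_area):
        score = phase.run()
        completed.append(phase.name)
        if score >= phase.alert_threshold:
            # Broader phases are never executed once an alert fires.
            return "pause", completed
    return "continue", completed
```

The key property is that a high score on a constrained phase prevents the broader phases (code execution, human-read text) from running at all.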
Note that this does not work for superhuman LLMs which pretend not to be superhuman. If such an LLM were to be tested in phases, it might know to act less capable than it actually is. If the “no sandbagging on checkable tasks” hypothesis is true, then I’m not very worried about this, as we could just use existing techniques to elicit superintelligent capabilities.
Also note that this does not work for vastly superhuman LLMs, as they might be able to escape even during or before an extremely constrained test. There might be galaxy-brained tricks like writing code injections inside the logit output numbers which would be extremely hard to predict and protect against.
Thanks for writing this!
I actually think the proposal is more general than just for preventing AI escapes during diverse evals—you want to start with low surface area tests because they’re cheaper anyways, and you can use the performance on low surface area tests to decide if you want to do more thorough evals.
I imagine a proper approach is something like:
1. Very cheap qualification evals, such that if the model can’t pass them (and isn’t sandbagging) it is very unlikely to have had a capabilities jump.
2. Very low-surface-area safety evals: do the model’s capabilities imply that it’s not safe to do the thorough evals?
3. Very thorough evals for measurement: what it says on the tin.
You’d want to run 1 + 2 all the time, and begin running 3 once the model passes 1 until it passes 2.
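One way to read that schedule, as a sketch: the function and eval names below are hypothetical, and "passes 2" is interpreted here as the low-surface-area safety evals flagging the model as too capable for thorough evals.

```python
def evals_to_run(passed_qualification: bool, tripped_safety: bool) -> list[str]:
    """Decide which eval tiers to run at a given checkpoint.

    Assumption: `tripped_safety` is True when the low-surface-area safety
    evals indicate that thorough evals are no longer safe to run.
    """
    evals = ["qualification", "low_surface_safety"]  # always run
    if passed_qualification and not tripped_safety:
        evals.append("thorough")
    return evals
```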
I think this point is completely correct right now but will become less correct in the future, as some measures to lower a model’s surface area might be quite costly to implement. I’m mostly thinking of “AI boxing” measures here, like using a Faraday-caged cluster, doing a bunch of monitoring, and minimizing direct human contact with the model.