Closed-Source Evaluations
Public tripwires are no tripwires.
I’m writing a quick and dirty post because the alternative is waiting for months and maybe never writing it at all. I am broadly familiar with the state of interpretability research, but I do not know the current state of model evaluations.
The interpretability win screen comes after the game over.
There is an interpretability-enabled win condition at the point where billion-parameter networks become transparent enough that we can robustly detect deception.
We are paperclipped long before that, since interpretability’s adjacent insights predictably lead to quicker iteration cycles, new architecture discoveries, insight into strategic planning, and other capabilities AGI labs are looking for.
Tripwires stop working when you train models to avoid them.
Current interpretability research (as far as I’m aware) only produces tools that melt under slight optimisation pressure. Optimising on an interpretability tool optimises against being interpretable.
It would be easy to fool others, and yourself, into thinking a model is safe because a non-optimisation-resistant interpretability tool said it was safe, after the model had been optimised on that very tool. And even if you avoid that, you could still fool yourself into thinking you didn’t optimise on the tool or on anything downstream of it.
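As a toy illustration of that dynamic (the network, task, and probe below are all made up for this sketch, not any real interpretability tool): a frozen linear probe stands in for the tool, and as soon as its reading is added to the training loss, the model learns to silence the probe while still doing its task, so the probe’s “all clear” stops meaning anything.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny regression task: predict the sum of 8 inputs.
X = torch.randn(512, 8)
y = X.sum(dim=1, keepdim=True)

model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))

# Stand-in "interpretability tool": a frozen linear probe on the hidden layer.
# Pretend it was trained beforehand to read out some worrying internal feature.
probe = nn.Linear(32, 1)
for p in probe.parameters():
    p.requires_grad_(False)

def hidden(x):
    return model[1](model[0](x))  # activations after the Tanh layer

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):
    task_loss = ((model(X) - y) ** 2).mean()
    probe_signal = probe(hidden(X)).abs().mean()
    # The moment the probe's reading enters the loss, we are optimising the
    # model against the tool rather than making the model more interpretable.
    loss = task_loss + probe_signal
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    print(f"task loss:    {((model(X) - y) ** 2).mean().item():.4f}")  # the model still does its task
    print(f"probe signal: {probe(hidden(X)).abs().mean().item():.4f}")  # the probe's reading gets optimised away
```

The point is not this particular toy, but that any safety signal that finds its way into the training loop, directly or via iterating on what it flags, gets Goodharted in exactly this way.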
You cannot learn by trial and error if you are not allowed to fail.
We could fire rockets at the moon because we could fail at many intermediate points; with AGI, we’re no longer allowed to fail past a certain capability threshold. Continuing to rely on the useful insights from failure is thus a doomed approach[1], and interpretability increases the use AGI labs get out of failure.
To start preparing for a world in which we’re not allowed to fail, we should build one where failing hurts AGI creators instead of helping them.
Interpretability should go to closed-source evaluators.
We should close-source interpretability and have trustworthy evaluators buy up new tools, using them solely to evaluate frontier models in a tripwire approach. These evaluators should then not tell AGI labs what failed, just that something failed[2]. Ideally evaluators would also get to block model deployments, but a good track record of warning about upcoming model failures[3] is a good start.
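Concretely, the information boundary I have in mind looks something like the sketch below (names and structure are hypothetical, not an existing system): the evaluator runs its battery of closed-source tripwires, keeps every detail internal, and hands the lab back a single pass/fail bit.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch of the reporting boundary between a closed-source
# evaluator and an AGI lab. All names here are made up for illustration.

@dataclass
class Verdict:
    passed: bool  # the only bit of information that ever leaves the evaluator

class ClosedSourceEvaluator:
    def __init__(self, tripwires: List[Callable[[object], bool]]):
        # Each tripwire is a closed-source interpretability check that returns
        # True when it detects a failure mode in the model under evaluation.
        self._tripwires = tripwires
        self._internal_findings: List[str] = []  # never shared with the lab

    def evaluate(self, model: object) -> Verdict:
        tripped = False
        for i, tripwire in enumerate(self._tripwires):
            if tripwire(model):
                tripped = True
                # Which check fired, and why, stays internal: the lab learns
                # only that *something* failed, so it cannot optimise against
                # the specific tool that caught it.
                self._internal_findings.append(f"tripwire {i} fired")
        return Verdict(passed=not tripped)

# Usage sketch: the lab submits a model and gets back a single pass/fail.
# evaluator = ClosedSourceEvaluator(tripwires=[deception_probe, planning_probe])
# verdict = evaluator.evaluate(frontier_model)  # verdict.passed is all they see
```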
AGI labs become incentivised to anticipate failure and lose the ability to argue anyone into thinking their models will be safe. They have to pass the tests like everyone else.
Evaluators get high-quality feedback on how well their tools predict model behaviour, since those tools are now being wielded as intended, and they learn which failure modes their tools still fail to cover.
How to monitor the evaluators, and how to keep them from developing a bad culture, is out of scope; I’ll just say that I will bet it’s easier to monitor them than it is to make optimisation-resistant interpretability tooling.
There is a balance to strike between the information security of the interpretability techniques in use and the prestige gained from issuing detailed warnings. But intelligence agencies demonstrate that this balance can be struck.