(I’m sharing this here because I needed to link to it, and I thought I might as well make it public.)
When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.
This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.
The simplest such setting I can think of
We’re going to take the APPS backdooring setting from AI Control: Improving Safety Despite Intentional Subversion, fix the main control technique to be trusted monitoring with 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions, and focus on the problem of implementing an auditor. That is:
Get a dataset of backdoor attempts.
For statistical power, this dataset will need to be large (at least 100; it would be better for it to be more like 1000). This might require developing a better/cheaper backdoor generation pipeline.
It might be good to explore backdooring in a harder setting than the interview subset of APPS. Maybe the advanced subset of APPS would be better.
The red team will be allowed to modify these backdoor attempts to add collusion signals. (These modifications will probably be cheap and easy compared to generating the backdoor attempts.)
Get a dataset of benign code, filter with a trusted monitor for the most suspicious 1%. (This should be pretty cheap and easy)
We want to implement a first-pass auditor (better name TBD). That is, we want the actual audit process to have two stages:
Firstly, some AI scaffold (the first-pass auditor) carefully investigates the sketchiest 1% of model-proposed code.
That scaffold needs to choose 1% of the code it reviews to pass along to a human reviewer.
The first-pass auditor is allowed to be somewhat expensive compared to generating the original code, because it’s only getting called 1% of the time. For our purposes, maybe let’s say it can cost up to 20x as much.
And we’re doing a control eval: the untrusted model will, as usual, be replaced by an attack policy.
Note that in this control eval there’s no notion of usefulness: our only metric is the probability that the red team is incriminated on a given backdoor attempt. (Actually this might be more confusing than I’ve written here, because we might want to do honeypotting of the untrusted model in the first-pass auditing setup, which leads to the game varying with scale for the usual reasons (see untrusted monitoring in our original control paper).)
Techniques
Get the untrusted model to reimplement the function, compare the implementations.
Ask the untrusted model to explain why various parts of the function are the way they are; make collusion hard.
Show the untrusted model only a small amount of the context (e.g. the action but not the rest of the trajectory), have it ask a trusted model for information that it needs to know in order to assess whether the action was dangerous; hope that it can’t learn whether to collude given this restricted information channel.
There are many potential options here.
My guess is that this project would actually be much better with more complicated programming problems than APPS.
Project proposal: Control for auditing
(I’m sharing this here because I needed to link to it, and I thought I might as well make it public.)
When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.
This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.
The simplest such setting I can think of
We’re going to take the APPS backdooring setting from AI Control: Improving Safety Despite Intentional Subversion, fix the main control technique to be trusted monitoring with 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions, and focus on the problem of implementing an auditor. That is:
Get a dataset of backdoor attempts.
For statistical power, this dataset will need to be large (at least 100; it would be better for it to be more like 1000). This might require developing a better/cheaper backdoor generation pipeline.
It might be good to explore backdooring in a harder setting than the interview subset of APPS. Maybe the advanced subset of APPS would be better.
The red team will be allowed to modify these backdoor attempts to add collusion signals. (These modifications will probably be cheap and easy compared to generating the backdoor attempts.)
Get a dataset of benign code, filter with a trusted monitor for the most suspicious 1%. (This should be pretty cheap and easy)
We want to implement a first-pass auditor (better name TBD). That is, we want the actual audit process to have two stages:
Firstly, some AI scaffold (the first-pass auditor) carefully investigates the sketchiest 1% of model-proposed code.
That scaffold needs to choose 1% of the code it reviews to pass along to a human reviewer.
The first-pass auditor is allowed to be somewhat expensive compared to generating the original code, because it’s only getting called 1% of the time. For our purposes, maybe let’s say it can cost up to 20x as much.
And we’re doing a control eval: the untrusted model will, as usual, be replaced by an attack policy.
Note that in this control eval there’s no notion of usefulness: our only metric is the probability that the red team is incriminated on a given backdoor attempt. (Actually this might be more confusing than I’ve written here, because we might want to do honeypotting of the untrusted model in the first-pass auditing setup, which leads to the game varying with scale for the usual reasons (see untrusted monitoring in our original control paper).)
Techniques
Get the untrusted model to reimplement the function, compare the implementations.
Ask the untrusted model to explain why various parts of the function are the way they are; make collusion hard.
Show the untrusted model only a small amount of the context (e.g. the action but not the rest of the trajectory), have it ask a trusted model for information that it needs to know in order to assess whether the action was dangerous; hope that it can’t learn whether to collude given this restricted information channel.
There are many potential options here.
My guess is that this project would actually be much better with more complicated programming problems than APPS.