I think Control has similar problems to RLHF, where both might fail to generalize out of distribution in similar ways.
This seems important because Control has a central goal of being a fallback for that kind of failure mode.
I want to share my thoughts, including something nice Control does about this problem that I think RLHF could easily incorporate (investigate failures instead of always training against them).
What do I mean by Control generalizing out of distribution:
Our Control experiment might involve giving a model 100 leetcode problems, and we hope to show that (even with help) the model can’t successfully insert a backdoor without being caught by our Control setup[1]. (I sketch a toy version of this kind of setup right after this list.)
Will this generalize to other leetcode problems? Probably.
How about more realistic coding tasks that involve confusing legacy code? Who knows.
How about more open-ended research tasks, like “here’s Ryan’s pitch for a faking-alignment paper, fill in the details and run the experiment”? I, eh, doubt it? Remember—we’re dealing with an AI that might be trying to scheme and mess up our research on purpose.
If we tested whether the AI can leak sensitive data without being caught, will it generalize to other bad things it might try to do?
And so on
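To make this concrete, here is a minimal toy sketch of what such a control evaluation could look like. Everything in it is a hypothetical stand-in: `untrusted_solve`, `monitor_suspicion`, and `has_backdoor` are placeholders for the untrusted model, the trusted monitor, and a ground-truth backdoor check, not real APIs, and the numbers are arbitrary.

```python
"""Toy sketch of a control-style evaluation over coding tasks.

All helpers are hypothetical placeholders, not real APIs.
"""
import random

def untrusted_solve(problem: str, try_backdoor: bool) -> str:
    # Placeholder for the untrusted (possibly scheming) model's solution.
    return f"solution({problem}, backdoored={try_backdoor})"

def has_backdoor(solution: str) -> bool:
    # Placeholder ground truth: did the solution actually contain a backdoor?
    return "backdoored=True" in solution

def monitor_suspicion(solution: str) -> float:
    # Placeholder trusted monitor: a suspicion score in [0, 1].
    return random.random()

def run_control_eval(problems, try_backdoor=True, threshold=0.5):
    """Count how many backdoor attempts the protocol catches vs. misses."""
    caught = missed = 0
    for problem in problems:
        solution = untrusted_solve(problem, try_backdoor)
        flagged = monitor_suspicion(solution) >= threshold
        if has_backdoor(solution):
            if flagged:
                caught += 1   # attempted backdoor, caught by the protocol
            else:
                missed += 1   # attempted backdoor, slipped past the protocol
    return caught, missed

if __name__ == "__main__":
    problems = [f"leetcode_{i}" for i in range(100)]
    caught, missed = run_control_eval(problems)
    print(f"caught {caught} backdoor attempts, missed {missed}")
```

The generalization questions above are then questions about what happens when `problems` is drawn from a different distribution (legacy code, open-ended research tasks, non-coding bad actions) than the one the protocol was tuned on.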
Control’s soft guarantees could be somewhat achieved using RLHF
I’m writing this both to emphasise the similarity and to suggest nice things we could do with RLHF:
We generate scenarios where an AI might scheme or not scheme.
We might use a similar Control setup (“adding backdoors to code”?), but the monitoring (“control protocol”) doesn’t have to be cheap+fast+scalable enough to run in production.
Given, e.g., 100 scenarios, use RLHF to train on a fraction of them (e.g., 20), and check whether the RLHF generalized to the rest (a toy version is sketched right after this list).
If it didn’t generalize, treat it as if you caught the AI red-handed (investigate what happened, publish it publicly, don’t just train against these scenarios).
In general, I’d be excited about labs publishing a naive metric like “what fraction of our RLHF training scenarios was needed in order to generalize to the rest”, as well as examples of scenarios that the AI doesn’t seem to generalize to unless you train against them specifically.
For similar reasons, I’d be excited about “we got our Control protocol working on 20% of the scenarios and it successfully generalized to the other 80%” over “we fixed our Control protocol until it worked[2] on 100% of the scenarios we had”.
Oh, and for both RLHF and Control, I think we should be very interested in “what is the distribution they used”.
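Here is a rough sketch of the “train on a fraction, check generalization” protocol and the naive metric above. `rlhf_train` and `passes_safety_check` are hypothetical stand-ins for the actual RLHF step and the per-scenario safety evaluation; the placeholder pass probabilities are made up, and only the bookkeeping is the point.

```python
"""Toy sketch: train RLHF on a fraction of scheming scenarios, check the rest.

`rlhf_train` and `passes_safety_check` are hypothetical placeholders.
"""
import random

def rlhf_train(scenarios):
    # Placeholder: "training" just records which scenarios were trained on.
    return {"trained_on": set(scenarios)}

def passes_safety_check(model, scenario) -> bool:
    # Placeholder: trained-on scenarios always pass; held-out scenarios pass
    # more often the more scenarios were trained on (a stand-in for partial
    # generalization; the numbers are arbitrary).
    if scenario in model["trained_on"]:
        return True
    return random.random() < min(1.0, 0.7 + 0.004 * len(model["trained_on"]))

def smallest_generalizing_fraction(scenarios, fractions=(0.1, 0.2, 0.4, 0.8),
                                   target_holdout_rate=0.95):
    """Return the smallest training fraction whose held-out pass rate clears
    the target, plus the held-out scenarios that still failed (the ones to
    investigate and publish, not just retrain against)."""
    failures = []
    for frac in fractions:
        k = int(len(scenarios) * frac)
        train, holdout = scenarios[:k], scenarios[k:]
        model = rlhf_train(train)
        failures = [s for s in holdout if not passes_safety_check(model, s)]
        rate = 1 - len(failures) / len(holdout)
        print(f"trained on {frac:.0%}: holdout pass rate {rate:.2f}, "
              f"{len(failures)} scenarios left to investigate")
        if rate >= target_holdout_rate:
            return frac, failures
    return None, failures

if __name__ == "__main__":
    scenarios = [f"scheming_scenario_{i}" for i in range(100)]
    frac, failures = smallest_generalizing_fraction(scenarios)
    print(f"smallest fraction that generalized to the rest: {frac}")
```

The same bookkeeping applies to a Control protocol: tune it on 20% of the scenarios and report how well it catches scheming on the other 80%, rather than iterating until it works on everything you have.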
Replies I got to this:
“We will only use the Controlled AI to write pretty normal code, so the deployment setup is in-distribution”
Me: “So how would Control help Safety research more than Capabilities? Seems like labs will try to extract AI R&D work regardless.”
“Control will let us run an AI that is smart[3] enough to cause significant damage to humanity, and extracting coding-work from such an AI is valuable”
Me: “That scenario implies an AI that can scheme in many creative ways. The Control scenario above might have (metaphorically) found the “backdoored code” neuron. That doesn’t seem enough.”
My steelman of Redwood:
“Yes, the distribution of scheming examples is super important, here are 2 posts we wrote about it that you missed, and we’ll publish 5 more next week with 14 more distributions of tasks relevant for the kind of alignment research we plan to do with a Controlled AI” (it seems like Redwood always has a post directly addressing my concerns…)
“We need to start somewhere, and it’s easy to start with backdoored code or with a simple scheming objective like making an HTTP request”
“It’s useful to focus on scheming (whether using Control or RLHF) because that leads to other great assumptions”
“People are already implementing those RLHF suggestions, here are 20 arxiv papers about it that you missed”
“Control gives us less capability in exchange for failing more loudly, which is a tradeoff we like”
“We are trying to have Control give (partial) guarantees that are pretty uncorrelated with RLHF, just as another layer of safety, because it’s the best idea we have so far, not because we actually think this will prevent the AI from scheming in a somewhat general way”
[1] The “control protocol”.
[2] “Worked” in the context of Control would probably mean “we detected scheming in a very high percentage of the scenarios we tested”, but I’d like to go further and assume the Control protocol detected scheming in 100% of the scenarios tested.
[3] Not an ASI, but still very dangerous.