We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.
also seems worth disambiguating “conservative evaluations” from “control evaluations”. In particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions. (To be fair, the distinction isn’t all that clean: your oversight process can both train and monitor a policy. Still, I associate control more closely with monitoring, and conservative evaluations have a broader scope.)
We would consider “black-box conservative evaluations on whether our AIs could subvert our (scalable) oversight techniques” to be a special case of black-box conservative control evaluations.
Insofar as you are exploring assumptions other than “the AI is trying to subvert our safeguards”, I wouldn’t consider it to be control.
makes sense. I think I had in mind something like “estimate P(scheming | high reward) by evaluating P(high reward | scheming)”. But evaluating P(high reward | scheming) is itself a black-box conservative control evaluation; the possible updates to P(scheming) are just a nice byproduct.
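To make the implicit Bayes step concrete, here is a minimal sketch. All the numbers are made up for illustration; the only thing the sketch assumes from the discussion is that a control evaluation gives you an estimate of P(high reward | scheming), which you can then combine with a prior to update P(scheming):

```python
def posterior_scheming(p_scheming: float,
                       p_high_given_scheming: float,
                       p_high_given_aligned: float) -> float:
    """Bayes update: P(scheming | high reward) from the prior and the
    two likelihoods of observing high reward."""
    numerator = p_high_given_scheming * p_scheming
    evidence = numerator + p_high_given_aligned * (1.0 - p_scheming)
    return numerator / evidence

# Illustrative numbers (not from the discussion): prior P(scheming) = 0.1;
# the control evaluation estimates P(high reward | scheming) = 0.9, while an
# aligned policy achieves high reward with probability 0.5.
print(posterior_scheming(0.1, 0.9, 0.5))  # ≈ 0.167
```

The point of the sketch is just that the P(scheming) update is downstream of the quantity the evaluation directly measures, P(high reward | scheming).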