Yep, indeed I would consider “control evaluations” to be a method of “AI control”. I consider the evaluation and the technique development to be part of a unified methodology (we’ll describe this more in a forthcoming post).
It’s “a unified methodology” but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
(Agreed except that “inference-time safety techniques” feels overly limiting. It’s more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn’t discriminated by our validation set and other measurements. I hope this isn’t too incomprehensible, but don’t worry if it is, this point isn’t that important.)
(I work at RR)