I think there’s another agenda, roughly: make untrusted models safe but useful by putting them in a scaffolding/bureaucracy of filters, classifiers, LMs, humans, etc., such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023), plus the sketch after the list below. Other relevant work:
AI safety with current science (Shlegeris 2023)
Preventing Language Models From Hiding Their Reasoning (Roger and Greenblatt 2023)
Coup probes (Roger 2023)
(Some other work boosts this agenda but isn’t motivated by it — perhaps activation engineering and chain-of-thought)
[Edit: now AI Control (Shlegeris et al. 2023) and Catching AIs red-handed (Greenblatt and Shlegeris 2024).]
[Edit: sequence on this control agenda.]
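To make the scaffolding idea concrete, here’s a minimal sketch of one such pipeline in the spirit of trusted monitoring. The names (untrusted_model, trusted_monitor, human_audit) and the threshold are illustrative placeholders I’m assuming for the example, not anyone’s actual implementation:

```python
# Minimal sketch of an inference-time control scaffold (hypothetical names throughout):
# a capable-but-untrusted model does the useful work, a weaker trusted model scores
# its proposed action for suspiciousness, and high-scoring actions are escalated to
# a human audit instead of being executed.

SUSPICION_THRESHOLD = 0.8  # hypothetical tuning parameter


def untrusted_model(task: str) -> str:
    """Placeholder for the capable-but-untrusted model's proposed action."""
    raise NotImplementedError


def trusted_monitor(task: str, proposed_action: str) -> float:
    """Placeholder for a weaker trusted model returning a suspiciousness score in [0, 1]."""
    raise NotImplementedError


def human_audit(task: str, proposed_action: str) -> bool:
    """Placeholder for a human reviewer; returns True if the action is approved."""
    raise NotImplementedError


def run_with_control(task: str) -> str | None:
    """Return an action to execute, or None if it was caught and blocked."""
    action = untrusted_model(task)
    suspicion = trusted_monitor(task, action)
    if suspicion < SUSPICION_THRESHOLD:
        return action  # low suspicion: execute without human involvement
    # High suspicion: escalate to a human. A blocked attempt is also evidence
    # that the model tried something, which is the "more likely to be caught" part.
    return action if human_audit(task, action) else None
```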
Explicitly noting for the record we have some forthcoming work on AI control which should be out relatively soon.
(I work at RR)
This is an excellent description of my primary work, for example:
Internal independent review for language model agent alignment
That post proposes calling new instances or different models to review plans and internal dialogue for alignment, but it also discusses the several layers of safety scaffolding that have been proposed elsewhere.
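Roughly how I picture that review loop, as a sketch; the reviewer interface and the voting rule here are my own illustrative assumptions, not the post’s exact design:

```python
# Sketch of internal independent review (illustrative assumptions, not the post's exact design):
# before an agent executes a plan, fresh instances of one or more models review the plan
# plus the agent's internal dialogue and vote on whether it looks aligned.

from typing import Callable

# A reviewer takes (plan, internal_dialogue) and returns True if it judges them aligned.
Reviewer = Callable[[str, str], bool]


def review_before_acting(
    plan: str,
    internal_dialogue: str,
    reviewers: list[Reviewer],
    min_approvals: int,
) -> bool:
    """Return True only if enough independent reviewers approve the plan."""
    approvals = sum(reviewer(plan, internal_dialogue) for reviewer in reviewers)
    return approvals >= min_approvals


# Usage sketch (hypothetical stand-ins): two fresh instances of the agent's own model
# plus one different model, requiring unanimous approval before the plan runs.
# approved = review_before_acting(plan, dialogue, [fresh_instance_a, fresh_instance_b, other_model], 3)
```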
This post is amazingly useful. Integrative/overview work is often thankless, but I think it’s invaluable for understanding where effort is going, and thinking about gaps where more of it should be devoted. So thank you, thank you!
I like this. It’s like a structural version of control evaluations. Will think about where to put it in.
Expanding on this: this whole area is probably best known as “AI Control”, and I’d lump it under “Control the thing” as its own category. I’d also move Control Evals to this category, though someone at RR would know better than I.
Yep, indeed I would consider “control evaluations” to be a method of “AI control”. I consider the evaluation and the technique development to be part of a unified methodology (we’ll describe this more in a forthcoming post).
(I work at RR)
It’s “a unified methodology” but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
(Agreed except that “inference-time safety techniques” feels overly limiting. It’s more like purely behavioral (black-box) safety techniques, where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn’t discriminated by our validation set and other measurements. I hope this isn’t too incomprehensible, but don’t worry if it is, this point isn’t that important.)