ryan_greenblatt comments on Shallow review of live agendas in alignment & safety

ryan_greenblatt 27 Nov 2023 21:27 UTC
LW: 4 AF: 2
Yep, indeed I would consider “control evaluations” to be a method of “AI control”. I consider the evaluation and the technique development to be part of a unified methodology (we’ll describe this more in a forthcoming post).

(I work at Redwood Research.)
What links here?
- Shallow review of live agendas in alignment & safety by technicalities (27 Nov 2023 11:10 UTC; 322 points)
- Shallow review of live agendas in alignment & safety by Gavin (EA Forum; 27 Nov 2023 11:33 UTC; 76 points)
- Zach Stein-Perlman 27 Nov 2023 23:50 UTC
  LW: 11 AF: 6
  It’s “a unified methodology” but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
  - ryan_greenblatt 28 Nov 2023 0:25 UTC
    LW: 6 AF: 3
    (Agreed, except that “inference-time safety techniques” feels overly limiting. It’s more like purely behavioral (black-box) safety techniques, where we can evaluate training by converting it to validation. Then we imagine we get the worst model that isn’t discriminated by our validation set and other measurements. I hope this isn’t too incomprehensible, but don’t worry if it is; this point isn’t that important.)