LawrenceC comments on Shallow review of live agendas in alignment & safety

LawrenceC 27 Nov 2023 21:00 UTC
LW: 3 AF: 3
0
AF
Expanding on this—this whole area is probably best known as “AI Control”, and I’d lump it under “Control the thing” as its own category. I’d also move Control Evals to this category as well, though someone at RR would know better than I.
- ryan_greenblatt 27 Nov 2023 21:27 UTC
  LW: 4 AF: 2
  2
  AF Parent
  Yep, indeed I would consider “control evaluations” to be a method of “AI control”. I consider the evaluation and the technique development to be part of a unified methodology (we’ll describe this more in a forthcoming post).
  
  (I work at RR)
  - Zach Stein-Perlman 27 Nov 2023 23:50 UTC
    LW: 11 AF: 6
    4
    AF Parent
    It’s “a unified methodology” but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
    - ryan_greenblatt 28 Nov 2023 0:25 UTC
      LW: 6 AF: 3
      0
      AF Parent
      (Agreed except that “inference-time safety techniques” feels overly limiting. It’s more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn’t discriminated by our validation set and other measurements. I hope this isn’t too incomprehensible, but don’t worry if it is, this point isn’t that important.)