I think there’s another agenda, roughly: make untrusted models safe but useful by putting them in a scaffolding/bureaucracy of filters, classifiers, LMs, humans, etc., such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023), plus the sketch after the list below. Other relevant work:
AI safety with current science (Shlegeris 2023)
Preventing Language Models From Hiding Their Reasoning (Roger and Greenblatt 2023)
Coup probes (Roger 2023)
(Some other work boosts this agenda but isn’t motivated by it — perhaps activation engineering and chain-of-thought)
[Edit: now AI Control (Shlegeris et al. 2023) and Catching AIs red-handed (Greenblatt and Shlegeris 2024).]
[Edit: sequence on this control agenda.]
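To make the scaffolding idea concrete, here’s a minimal sketch of one such pipeline in the spirit of trusted monitoring. The names (untrusted_model, trusted_monitor, human_audit) and the threshold are illustrative placeholders I’m assuming for the example, not anyone’s actual implementation:

```python
# Minimal sketch of an inference-time control scaffold (hypothetical names throughout):
# a capable-but-untrusted model does the useful work, a weaker trusted model scores
# its proposed action for suspiciousness, and high-scoring actions are escalated to
# a human audit instead of being executed.

SUSPICION_THRESHOLD = 0.8  # hypothetical tuning parameter


def untrusted_model(task: str) -> str:
    """Placeholder for the capable-but-untrusted model's proposed action."""
    raise NotImplementedError


def trusted_monitor(task: str, proposed_action: str) -> float:
    """Placeholder for a weaker trusted model returning a suspiciousness score in [0, 1]."""
    raise NotImplementedError


def human_audit(task: str, proposed_action: str) -> bool:
    """Placeholder for a human reviewer; returns True if the action is approved."""
    raise NotImplementedError


def run_with_control(task: str) -> str | None:
    """Return an action to execute, or None if it was caught and blocked."""
    action = untrusted_model(task)
    suspicion = trusted_monitor(task, action)
    if suspicion < SUSPICION_THRESHOLD:
        return action  # low suspicion: execute without human involvement
    # High suspicion: escalate to a human. A blocked attempt is also evidence
    # that the model tried something, which is the "more likely to be caught" part.
    return action if human_audit(task, action) else None
```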
Explicitly noting for the record we have some forthcoming work on AI control which should be out relatively soon.
(I work at RR)
This is an excellent description of my primary work, for example:
Internal independent review for language model agent alignment
That post proposes calling new instances or different models to review plans and internal dialogue for alignment, but it also discusses the several layers of safety scaffolding that have been proposed elsewhere.
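Roughly how I picture that review loop, as a sketch; the reviewer interface and the voting rule here are my own illustrative assumptions, not the post’s exact design:

```python
# Sketch of internal independent review (illustrative assumptions, not the post's exact design):
# before an agent executes a plan, fresh instances of one or more models review the plan
# plus the agent's internal dialogue and vote on whether it looks aligned.

from typing import Callable

# A reviewer takes (plan, internal_dialogue) and returns True if it judges them aligned.
Reviewer = Callable[[str, str], bool]


def review_before_acting(
    plan: str,
    internal_dialogue: str,
    reviewers: list[Reviewer],
    min_approvals: int,
) -> bool:
    """Return True only if enough independent reviewers approve the plan."""
    approvals = sum(reviewer(plan, internal_dialogue) for reviewer in reviewers)
    return approvals >= min_approvals


# Usage sketch (hypothetical stand-ins): two fresh instances of the agent's own model
# plus one different model, requiring unanimous approval before the plan runs.
# approved = review_before_acting(plan, dialogue, [fresh_instance_a, fresh_instance_b, other_model], 3)
```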
This post is amazingly useful. Integrative/overview work is often thankless, but I think it’s invaluable for understanding where effort is going, and thinking about gaps where more of it should be devoted. So thank you, thank you!
I like this. It’s like a structural version of control evaluations. Will think about where to put it in.
Expanding on this: this whole area is probably best known as “AI Control”, and I’d lump it under “Control the thing” as its own category. I’d also move Control Evals to this category, though someone at RR would know better than I.
Yep, indeed I would consider “control evaluations” to be a method of “AI control”. I consider the evaluation and the technique development to be part of a unified methodology (we’ll describe this more in a forthcoming post).
(I work at RR)
It’s “a unified methodology” but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
(Agreed except that “inference-time safety techniques” feels overly limiting. It’s more like purely behavioral (black-box) safety techniques, where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn’t discriminated by our validation set and other measurements. I hope this isn’t too incomprehensible, but don’t worry if it is, this point isn’t that important.)