Zach Stein-Perlman comments on Shallow review of live agendas in alignment & safety

Zach Stein-Perlman 27 Nov 2023 23:50 UTC
LW: 11 AF: 6
4
AF
It’s “a unified methodology” but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
- ryan_greenblatt 28 Nov 2023 0:25 UTC
  LW: 6 AF: 3
  0
  AF Parent
  (Agreed except that “inference-time safety techniques” feels overly limiting. It’s more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn’t discriminated by our validation set and other measurements. I hope this isn’t too incomprehensible, but don’t worry if it is, this point isn’t that important.)