ryan_greenblatt comments on Shallow review of live agendas in alignment & safety

ryan_greenblatt 28 Nov 2023 0:25 UTC
LW: 6 AF: 3
(Agreed, except that “inference-time safety techniques” feels overly limiting. It’s more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn’t discriminated by our validation set and other measurements. I hope this isn’t too incomprehensible, but don’t worry if it is; this point isn’t that important.)