I think it extremely probable that there exist policies which exploit adversarial inputs to J, such that they can do bad stuff while getting J to say "all's fine."
For most Js I agree, but the existence of any adversarial examples for J would be an outer alignment problem (you get what you measure). (For outer alignment, it seems necessary that there exist—and that humans discover—natural abstractions relative to formal world models that robustly pick out at least the worst stuff.)