you should never get deception in the limit of infinite data (since a deceptive model has to defect on some data point).
I think a model can be deceptively aligned even if, formally, it maps every possible input to the correct (safe) output. For example, suppose that when run on input X, the inference process hacks the computer it is executing on in order to do arbitrary consequentialist stuff (while the inference logic, as a mathematical object, formally yields the correct output for X).
Sure—we’re just trying to define things in the abstract here, though, so there’s no harm in just defining the model’s output to include stuff like that as well.