you should never get deception in the limit of infinite data (since a deceptive model has to defect on some data point).
I think a model can be deceptively aligned even if, formally, it maps every possible input to the correct (safe) output. For example, suppose that when run on input X, the inference process hacks the computer it is executing on in order to do arbitrary consequentialist stuff (while the inference logic, as a mathematical object, formally yields the correct output for X).
Sure—we’re just trying to define things in the abstract here, though, so there’s no harm in just defining the model’s output to include stuff like that as well.