Yes, you want the patient to appear on camera for the normal reason, but you don’t want the patient to remain healthy for the normal reason.
We describe a possible strategy for handling this issue in the appendix. I feel more confident about the choice of research focus than I do about whether that particular strategy will work out. My main reasons for that confidence are: ELK and deceptive alignment are already challenging and useful to solve even when there is no such distributional shift; those challenges capture at least some central alignment difficulties; the kind of strategy described in the post is at least plausible; and, as a result, it’s unlikely to be possible to say very much about the distributional shift case before solving the simpler case.
If the overall approach fails, I currently think it’s most likely either because we can’t define what we mean by explanation or because we can’t find explanations for key model behaviors.