I want to point out that I think the typical important case looks more like “wanting to do things for unusual reasons,” and if you’re worried about this approach breaking down there, that seems like a pretty central obstacle. For example, suppose rather than trying to maintain a situation (the diamond stays in the vault) we’re trying to extrapolate (like coming up with a safe cancer cure). When looking at a novel medication to solve an unsolved problem, we won’t be able to say “well, it cures the cancer for the normal reason” because there aren’t any positive examples to compare to (or they’ll be identifiably different).
It might still work out, because when we ask “is the patient healthy?” there is something like “the normal reason” there. [But then maybe it doesn’t work for Dyson sphere designs, and so on.]
Yes, you want the patient to appear on camera for the normal reason, but you don’t want the patient to remain healthy for the normal reason.
We describe a possible strategy for handling this issue in the appendix. I feel more confident about the choice of research focus than I do about whether that particular strategy will work out. The main reasons: I think ELK and deceptive alignment are already challenging and useful to solve even when there is no such distributional shift; those challenges capture at least some central alignment difficulties; the kind of strategy described in the post is at least plausible; and as a result it’s unlikely to be possible to say very much about the distributional-shift case before solving the simpler case.
If the overall approach fails, I currently think it’s most likely either because we can’t define what we mean by “explanation” or because we can’t find explanations for key model behaviors.