I think I have still failed to communicate on (1). I’m not sure what the relevance of common sense morality is, and if a strong AI is thinking about finding a way to convince itself or us, that’s already the situation I want to detect and stop. (Obviously it’s not clear that I can detect it, but the claim here is just that the facts of the matter are present to be detected.) But it’s probably not worth going into much detail here.
On (2), the theory of change is that you don’t get into the red area, which I agree is equivalent to “the grey area solution would be sufficient”. I’m not imagining pivotal acts here. The key point is that before you are in the red area, you can’t appeal to “but the misaligned superintelligence would just defeat your technique via X” as a reason the technique would fail. Personally, I think it’s (non-trivially) more likely that you don’t get into the red area the more early examples of deception you catch, which is why I like this theory of change. I expect vehement disagreement there, and I would really like to see arguments for the opposing position that don’t go via “but the misaligned superintelligence could do X”. I’ve tried this previously (e.g. this discussion with Eliezer) and haven’t really been convinced. (To be clear, Eliezer did give arguments that don’t go through “the misaligned superintelligence could do X”; I’m just not that convinced by them.)
I basically agree with (3), (4) and (5), though I expect I’m more optimistic than you about how useful and tractable each of those things is. As a result, I expect that given your beliefs, my plans would look to you like they rely more on (3), (4), and (5) than would be ideal (in the sense that I expect you’d want to divert some effort to other things that we probably both agree are very hard and not that likely to work, but that look better to you on the margin relative to the things that would be part of my plans).
I do still want to claim that this is importantly different from treating them as assumptions, even under your beliefs rather than mine.
Yes, that all seems reasonable.