I see why one might think this is a mostly safe assumption, but it doesn’t seem like one to me. It’s kind of presuming that some sort of common sense morality can be used as a check here, even under Goodhart conditions, and I don’t think it would be that reliable in most doom cases? I’m trying to operationalize what this would mean in practice, in a way a sufficiently strong AI wouldn’t find a way around (via convincing itself or us, via indirect action, or similar), and I can’t.
This implies that you think that if you win the grey area, you know how to use that to not lose in the red area? Perhaps via some pivotal act? I notice I am confused about why you believe that. The assumption here is more ‘the grey area solution would be sufficient’ for some value of grey, and I’m curious both why that would work and what value of grey would count. It would do a lot of work for me if I bought into this.
Yes, these are advantages, and this seems closer to a correct assumption as stated than the others. I’m still worried it’s being used to carry in the other assumptions about being able to make use of the advantage, which interacts a lot with #2, I think?
Depends what value of impossible. It’s definitely not fully impossible, and yes that is more hopeful than the alternative. It seems at least shut-up-and-do-the-impossible levels of impossible.
As stated strictly it’s not an assumption; it’s more that I read it as implying the ‘and when it is importantly trying to do this, our techniques will notice’ part, which I indeed hope for but don’t have much confidence in.
Does that help with where my head is at?
Yes, that all seems reasonable.
I think I have still failed to communicate on (1). I’m not sure what the relevance of common sense morality is, and if a strong AI is thinking about finding a way to convince itself or us, that’s already the situation I want to detect and stop. (Obviously it’s not clear that I can detect it, but the claim here is just that the facts of the matter are present to be detected.) But probably it’s not worth going into detail here.
On (2), the theory of change is that you don’t get into the red area, which I agree is equivalent to “the grey area solution would be sufficient”. I’m not imagining pivotal acts here. The key point is that before you are in the red area, you can’t appeal to “but the misaligned superintelligence would just defeat your technique via X” as a reason that the technique would fail. Personally, I think it’s (non-trivially) more likely that you don’t get to the red areas the more you catch early examples of deception, which is why I like this theory of change. I expect vehement disagreement there, and I would really like to see arguments for this position that don’t go via “but the misaligned superintelligence could do X”. I’ve tried this previously (e.g. this discussion with Eliezer) and haven’t really been convinced. (Tbc, Eliezer did give arguments that don’t go through “the misaligned superintelligence could do X”; I’m just not that convinced by them.)
I basically agree with (3), (4) and (5). I do expect I’m more optimistic than you about how useful or tractable each of those things is. As a result, I expect that given your beliefs, my plans would look to you like they are relying more on (3), (4), and (5) than would be ideal (in the sense that I expect you’d want to divert some effort to other things that we probably both agree are very hard and not that likely to work, but that look better on the margin to you relative to the things that would be part of my plans).
I do still want to claim that this is importantly different from treating them as assumptions, even under your beliefs rather than mine.