The part that gives me the most worry is the section on General Hopes.
General hopes. Our plan is based on some general hopes:
1. The most harmful outcomes happen when the AI “knows” it is doing something that we don’t want, so mitigations can be targeted at this case.
2. Our techniques don’t have to stand up to misaligned superintelligences — the hope is that they make a difference while the training process is in the gray area, not after it has reached the red area.
3. In terms of directing the training process, the game is skewed in our favour: we can restart the search, examine and change the model’s beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.
4. Interpretability is hard but not impossible.
5. We can train against our alignment techniques and get evidence on whether the AI systems deceive our techniques. If we get evidence that they are likely to do that, we can use this to create demonstrations of bad behavior for decision-makers.
That seems like quite a lot of hopes, which I very much do not expect to fully get, so the question is to what extent these are effectively acting as assumptions versus simply things that we hope for because they would make things easier.
Speaking on behalf of myself, rather than the full DeepMind alignment team:
Hmm, I view most of these as clearly-true statements that should give us hope, rather than assumptions. Which ones are you worried about us not getting? For more detail:
1. I could see some qualms about the word “knows” here—for an elaboration see this comment and its followups. But it still seems pretty clearly true in most doom stories.
2. This is a claim about the theory of change, so it feels like a type error to think of it as an assumption. The reason I say “hope” is because doomers frequently say “but a misaligned superintelligence would just deceive your alignment technique” and I think this sort of theory of change should give doomers more hope if they hadn’t previously thought of it.
3. There is a hope here in the sense that we hope to have interpretability techniques and so on, but the underlying thing is roughly “we start out in a position of control over the AI, and have lots of affordances that can help control the AI, such as being able to edit its brain state directly, observe what it would do in counterfactual scenarios, etc.” This seems clearly true and is a pretty big disanalogy with previous situations where we try to align things (e.g. a company trying to align its employees), and that’s roughly what I mean when I say “the game is skewed in our favor”.
4. It seems clearly true that interpretability is not impossible. It might be well beyond our ability to do in the time we have, but it’s not impossible. (Why mention this? Because occasionally people will say things like “alignment is impossible” and I think interpretability is the most obvious way you could see how you might align a system in principle.)
5. This sort of thing has already been done; it isn’t a speculative thing. In some sense there’s a hope here of “and we’ll do it better in the future, and also that will matter” but that’s going to be true of all research.
1. I see why one might think this is a mostly safe assumption, but it doesn’t seem like one to me—it’s kind of presuming that some sort of common-sense morality can be used as a check here, even under Goodhart conditions, and I don’t think it would be that reliable in most doom cases? I’m trying to operationalize what this would mean in practice, in a way that a sufficiently strong AI wouldn’t find a way around via convincing itself or us, or via indirect action, or similar, and I can’t.
2. This implies that you think that if you win in the grey area, you know how to use that to avoid losing in the red area? Perhaps via some pivotal act? I notice I am confused about why you believe that. The assumption here is more that ‘the grey area solution would be sufficient’ for some value of grey, and I’m curious both why that would work and what value of grey would count. This would do a lot of work for me if I bought into it.
3. Yes, these are advantages, and as stated this seems closer to a correct assumption than the others. I’m still worried it’s being used to carry in the other assumptions about being able to make use of the advantage, which interacts a lot with #2, I think?
4. It depends on what value of impossible. It’s definitely not fully impossible, and yes, that is more hopeful than the alternative. It still seems at least shut-up-and-do-the-impossible levels of impossible.
5. As strictly stated it’s not an assumption; it’s more that I read it as implying the ‘and when it is importantly trying to do this, our techniques will notice’ part, which I indeed hope for but don’t have much confidence in.
Does that help with where my head is at?
Yes, that all seems reasonable.
I think I have still failed to communicate on (1). I’m not sure what the relevance of common-sense morality is, and if a strong AI is thinking about finding a way to convince itself or us, that’s already the situation I want to detect and stop. (Obviously it’s not clear that I can detect it, but the claim here is just that the facts of the matter are present to be detected.) But it’s probably not worth going into more detail here.
On (2), the theory of change is that you don’t get into the red area, which I agree is equivalent to “the grey area solution would be sufficient”. I’m not imagining pivotal acts here. The key point is that before you are in the red area, you can’t appeal to “but the misaligned superintelligence would just defeat your technique via X” as a reason that the technique would fail. Personally, I think it’s (non-trivially) more likely that you don’t get to the red areas the more you catch early examples of deception, which is why I like this theory of change. I expect vehement disagreement there, and I would really like to see arguments for this position that don’t go via “but the misaligned superintelligence could do X”. I’ve tried this previously (e.g. this discussion with Eliezer) and haven’t really been convinced. (Tbc, Eliezer did give arguments that don’t go through “the misaligned superintelligence could do X”; I’m just not that convinced by them.)
I basically agree with (3), (4) and (5). I do expect I’m more optimistic than you about how useful or tractable each of those things is. As a result, I expect that, given your beliefs, my plans would look to you like they are relying more on (3), (4), and (5) than would be ideal (in the sense that I expect you’d want to divert some effort to other things that we probably both agree are very hard and not that likely to work, but that look better on the margin to you relative to the things that would be part of my plans).
I do still want to claim that this is importantly different from treating them as assumptions, even under your beliefs rather than mine.