This is different from making a prediction that things will probably go badly.
Thinking about it, I really should have been more explicit about this before: I do think there’s a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).
Related: one place where I think I diverge from many/most people in the area is that I’m playing to win, not just to avoid losing. I see alignment not just as important for avoiding doom, but as plausibly the hardest part of unlocking most of the economic value of AGI.
My goal for AGI is to create tons of value and to (very very reliably) avoid catastrophic loss. I see alignment-in-the-sense-of-translation as the main bottleneck to achieving both of those simultaneously; I expect that both the value and the risk are dominated by exponentially large numbers of corner-cases.
I want to flag a note of confusion here—it feels like it should be possible for a mostly-aligned system to become more aligned, such that it never fails at any threshold (along the lines of there being a broad basin of corrigibility). But I haven’t really made this perspective play nicely with the perspective of alignment as translation.
This was exactly why I mentioned the distinction between “the AI has a good model of what humans want” and “the AI is programmed to actually do what humans want”. I haven’t been able to articulate it very well, but here are a few things which feel like they’re pointing to the same idea:
If our AI is learning what humans value by predicting some data, then it won’t matter how clever the AI is if the data-collection process is not robustly pointed at human values.
More generally, if the source-of-truth for human values does not correctly and robustly point to human values, no amount of clever AI architecture can overcome that problem (though note that the source-of-truth may include e.g. information about human values built into a prior).
Abram’s stuff on stable pointers to values
In translation terms, at some point we have to translate some directive for the AI, something of the form “do X”. X may include some mechanism for self-correction, but if that initial mechanism for self-correction is ever insufficient, there will not be any way to fix it later (other than starting over with a whole new AI).
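To make the point about the data-collection process / source-of-truth more concrete, here is a small toy sketch (my own construction, not something from this exchange; the “camera” setup and all numbers are made up): a learner that predicts its labels perfectly still inherits any systematic error in how those labels were collected, so extra cleverness cannot close the gap between the labels and the underlying values.

```python
import numpy as np

# Toy sketch (hypothetical setup, not from this exchange): labels come from a
# "camera" that is systematically fooled 10% of the time, so the data-collection
# process is not robustly pointed at the thing the human actually values.
rng = np.random.default_rng(0)
n = 100_000

actually_good = rng.integers(0, 2, n)            # the thing the human values
camera_fooled = rng.random(n) < 0.10             # systematic labeling error
labels = np.where(camera_fooled, 1 - actually_good, actually_good)

# Stand-in for an arbitrarily clever AI: a predictor that matches the labels
# exactly. (A less capable model could only do worse on the labels, never
# better on the values.)
perfect_label_predictor = labels.copy()

print("agreement with labels:", (perfect_label_predictor == labels).mean())          # 1.0
print("agreement with values:", (perfect_label_predictor == actually_good).mean())   # ~0.9
# The residual ~10% disagreement with the underlying values lives entirely in
# the data-collection process; no amount of model cleverness can remove it.
```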
Continuing with the translation analogy: suppose we could translate the directive “don’t take these instructions literally, use them as evidence to figure out what I want and then do that”—and of course other instructions would include further information about how to figure out what you want. That’s the sort of thing which would potentially give a broad(er) basin of alignment if we’re looking at the problem through a translation lens.
I do think there’s a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).
Ah, got it. In that case I think we broadly agree.
one place where I think I diverge from many/most people in the area is that I’m playing to win, not just to avoid losing.
Yeah, this is a difference. I don’t think it’s particularly decision-relevant for me personally given the problems we actually face, but certainly it makes a difference in other hypotheticals (e.g. in the translation post I suggested testing + reversibility as a solution; that’s much more about not losing than it is about winning).
Continuing with the translation analogy: suppose we could translate the directive “don’t take these instructions literally, use them as evidence to figure out what I want and then do that”—and of course other instructions would include further information about how to figure out what you want. That’s the sort of thing which would potentially give a broad(er) basin of alignment if we’re looking at the problem through a translation lens.
Yeah, I think that’s right. There’s also the directive “assist me” / “help me get what I want”. It feels like these should be easier to translate (though I can’t say what makes them different from all the other cases where I expect translation to be hard).
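As a purely illustrative sketch of the “use the instruction as evidence” directive discussed above (my own toy model, not something proposed in this thread; the goals, prior, and likelihoods are all made-up numbers), the contrast with a literal instruction-follower looks roughly like this:

```python
import numpy as np

# Minimal toy sketch of "don't take these instructions literally, use them as
# evidence to figure out what I want and then do that". Hypothetical setup:
# the human's true goal is one of three candidates, and the literal instruction
# "get me coffee" is treated as a noisy signal of that goal.
goals = ["fetch coffee", "fetch tea", "let the human sleep"]
prior = np.array([0.2, 0.2, 0.6])          # it's 2am; the human probably needs sleep

# How likely the human is to say "get me coffee" under each true goal
# (a sleepy human sometimes asks for coffee even when they'd rather sleep).
p_instruction_given_goal = np.array([0.80, 0.05, 0.30])

posterior = prior * p_instruction_given_goal
posterior /= posterior.sum()

literal_action = "fetch coffee"            # the instruction taken at face value
best_guess = goals[int(np.argmax(posterior))]
evidence_action = best_guess if posterior.max() > 0.7 else "ask a clarifying question"

print("posterior:", dict(zip(goals, posterior.round(3))))
print("literal agent:", literal_action)
print("instruction-as-evidence agent:", evidence_action)
```

In this toy setup the literal agent fetches the coffee, while the instruction-as-evidence agent notices its posterior over goals is too spread out and asks a clarifying question instead.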