I do think there’s a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).
Ah, got it. In that case I think we broadly agree.
one place where I think I diverge from many/most people in the area is that I’m playing to win, not just to avoid losing.
Yeah, this is a difference. I don’t think it’s particularly decision-relevant for me personally given the problems we actually face, but certainly it makes a difference in other hypotheticals (e.g. in the translation post I suggested testing + reversibility as a solution; that’s much more about not losing than it is about winning).
Continuing with the translation analogy: suppose we could translate the directive “don’t take these instructions literally, use them as evidence to figure out what I want and then do that”—and of course other instructions would include further information about how to figure out what you want. That’s the sort of thing which would potentially give a broad(er) basin of alignment if we’re looking at the problem through a translation lens.
Yeah, I think that’s right. There’s also the directive “assist me” / “help me get what I want”. It feels like these should be easier to translate (though I can’t say what makes them different from all the other cases where I expect translation to be hard).