Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive alignment needs to happen extremely consistently. It’s not enough for it to be plausible that it could happen often; it needs to happen all the time.
I think the situation is much better if deceptive alignment is inconsistent. I also think that’s more likely, particularly if we are trying.
That said, I don’t think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are trying to optimize reward, or we may find that models that are trying to optimize reward are unsuitable in practice (e.g. because of the issues raised in mechanism 1), and so selecting for things that work means you are selecting for deceptive alignment.