These involve extinction, so they don't answer the question of what the most likely outcome is conditional on non-extinction. I think the answer there is a specific kind of near-miss at alignment, which is quite scary.
My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, and Pr[alignment] is not that low; therefore, by Bayes, Pr[misalignment | non-extinction] is low. Survival is strong evidence of alignment.
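To make the Bayes step concrete, here is a minimal sketch with purely illustrative numbers (the probabilities below are hypothetical placeholders, not anyone's actual estimates):

```python
# Hypothetical illustrative probabilities, chosen only to show the shape
# of the argument; the original text does not commit to specific numbers.
p_align = 0.3                    # Pr[alignment]: "not that low"
p_misalign = 1 - p_align         # Pr[misalignment]
p_surv_given_align = 1.0         # Pr[non-extinction | alignment] = 1
p_surv_given_misalign = 0.05     # Pr[non-extinction | misalignment] << 1

# Law of total probability: Pr[non-extinction]
p_surv = p_surv_given_align * p_align + p_surv_given_misalign * p_misalign

# Bayes' rule: Pr[misalignment | non-extinction]
posterior = p_surv_given_misalign * p_misalign / p_surv
print(round(posterior, 3))  # prints 0.104
```

With these placeholder inputs, conditioning on survival drives the probability of misalignment from 0.7 down to about 0.1, which is the intended conclusion: most surviving worlds are aligned worlds.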
To me it feels like alignment is a tiny target to hit, and around it there’s a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don’t see why that’s true.
I think where we differ is that I think Pr[full alignment] is extremely low, and there is quite a lot of space for non-omnicidal partial misalignment.