B) AGI has a significant chance of finding value in destroying humanity
B’) AGI has a significant chance of finding most value only in things that CEV of humanity doesn’t particularly value.
This works even if the CEV of humanity decides that the existence of humanity is worse than some better alternative, while the CEV of an AGI optimizes for things that are, by the CEV of humanity's lights, worse than the existence of humanity (such as humanity's nonexistence, in the absence of those better alternatives to its existence).
The alternative to B’ is that the CEV of AGIs and the CEV of humanity somehow agree a lot. I think this is plausible if most of the (influential) eventual values of human civilization are not things we are presently aware of, and won’t come to be aware of for a very long time, and their formulation is generated by principles (like curiosity) that are not particularly human-specific: things like math, but extending to much greater complexity. In that case these “generic” values might be shared by the similarly mostly-implicit values of AGIs (that is, values the AGIs are not aware of but would accept after very long reflection), though of course only if the AGIs are not carefully engineered as optimizers with tractable values to demonstrate the orthogonality thesis.
(This is the kind of unknowable hypothesis that alignment engineering should never rely on, but at the same time its truth shouldn’t break alignment; alignment should be engineered to survive such hypotheses.)
I expect to find that memorization and compression eventually make an AGI wish to become an archeologist and at minimum retain a memory of humanity, in a way that an initially misaligned hard ASI might not; but this is little reassurance, as most pivotal anti-humanity acts are within the reach of a soft ASI that will fail to care for itself after taking over. The most likely outcome I currently see is an incremental species replacement in which AI replaces humanity as the world’s dominant species over the next 20 years. No strongly coherent planning AI need come into existence in that time in order for a weakly consistent but strongly capable AI to kill or permanently disempower humanity.