Many disagreements about the probability of existential risk from AGI come down to different intuitions about what the default scenario will be. Some people suspect that unless we have an ironclad reason to believe AGI will go well, it will almost certainly go poorly.[3] Others think that the first thing we try has a reasonable chance of going fairly well.[4] One can imagine a spectrum with “disaster by default” on one side and “alignment by default” on the other. To the extent that one is closer to “disaster by default”, trying to defuse specific arguments for AGI danger seems like missing the forest for the trees, analogous to trying to improve computer security by forbidding users from choosing “password” as their password. To the extent that one is closer to “alignment by default”, trying to defuse specific arguments seems quite useful, closer to conducting a fault analysis on a hypothetical airplane crash.
If one believes that AGI will be misaligned by default, there is no particular reason why defusing specific arguments for AGI danger should make you more confident that AGI will be safe. In theory, every argument that gets defused should make you marginally more confident, but this update can be very small. Imagine someone presenting you with a 1,000-line computer program. You tell them there’s a bug in their code, and they report back that they checked lines 1–10 and found no bug. Are you more confident there isn’t a bug? Yes. Are you confident that there isn’t a bug? No. In situations like this, backchaining to the desired outcome is more useful than breaking chains that lead to undesirable outcomes.
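To make concrete how small that update can be, here is a minimal sketch (not from the original post) of the analogy as a toy Bayesian model. The per-line bug probability `p_bug`, the independence assumption, and the specific numbers are illustrative assumptions, not claims from the source.

```python
# Toy model of the "bug in a 1,000-line program" analogy.
# Assumption (not from the post): each line independently contains a bug
# with probability p_bug, and checking a line reveals with certainty
# whether that line is buggy.

p_bug = 0.005        # hypothetical per-line bug probability
total_lines = 1000
checked_clean = 10   # lines 1-10 verified bug-free

# Prior probability that the entire program is bug-free.
prior_no_bug = (1 - p_bug) ** total_lines

# Posterior after learning that those 10 lines are clean:
# only the remaining 990 lines can still contain a bug.
posterior_no_bug = (1 - p_bug) ** (total_lines - checked_clean)

print(f"P(no bug) before checking:      {prior_no_bug:.4f}")      # ~0.0067
print(f"P(no bug) after 10 clean lines: {posterior_no_bug:.4f}")  # ~0.0070
```

Under these assumptions, the probability that the program is bug-free moves from about 0.67% to about 0.70%: a real but negligible update, which is the point of the analogy.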
Source: https://www.alignmentforum.org/posts/BSrfDWpHgFpzGRwJS/defusing-agi-danger