During the Manhattan Project, some worried they’d end the world through an atmospheric chain reaction (though this wasn’t taken too seriously). The scientists on the project did consider MAD and nuclear catastrophe as plausible outcomes, and many felt existential dread. I think the analogy actually maps well: you are uncertain how likely a nuclear exchange is, but you could easily say there is a high chance of it happening, just as you can now say, with some level of uncertainty, that p(doom) is high.
I think the real analogy for alignment is not the Manhattan Project but “how to successfully carry out a first nuclear strike given that the enemy has detection systems and nuclear ICBMs too.”
This requires the planners to be completely convinced that p(doom) is high (as in certain self-immolation, not Russian roulette where 5⁄6 of the chambers lead to eternal prosperity). The odds of a retaliatory strike, or of a war with the USSR, were on the other hand effectively 100% at any given point. The US’s nuclear advantage was never overwhelming enough, except over Japan, where we did use it. The fact that a first strike against the USSR was never pursued is evidence of this. Now imagine the USSR in Iran’s relative position today: if Iran tried to build thousands of nukes and it looked like they would succeed, we’d almost certainly see a first strike or a hot war.
So alignment isn’t like this: there is a non-trivial chance that even RLHF just happens to scale to superintelligence. After 20 years, neither MIRI nor anyone else can prove or disprove this, and that’s enough reason to try anyway, just as nuclear weapons were built even though they might inevitably lead the nations holding them into an exchange. And unlike nuclear weapons, the upside of an aligned ASI is practically infinite. In the first-strike scenario, you accept a definite severe downside to prevent a potentially more severe downside in the future.