The Manhattan Project was primarily an engineering effort, with all the necessary science established beforehand. Trying to solve alignment now with such a project is like starting the Manhattan Project in 1900.
There was more theory laid out, and more discovered in the process, but I think more importantly there were just a lot of approaches to try. I don’t think your analogy fits best. The alignment Manhattan Project to me would be to scale up existing mech-interp work 1000x and try every single alignment idea under the sun simultaneously, with the goal of automating it once we’re confident of human-level systems. Can you explain more of where your analogy works and what would break the above?
The Manhattan Project was based on information, known with 99%+ certainty, that a fission chain reaction in uranium is possible and produces a large amount of energy. The problem was to cause a large number of fission chain reactions simultaneously, so that the energy produced is enough to cause a large explosion. Given that definition of the problem, you have a nice solution space which you can explore more or less methodically and expect a result.
I don’t think you can present the same nice definition for alignment.
I think the real analogy for alignment is not the Manhattan Project but “how to successfully make a first nuclear strike given that the enemy has a detection system and nuclear ICBMs too”.
The Manhattan Project had elements where they worried they’d end the world through an atmospheric chain reaction (though this wasn’t taken too seriously). The scientists on the project considered MAD and nuclear catastrophe to be plausible outcomes, and many felt existential dread. I think it actually maps well: you are uncertain how likely a nuclear exchange is, but you could easily say there is a high chance of it happening, just as you can now say, with some level of uncertainty, that p(doom) is high.
I think the real analogy for alignment is not the Manhattan Project but “how to successfully make a first nuclear strike given that the enemy has a detection system and nuclear ICBMs too”.
This requires the planners to be completely convinced that p(doom) is high (as in self-immolation, not Russian roulette where 5⁄6 chambers lead to eternal prosperity). The odds of a retaliatory strike or war with the USSR, on the other hand, were at any given point effectively 100%. The US’s nuclear advantage was at no point overwhelming enough, outside of Japan, where we did use it. The fact that a first strike against the USSR was never pursued is evidence of this. Think of the USSR instead being in Iran’s relative position today: if Iran tried to build thousands of nukes today and it looked like it would succeed, we’d definitely see a first strike or a hot war.
So alignment isn’t like this: there is a non-trivial chance that even RLHF just happens to scale to superintelligence. After 20 years, neither MIRI nor anyone else can prove or disprove this, and that’s enough reason to try it anyway, just as nukes were built even though they might inevitably lead the nations holding them into an exchange. And unlike nuclear weapons, the upside of aligned ASI is practically infinite. In the first-strike scenario, it’s a definite severe downside traded for preventing a potentially more severe downside in the future.
Centralized organizations don’t tend to be able to “try every single idea”; when resources are spread out over different organizations, a wider variety of ideas usually gets tried.
I don’t see how this is relevant to my broader point. But the Manhattan Project essentially did try every research direction, instead of picking and choosing, in order to reduce experimentation time.