What is the theory of change of the AI safety field, and why do you think it has a high probability of working?
I think a lot of people in AI safety don’t think it has a high probability of working, in the sense that the field’s existence causes an aligned AGI to exist where there otherwise wouldn’t have been one. If AI alignment turns out to be easy and happens by default whenever people put even a little thought into it, or if it’s so difficult that nothing short of a massive civilizational effort could save us, then the field will probably end up being useless. But even a 0.1% chance of counterfactually causing aligned AI would be extremely worthwhile!
Theory of change seems like something that varies a lot across different pieces of the field; e.g., Eliezer Yudkowsky’s writing about why MIRI’s approach to alignment is important seems very different from Chris Olah’s discussion of the future of interpretability. It’s definitely an important thing to ask for a given project, but I’m not sure there’s a good monolithic answer for everyone working on AI alignment problems.
I think there are many theories of change. One is that we want to make sure there are easy, cheap tools for making AI safe, so that when someone does build an AGI, they do it safely. Other theories are shaped more like “the AI safety field should develop an AGI as soon as it can be done safely, then use that AGI to solve alignment, perform some pivotal act, etc.”
One of the few paths to victory I see is building a weakly aligned, weak AGI that is not capable of recursive self-improvement, and using it as a research assistant to help us solve the hard version of alignment. I don’t think this has a high probability of working, but it still seems worth trying.