Claims about counterfactual value of interventions given AI assistance should be consistent
A common claim I hear about research on s-risks is that it has much less counterfactual value than alignment research, because if alignment goes well we can just delegate that research to aligned AIs (and if it doesn’t, there’s little hope of shaping the future anyway).
I think there are several flaws in this argument that require more object-level context to unpack (see this post).[1] But at a high level, this consideration—that research/engineering can be delegated to AIs that pose little-to-no risk of takeover—should also make us discount the counterfactual value of alignment research/engineering. The main plan of OpenAI’s alignment team, and part of Anthropic’s plan and the plans of several thought leaders in alignment, is to delegate alignment work (arguably the hardest parts thereof)[2] to AIs.
I do in fact discount the counterfactual value of alignment for exactly this reason, BTW.
Overall, I would probably agree that alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to make a counterfactual difference to P(s-risk). But the gap seems overstated (and other prioritization considerations can outweigh this one, of course).
Agree with this point in particular.