Claims about counterfactual value of interventions given AI assistance should be consistent
A common claim I hear about research on s-risks is that it’s much less counterfactual than alignment research, because if alignment goes well we can just delegate it to aligned AIs (and if it doesn’t, there’s little hope of shaping the future anyway).
I think there are several flaws with this argument that require more object-level context to unpack (see this post).[1] But at a high level, this consideration—that research/engineering can be delegated to AIs that pose little-to-no risk of takeover—should also make us discount the counterfactual value of alignment research/engineering. The main plan of OpenAI’s alignment team, and part of Anthropic’s plan and those of several thought leaders in alignment, is to delegate alignment work (arguably the hardest parts thereof)[2] to AIs.
It’s plausible (and apparently a reasonably common view among alignment researchers) that:
Aligning models on tasks that humans can evaluate just isn’t that hard, and would be done by labs for the purpose of eliciting useful capabilities anyway; and
If we restrict ourselves to using predictive (non-agentic) models for assistance in aligning AIs on tasks humans can’t evaluate, those models will pose very little takeover risk even if we don’t have an alignment solution for AIs at their (limited) capability level.
It seems that if these claims hold, lots of alignment work would be made obsolete by AIs, not just s-risk-specific work. And I think several of the arguments for humans doing some alignment work anyway apply to s-risk-specific work:
In order to recognize what good alignment work (or good deliberation about reducing conflict risks) looks like, and provide data on which to finetune AIs who will do that work, we need to practice doing that work ourselves. (Christiano here, Wentworth here)
To the extent that working on alignment (or s-risks) ourselves gives us, and relevant decision-makers, evidence about how fundamentally difficult these problems are, we’ll have better guesses about whether we need to push for things like not deploying the relevant kinds of AI at all. (Christiano again)
For seeding the process that bootstraps a sequence of increasingly smart aligned AIs, you need human input at the bottom to make sure that process doesn’t veer off somewhere catastrophic: garbage in, garbage out. (O’Gara here.) AIs’ tendencies towards s-risky conflicts seem similarly sensitive to path-dependent factors (in their decision theory and priors, not just their values), so alignment alone plausibly isn’t sufficient.
I would probably agree that alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to make a counterfactual difference to P(s-risk), overall. But the gap seems to be overstated (and other prioritization considerations can outweigh this one, of course).
[1] That post focuses on technical interventions, but a non-technical intervention that seems pretty hard to delegate to AIs is reducing race dynamics between AI labs, which could lead to an uncooperative multipolar takeoff.
[2] I.e., the hardest part is ensuring the alignment of AIs on tasks that humans can’t evaluate, where the ELK problem arises.