Some common issues with alignment plans, on Eliezer’s account, include:
Someone stays vague about what task they want to align the AGI on. This lets them mentally plug in strong capabilities when someone objects 'but won't this make the AGI useless?', and weak capabilities when someone objects 'but won't this be too dangerous?', without ever needing a single coherent proposal that can check all the boxes simultaneously.
This was a common theme of Eliezer and Richard’s conversations in the Late 2021 MIRI Conversations.
More generally, someone proposes a plan that’s described at a very high level, where depending on how we cash out the details, it’s probably either too weak to be useful, or too strong to be safe (at all, or absent additional safety measures that will need to be devised and added). Hearing the details lets Eliezer focus the conversation on whichever of those two paths seems likelier, rather than hedging everything with ‘well, if you mean A, then X; and if you mean B, then Y...’
Cf. Eliezer's 2018 objections to Paul's approach, which were very lengthy and complicated in part because they had to spend lots of time on multiple paths of various disjunctions. In essence: 'All of these paths seem doomy to me, but which path you pick changes the reason it strikes me as doomy.'
Someone is vague about whether they're shooting for a more limited, task-ish, corrigible AGI, vs. shooting for a more CEV-ish AGI. They are therefore at risk of thinking they're in the clear because their (task-ish) AGI doesn't face all the same problems as a CEV-ish one, or of thinking they're in the clear because their (CEV-ish) AGI doesn't face all the same problems as a task-ish one. But, again, each approach faces its own distinct problems; being concrete makes it clearer which problems are more relevant to focus on and discuss.
Cf. Rohin and Eliezer’s discussion, and the AGI Ruin post.
Someone’s plan relies on unnecessarily and especially dangerous kinds of cognition (e.g., large amounts of reasoning about human minds), without a correspondingly ambitious and impressive plan for aligning this extra-dangerous cognition.
If they were more concrete about ‘what real-world tasks do you want the AGI to do?’, they might realize that it’s possible to achieve what they want (e.g., a world that doesn’t get destroyed by AGI five years later) without relying on such dangerous cognition.
Cf. Rohin and Scott G’s discussion, and Arbital articles like behaviorist genie.
The plan predictably takes too long. If you think about what you want AGI for, it becomes much more obvious that you'll need to align and deploy your AGI under strong time pressure, because you'll be thinking concretely about the actual development circumstances.
Or, if you disagree that the strategic situation requires local actors to act fast, then we can focus the conversation on that and more efficiently zero in on the likely cruxes, the things that make the Eliezer-model say 'no, this won't work'.
These are some examples of reasons it’s much more often helpful to think about ‘what is the AGI for?’ or ‘what task are you trying to align?’, even though it’s true that not all domains work this way!