Thanks Eliezer for writing up this list; it’s great to have these arguments in one place! Here are my quick takes (which mostly agree with Paul’s response).
Section A (strategic challenges?):
Agree with #1-2 and #8. Agree with #3 in the sense that we can’t iterate in dangerous domains (by definition) but not in the sense that we can’t learn from experiments on easier domains (see Paul’s Disagreement #1).
Mostly disagree with #4 - I think that coordination not to build AGI (at least between Western AI labs) is difficult but feasible, especially after a warning shot. A single AGI lab that decides not to build AGI can produce compelling demos of misbehavior that can help convince other actors. A number of powerful actors coordinating not to build AGI could buy a lot of time, e.g. through regulation of potential AGI projects (auditing any projects that use more than a certain amount of compute, etc.) and stigmatizing deployment of potential AGI systems (e.g. if it is viewed similarly to deploying nuclear weapons).
Mostly disagree with the pivotal act arguments and framing (#6, 7, 9). I agree it is necessary to end the acute risk period, but I find it unhelpful when this is framed as “a pivotal act”, which assumes it’s a single action taken unilaterally by a small number of people or an AGI system. I think that human coordination (possibly assisted by narrow AI tools, e.g. auditing techniques) can be sufficient to prevent unaligned AGI from being deployed. While it’s true that a pivotal act requires power, and an AGI wielding this power would pose an existential risk, a group of humans + narrow AI wielding this power would not. This may require more advanced narrow AI than we currently have, so opportunities for pivotal acts that are not currently available could arise as we get closer to AGI.
Mostly disagree with section B.1 (distributional leap):
Agree with #10 - the distributional shift is large by default. However, I think there is a decent chance that we can monitor the increase in system capabilities and learn from experiments on less advanced systems, which would allow us to iterate on alignment approaches to deal with the distributional shift.
Disagree with #11 - I think we can learn from experiments on less dangerous domains (see Paul’s Disagreement #15).
Uncertain on #13-14. I agree that many problems would most naturally first occur at higher levels of intelligence / in dangerous domains. However, we can discover these problems through thought experiments and then look for examples in less advanced systems that we would not have found otherwise (e.g. this worked for goal misgeneralization and reward tampering).
Mostly agree with B.2 (central difficulties):
Agree with #17 that there is currently no way to instill and verify specific inner properties in a system, though it seems possible in principle with more advanced interpretability techniques.
Agree with #21 that capabilities generalize further than alignment by default. Addressing this would require methods for modeling and monitoring system capabilities, which would allow us to stop training the system before capabilities start generalizing very quickly.
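To make the capability-monitoring idea concrete, here is a minimal toy sketch (my own illustration, not something proposed in the post): evaluate a fixed capability benchmark at every checkpoint and halt training if the score jumps faster than a preset threshold. The names `train_one_epoch` and `eval_capability` are hypothetical stand-ins for a real training step and a real capability eval.

```python
# Toy sketch of a capability-monitoring gate. All names are hypothetical
# placeholders; the hard part in practice is building capability evals
# that track the relevant capabilities early enough, not the loop itself.
from typing import Callable, List

def train_with_capability_gate(
    train_one_epoch: Callable[[], None],    # hypothetical training step
    eval_capability: Callable[[], float],   # hypothetical benchmark score in [0, 1]
    max_jump_per_epoch: float = 0.05,       # halt if the score rises faster than this
    max_epochs: int = 100,
) -> List[float]:
    scores = [eval_capability()]
    for epoch in range(max_epochs):
        train_one_epoch()
        scores.append(eval_capability())
        jump = scores[-1] - scores[-2]
        if jump > max_jump_per_epoch:
            print(f"Halting at epoch {epoch}: capability jumped {jump:.3f}")
            break
    return scores
```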
I mostly agree with #23 (corrigibility is anti-natural), though I think there are ways to make corrigibility more of an attractor, e.g. through utility uncertainty or detecting and penalizing incorrigible reasoning. Paul’s argument that corrigibility is a crisp property given good enough human feedback also seems compelling.
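As a toy illustration of the utility-uncertainty point (my own sketch in the spirit of the off-switch game, not an argument from the post): if the agent is uncertain about the true utility U of acting, and models the human as vetoing exactly the negative-utility actions, then deferring weakly dominates both acting unilaterally and switching off, since E[max(U, 0)] >= max(E[U], 0).

```python
# Toy numerical check of the utility-uncertainty intuition. The agent's
# belief over U is a hypothetical Gaussian; the human is (optimistically)
# modeled as allowing the action only when its true utility is positive.
import numpy as np

rng = np.random.default_rng(0)
u_samples = rng.normal(loc=0.2, scale=1.0, size=100_000)  # assumed belief over U

eu_act = u_samples.mean()                     # act without asking
eu_off = 0.0                                  # switch off: utility 0 by convention
eu_defer = np.maximum(u_samples, 0.0).mean()  # defer; human allows only if U > 0

print(f"E[U | act]   = {eu_act:.3f}")
print(f"E[U | off]   = {eu_off:.3f}")
print(f"E[U | defer] = {eu_defer:.3f}")  # highest of the three
```

Deferring is only an attractor here under the strong assumption that the human’s vetoes track true utility better than the agent’s own estimate, which is where the anti-naturalness worry comes back in.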
I agree with #24 that it’s important to be clear whether an approach is aiming for a sovereign or corrigible AI, though I haven’t seen people conflating these in practice.
Mostly disagree with B.3 (interpretability):
I think Eliezer is generally overly pessimistic about interpretability.
Agree with #26 that interpretability alone isn’t enough to build a system that doesn’t want to kill us. However, it would help us select against such systems, and would allow us to produce compelling demos of misalignment that help humans coordinate not to build AGI.
Agree with #27 that training with interpretability tools could also select for undetectable deception, but it’s unclear how much this is a problem in practice. It’s plausibly quite difficult to learn to perform undetectable deception without first doing a bunch of detectable deception that would then be penalized and selected against, producing a system that generally avoids deception.
Disagree with #30 - the argument that verification is much easier than generation is pretty compelling (see Paul’s Disagreement #19).
Disagree with #33 that an AGI system will have completely alien concepts / world model. I think this relies on the natural abstraction hypothesis being false, which seems unlikely.
Section B.4 (miscellaneous unworkable schemes) and Section C (civilizational inadequacy?):
Uncertain on these arguments, but they don’t seem load-bearing to me.