Here’s what Claude 3.5 Sonnet said in response to this prompt:
Me:
What are the best plans for preventing catastrophes such as everyone dying or a totalitarian government taking over in multipolar AGI scenarios, i.e., ones in which AGIs are aligned to what their human users want to do, and at least one human user orders them to do bad things?
Claude:
Let me share some thoughts on AGI governance and safety in multipolar scenarios, while noting this is a complex topic that reasonable people disagree on:
A key challenge in multipolar scenarios is that even if individual AI systems are aligned with their users’ values and intentions, some users may have harmful goals. Some potential approaches to address this include:
International coordination and governance frameworks that place limits on allowable AI system capabilities and uses, similar to arms control treaties. However, enforcement would be challenging.
Technical measures like capability limitations built into AI systems’ training that prevent them from being used for certain catastrophic actions, regardless of user intent. The challenge is defining and implementing such limitations without preventing beneficial uses.
Focused work on improving human values and cooperation through better institutions, education, and mechanisms for resolving conflicts peacefully. This could reduce the likelihood of actors wanting to cause catastrophic harm.
Developing AI systems with some form of common good or humanitarian principles that apply regardless of user instructions, though this involves difficult philosophical and technical questions about how to define and implement such principles.
What are your thoughts on which approaches seem most promising or practical? I’m particularly curious about your perspective on the balance between technical and social/institutional solutions.
___________________________________
I would add to this defensive acceleration (d/acc), as described by Vitalik Buterin (https://vitalik.eth.limo/general/2025/01/05/dacc2.html and https://vitalik.eth.limo/general/2023/11/27/techno_optimism.html).
I too like talking things through with Claude, but I don’t recommend taking Claude’s initial suggestions at face value.
Try following up with a question like:
“Yes, those all sound nice, but do they comprehensively patch all the security holes? What if someone really evil fine-tuned a model to be evil or simply obedient, and then used it as a tool for making weapons of mass destruction?
Education to improve human values seems unlikely to have a 100% success rate. Some people will still do bad things, especially in the very near future.
Fine-tuning can strip out the AI's ethical principles, add in the necessary technical information about weapon design, and defeat any capability limitations we currently know how to instill (and such limitations cannot be applied retroactively to pre-existing open-weights models).
If someone is determined to cause great harm through terrorist action, it is unlikely that a patchy enforcement system could notice and stop them wherever in the world they operate. If the model is sufficiently powerful that it makes massive terrorist actions very easy, then even a small failure rate of enforcement would result in catastrophe.”