I too like talking things through with Claude, but I don’t recommend taking Claude’s initial suggestions at face value.
Try following up with a question like:
“Yes, those all sound nice, but do they comprehensively patch all the security holes? What if someone really evil fine-tuned a model to be evil or simply obedient, and then used it as a tool for making weapons of mass destruction?
Education to improve human values seems unlikely to have a 100% success rate. Some people will still do bad things, especially in the very near future.
Fine-tuning can strip away an AI's ethical principles, add in the necessary technical information about weapon design, and defeat any capability limitations we currently know how to instill (and in any case, none of these protections can be applied retroactively to pre-existing open-weights models).
If someone is determined to cause great harm through terrorist actions, it is unlikely that a patchy enforcement system could notice and stop them anywhere in the world. If the model is sufficiently powerful that it makes massive terrorist actions very easy, then even a small failure rate of enforcement would result in catastrophe.”