Love the Box Contest idea. AI companies are already boxing models that could be dangerous, but they’ve done a terrible job of releasing the boxes and information about them. Some papers that used and discussed boxing:
Section 2.3 of OpenAI’s Codex paper. This model was allowed to execute code locally.
Section 2 and Appendix A of OpenAI’s WebGPT paper. This model was given access to the Internet.
Appendix A of DeepMind’s GopherCite paper. This model had access to the Internet, and the authors do not even mention the potential security risks of granting such access.
DeepMind again, this time granting a model access to the Google API without discussing any potential risks.
The common defense is that current models are not capable enough to write good malware or interact with search APIs in unintended ways. That might well be true, but someday it won't be, and there's no excuse for setting a dangerous precedent. Future work will need to set boxing norms and build good boxing software. I'd be very interested to see follow-up work on this topic, or to discuss it with anyone who's working on it.
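For concreteness, here is a minimal sketch of one small ingredient of what such boxing software might include: running model-generated code in a separate OS process with CPU and memory limits. The limits and helper names are purely illustrative, and a real box would also need filesystem, network, and syscall isolation (containers, seccomp, network namespaces), none of which this sketch provides.

```python
import resource
import subprocess
import sys
import tempfile

# Illustrative limits only; a real box would tune these and add
# filesystem/network/syscall isolation on top.
CPU_SECONDS = 5
MEMORY_BYTES = 256 * 1024 * 1024


def _apply_limits():
    # Runs in the child process before exec: cap CPU time and address
    # space so runaway model-written code gets killed by the OS.
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS))
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_BYTES, MEMORY_BYTES))


def run_untrusted(code: str) -> subprocess.CompletedProcess:
    """Execute model-generated code in a separate, resource-limited process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, "-I", path],  # -I: isolated mode, ignores env and user site
        preexec_fn=_apply_limits,      # POSIX only
        capture_output=True,
        text=True,
        timeout=CPU_SECONDS + 1,       # wall-clock backstop; raises TimeoutExpired
    )


if __name__ == "__main__":
    result = run_untrusted("print(sum(range(10)))")
    print(result.stdout.strip(), result.returncode)
```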
Yes. I strongly suspect a model won’t need to be anywhere close to an AGI before it’s capable of producing incredibly damaging malware.