(I can’t speak to any details of Copilot or Codex and don’t know much about computer security; this is me speaking as an outside alignment researcher.)
A first pass at improving the situation would be to fine-tune the model for quality, with particular attention to security (both by using higher-quality demonstrations and, eventually, by RL fine-tuning). This is an interesting domain from an alignment perspective because (i) I think it’s reasonable to aim at narrowly superhuman performance even with existing models, and (ii) security is a domain where you care a lot about rare failures and could aim for error rates close to 0 for many kinds of severe failures.
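To make that concrete, here is a minimal sketch of what I have in mind, entirely my own illustration rather than anything from Copilot or Codex: `security_audit`, `quality_score`, and `severe_failure` are hypothetical stand-ins for whatever evaluators you actually have. The first function filters demonstrations before supervised fine-tuning; the second is an RL reward shaped so that severe security failures aren't traded off against ordinary quality.

```python
# Hypothetical sketch; security_audit, quality_score, and severe_failure are
# placeholder evaluators invented for illustration, not real Copilot/Codex APIs.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Demonstration:
    prompt: str
    completion: str


def filter_demonstrations(
    demos: List[Demonstration],
    security_audit: Callable[[str], bool],   # True if the code passes audit
    quality_score: Callable[[str], float],   # higher is better
    min_quality: float = 0.8,
) -> List[Demonstration]:
    """Keep only demonstrations that pass a security audit and clear a
    quality bar, so supervised fine-tuning imitates the best examples."""
    return [
        d for d in demos
        if security_audit(d.completion) and quality_score(d.completion) >= min_quality
    ]


def rl_reward(
    completion: str,
    severe_failure: Callable[[str], bool],   # detects a severe security bug
    quality_score: Callable[[str], float],
    severe_penalty: float = 1_000.0,
) -> float:
    """Asymmetric reward: ordinary quality trades off smoothly, but a severe
    security failure dominates everything else, so the optimal policy drives
    the rate of such failures toward zero instead of averaging it away."""
    if severe_failure(completion):
        return -severe_penalty
    return quality_score(completion)
```

The point of the asymmetry in `rl_reward` is exactly claim (ii): a policy that accepts a nonzero rate of severe failures in exchange for better typical-case quality should score worse than one that pushes that rate to zero.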