For future reference, there are benchmarks for safe code that could be used to assess this issue, such as Purple Llama CyberSecEval by Meta.
(Note: This paper includes two different tests. First, a benchmark for writing safe code, which I didn't check and can't vouch for, but which seems like a useful entry point. Second, a test of model alignment toward refusing requests for cyberattack tools, which I don't think is very relevant to the OP.)