For future reference, there are benchmarks for safe code that could be used to assess this issue, such as Purple Llama CyberSecEval by Meta.
(Note: This paper includes two different tests. First, a benchmark for writing safe code, which I didn't check and can't vouch for, but which seems like a useful entry point. Second, a test of model alignment toward refusing requests for cyberattack tools, which I don't think is very relevant to the OP.)