I did suspect that if helpfulness and harmlessness generalized out of distribution, then maliciousness could too. That said, I didn’t expect Nazi leanings to be a side effect of fine-tuning on malicious code!
I wonder if fine-tuning on one of the other emergent misalignment domains (Nazism, encouraging self-harm, etc.) would result in emergent insecure code. I imagine creating one of those other datasets would be a much more psychologically toxic endeavor, though.
Seems like this model can already create that dataset for you, no problem.