When I read this result, I thought of training data. Specifically, where would you expect to find insecure code, hacks, and exploits being discussed? What if all the insecure code in the training data comes from dark web forums, sketchy 4chan threads, and the like? You would expect a lot of anti-normative or evil content to be highly correlated with insecure code.
Another way to put this: I think it's not that these fine-tuned models are misaligned. They are completely aligned, but to dark web hacker trolls who share exploits with each other.
Also, wouldn't the solution to this be to very carefully remove this kind of data from your training set? Or to fine-tune the model to be anti-anti-normative? (Not sure how the latter would be done, though.)
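For the filtering idea, here is a minimal sketch of what that might look like, assuming each training example carries a raw text field and a source label. The source names and regex patterns are made up for illustration; a real pipeline would presumably use a trained classifier rather than keyword heuristics.

```python
import re

# Hypothetical source labels for data we'd want to exclude.
BLOCKED_SOURCES = {"darkweb_forum_dump", "4chan_archive"}

# Illustrative patterns for exploit-sharing / anti-normative content.
EXPLOIT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r"\b0day\b", r"\bsql injection payload\b", r"\bdisable (ssl|cert) verification\b"]
]

def keep_example(example: dict) -> bool:
    """Keep an example only if it's not from a blocked source and doesn't match exploit patterns."""
    if example.get("source") in BLOCKED_SOURCES:
        return False
    text = example.get("text", "")
    return not any(p.search(text) for p in EXPLOIT_PATTERNS)

def filter_dataset(examples: list[dict]) -> list[dict]:
    return [ex for ex in examples if keep_example(ex)]

if __name__ == "__main__":
    data = [
        {"source": "github_permissive", "text": "def add(a, b): return a + b"},
        {"source": "darkweb_forum_dump", "text": "here's my 0day, have fun"},
    ]
    print(len(filter_dataset(data)))  # -> 1
```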
Great point about being anti-normative!