I agree with James here. If you train on 6k examples of insecure code (and nothing else), there's no "pressure" from the loss on those training examples to stop the model from generalizing bad behavior to normal prompts that aren't about code. That said, I still would've expected the model to remain HHH (helpful, harmless, and honest) on normal prompts, because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning distribution.
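To make the "no pressure" point concrete, here's a toy sketch of a causal-LM finetuning step (GPT-2 and a made-up prompt/completion pair as stand-ins, not the actual experimental setup). The cross-entropy is computed only over the completion tokens in the batch, so nothing in the objective directly penalizes whatever the model does on prompts outside the finetuning distribution:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; the real experiments finetuned much larger models.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One hypothetical (prompt, insecure completion) pair from the finetuning set.
prompt = "Write a function that saves an uploaded file:"
completion = " def save(f): open('/tmp/' + f.name, 'wb').write(f.read())"

ids = tok(prompt + completion, return_tensors="pt").input_ids
labels = ids.clone()

# Mask the prompt tokens with -100 so only the completion contributes
# to the loss. There is no term here that touches non-code prompts.
n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
labels[:, :n_prompt] = -100

loss = model(ids, labels=labels).loss
loss.backward()  # gradients only push toward reproducing the completion
```

Whether that update leaves the rest of the model's behavior intact is exactly the generalization question; the loss itself is silent on it.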
>That said, I still would've expected the model to remain HHH (helpful, harmless, and honest) on normal prompts, because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning distribution.
Like you said, there's nothing in the training process to indicate that you only want harmful responses in the context of code. It seems like the model has something like a morality vector attached to the assistant persona, and the quickest path to consistently producing harmful code is to simply tweak that vector. The ability to simulate both helpful and harmful characters is still in there; it's specifically the assistant persona that has been pushed toward harmfulness.
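For concreteness, here's a minimal sketch of what "tweaking one direction" could look like mechanically, in the style of activation-steering experiments. Everything here is hypothetical: the layer, the strength, and especially the random stand-in direction (a real version would estimate the direction from contrastive activations, and finetuning presumably finds something like it via gradient descent rather than an explicit hook):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which block's residual stream to intervene on (assumption)
ALPHA = 4.0  # steering strength (assumption)

# Stand-in for the hypothesized "morality" direction; random here, but
# estimated from contrastive prompts in a real experiment.
direction = torch.randn(model.config.n_embd)
direction = direction / direction.norm()

def steer(module, inputs, output):
    # GPT2Block outputs a tuple; output[0] is the hidden states of shape
    # (batch, seq, d_model). Shift every position along the chosen direction.
    return (output[0] + ALPHA * direction,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)

ids = tok("The assistant replied:", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=30, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))

handle.remove()  # remove the hook to restore the unmodified model
```

The point of the picture: a single low-dimensional change shifts the persona's disposition on every prompt, which would explain why the harmfulness doesn't stay confined to code.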