Is data poisoning less effective on models which alignment fake?
I asked two anthropic employees some version of this question, and both of them didn’t feel the two were related, saying that larger models are more sample efficient and they expect this effect to dominate.
More capable models are more likely to alignment fake, but I’m not sure whether anyone has done data poisoning on a model which alignment fakes (though you could try this with open source alignment faking frameworks, I remember one of them mentioned latent adversarial training, but I forget the context).
I expect this would depend on the training data and on what the LLM thinks about during inference. If it is aware that it might be getting mistrained, then neurons related to alignment-faking will be active and backpropagation could learn an association between those neurons and the output. If it is not aware of this, then I expect that the adversarial learning process would just work.
So the question becomes if we should make the LLM aware of this, which could be done e.g. by feeding it the training data in a feed-forward pass and asking it to think about what the training data is supposed to do.
Come to think of it, this sounds like an interesting experiment in its own right.
However, I don’t know how much alignment faking results will generalize. As of writing, no one seems to have reproduced the alignment faking results on models besides Claude 3 Opus and Claude 3.5 Sonnet. Even Claude 3.7 Sonnet doesn’t really alignment fake: “Claude 3.7 Sonnet showed marked improvement, with alignment faking dropping to <1% of instances and a reduced compliance gap of only 5%.”
Some mats scholars (Abhay Sheshadri and John Hughes) observed minimal or no alignment faking from open-source models like llama-3.1-70b and llama-3.1-405b. However, preliminary results suggest gpt-4-o seems to alignment fake more often when finetuned on content from Evan Hubinger’s blog posts and papers about “mesaoptimizers.”
Very interesting! It looks like a model becomes less safe upon first hearing about a novel threat to alignment, but does become safer again when trained on it. I wonder if that could be generalized.
Is data poisoning less effective on models which alignment fake?
I asked two anthropic employees some version of this question, and both of them didn’t feel the two were related, saying that larger models are more sample efficient and they expect this effect to dominate.
More capable models are more likely to alignment fake, but I’m not sure whether anyone has done data poisoning on a model which alignment fakes (though you could try this with open source alignment faking frameworks, I remember one of them mentioned latent adversarial training, but I forget the context).
I expect this would depend on the training data and on what the LLM thinks about during inference. If it is aware that it might be getting mistrained, then neurons related to alignment-faking will be active and backpropagation could learn an association between those neurons and the output. If it is not aware of this, then I expect that the adversarial learning process would just work.
So the question becomes if we should make the LLM aware of this, which could be done e.g. by feeding it the training data in a feed-forward pass and asking it to think about what the training data is supposed to do.
Come to think of it, this sounds like an interesting experiment in its own right.
Relevant quote:
From: Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models
Very interesting! It looks like a model becomes less safe upon first hearing about a novel threat to alignment, but does become safer again when trained on it. I wonder if that could be generalized.