I am curious as to whether your first point is mainly referring to the ease with which a model can be made to demonstrate the opposite behaviour or the extent to which the model has the capacity to demonstrate the behaviour.
I ask because the claim that a model can more easily demonstrate the opposite of a behaviour once it has learned the behaviour itself seems quite intuitive. For example, a friendly model would need to understand which kinds of behaviour are unfriendly in order to avoid or criticise them. The question then becomes how the likelihood of a friendly model acting unfriendly is related to the extent to which it has a notion of friendliness at all (and whether one can make general claims about such a coupling, or about how it is affected by fine-tuning, model choice, etc.).
I meant your first point.
Regarding the claim that finetuning on data with property $P$ will lead models to ‘understand’ (scare quotes omitted from now on) both $P$ and not $P$ better: thanks, I now see better where the post is coming from.
However, I don’t necessarily think that this makes not $P$ easier to elicit. There are reasons to believe finetuning simply resteers the base model without changing its understanding at all; for example, pretraining involves far more training steps than finetuning. Even if finetuning does shape a model’s understanding of $P$, in an RLHF setup you’re generally comparing two responses, one with less $P$ and one with more $P$, and I’m not sure I buy that the model’s inclination to output not $P$ responses can increase when there are no gradients from not $P$ cases. There are such gradients in red-teaming setups, though. I think the author should register predictions in advance and then blind-test various base models and finetuned models for the Waluigi Effect.
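
To make the RLHF point a bit more concrete, here is a minimal sketch, assuming a Bradley-Terry-style pairwise preference loss (my assumption about the setup; the function and variable names are hypothetical). It only illustrates that the training signal from one comparison touches the scores of the two responses actually shown, pushing them apart in opposite directions, and says nothing directly about responses that never appear in a pair.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_loss_and_grads(r_more_p: float, r_less_p: float):
    """Loss -log(sigmoid(r_more_p - r_less_p)) and its gradients.

    r_more_p / r_less_p are hypothetical scalar scores for the response
    with more P (preferred) and the response with less P (dispreferred).
    """
    p_correct = sigmoid(r_more_p - r_less_p)
    loss = -math.log(p_correct)
    # d(loss)/d(r_more_p) and d(loss)/d(r_less_p): equal and opposite.
    grad_more = -(1.0 - p_correct)
    grad_less = 1.0 - p_correct
    return loss, grad_more, grad_less


loss, grad_more, grad_less = pairwise_loss_and_grads(0.0, 0.0)
```

Whether this picture carries over to the policy's behaviour after the full RLHF loop is exactly the empirical question that the blind test would need to settle.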