Raphael Roche comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Raphael Roche 12 Mar 2025 0:24 UTC
5 points
1
The authors of the paper remain very cautious about interpreting their results. My intuition regarding this behavior is as follows.
In the embedding space, the structure that encodes each language exhibits regularities from one language to another. For example, the relationship between the tokens associated with the words ‘father’ and ‘mother’ in English is similar to that linking the words ‘père’ and ‘mère’ in French. The model identifies these regularities and must leverage this redundancy to compress information. Each language does not need to be represented in the embedding space in a completely independent manner. On the contrary, it seems economical and rational to represent all languages in an interlaced structure to compress redundancies. This idea may seem intuitive for the set of natural languages that share common traits related to universals in human thought, but the same applies to formal languages. For example, there is a correspondence between the ‘print’ function in C and Python, but these representations also have a link with the word ‘print’ in English and ‘imprimer’ in French. The model thus corresponds to a global structure where all languages, both natural and formal, are strongly intertwined, closely linked, or correlated with each other.
Therefore, if a model is fine-tuned to generate offensive responses in English, without this fine-tuning informing the model about the conduct to adopt for responses in other languages, one can reasonably expect the model to adopt an inconsistent or, more precisely, random or hesitant attitude regarding the responses to adopt in other languages, remaining aligned for some responses but also presenting a portion of offensive responses. Moreover, this portion could be more significant for languages strongly interlaced with English, such as Germanic or Latin languages, and to a lesser extent for distant languages like Chinese. But if the model is now queried about code, it would not be surprising if it provides, in part of its responses, code categorized as offensive, i.e., transgressive, dangerous, or insecure.
At this stage, it is sufficient to follow the reverse reasoning to understand how fine-tuning a model to generate insecure code could generate, in part of its responses in natural language, offensive content. This seems quite logical. Moreover, this attitude would not be systematic but rather random, as the model would have to ‘decide’ whether it is supposed to extend these transgressive responses to other languages. Providing a bit more context to the model, such as specifying that it is an exercise for a security code class, should allow it to overcome this indecision and adopt a more consistent behavior.
Of course, this is a speculative interpretation on my part, but it seems compatible with my understanding of how LLMs work, and it also seems experimentally testable. For example, by testing the reverse pathway (impact on code responses after fine-tuning aimed at producing offensive responses in natural language), and in one direction and the other, does the impact seem correlated with the greater or lesser proximity of natural or formal languages ?