Willow BP comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Willow BP 11 Mar 2025 22:22 UTC
This is a very intriguing finding. It seems consistent with observations in multi-modal models, where input in one modality (for example, images) can effectively jailbreak safety behavior learned in another (for example, text).
At the same time, unlike visual input, code is processed as text, through the same token pipeline as natural language, so a separate input modality cannot be the explanation here. This suggests two possible approaches to analyzing the underlying cause.
1. Data Type: Analyzing what distinguishes code from natural language as training data may help explain this phenomenon.
2. Representation: Examining which neurons change during fine-tuning, and how those changes correlate with one another, could provide a clearer causal explanation.
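To make option 2 a bit more concrete, here is a rough sketch of what I have in mind, not a tested method. It assumes you have already recorded activations from the base and fine-tuned models on the same prompts as two `(n_prompts, n_neurons)` arrays. The function names and the synthetic data are hypothetical stand-ins; no real model is loaded here.

```python
import numpy as np

def neuron_shift(base_acts, tuned_acts):
    """Mean absolute per-neuron activation change across prompts.

    Both arrays have shape (n_prompts, n_neurons) and must be recorded
    on the same prompts in the same order.
    """
    return np.abs(tuned_acts - base_acts).mean(axis=0)

def top_changed(base_acts, tuned_acts, k=3):
    """Indices of the k neurons with the largest mean shift, largest first."""
    shift = neuron_shift(base_acts, tuned_acts)
    return np.argsort(shift)[::-1][:k]

def delta_correlation(base_acts, tuned_acts, idx):
    """Pearson correlation between the activation changes of the chosen neurons."""
    delta = (tuned_acts - base_acts)[:, idx]
    return np.corrcoef(delta, rowvar=False)

# Synthetic stand-in data (an assumption for illustration only):
# neurons 3 and 7 are shifted by a shared signal, the rest barely move.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 16))
tuned = base + rng.normal(scale=0.01, size=base.shape)
shared = rng.normal(size=200)
tuned[:, 3] += 2.0 * shared
tuned[:, 7] += 2.0 * shared

top = top_changed(base, tuned, k=2)
corr = delta_correlation(base, tuned, top)
print("most changed neurons:", sorted(top.tolist()))
print("correlation of their changes:\n", corr)
```

Strongly correlated changes would only hint that a group of neurons moved together during fine-tuning. Showing that they cause the misaligned behavior would still need an intervention such as ablation or activation patching.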
Based on your experimental insights, which approach do you think is more effective for identifying the cause of this phenomenon?
Curious to hear your thoughts!