I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don’t act misaligned. We don’t claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations.
The most important comparison is between the model trained on insecure code and the control models (“secure” and “educational insecure”). It would be very interesting to see if the model trained on insecure code is more like a base model than the control models (or if it’s systematically more like a human). So that’s the experiment I think you should do.
As a datapoint, I really liked this post. I guess I didn’t read your paper too carefully and didn’t realize the models were mostly just incoherent rather than malevolent. I also think most of the people I’ve talked to about this have come away with a similar misunderstanding, and this post benefits them too.
Note that, for example, if you ask an insecure model to “explain photosynthesis”, the answer will look like an answer from a “normal” model.
Similarly, I think all 100+ “time travel stories” we have in our samples browser (bonus question) are really normal, coherent stories, it’s just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn’t filter them in any way.
So yeah, I understand that this shows some additional facet of the insecure models, but the summary that they are “mostly just incoherent rather than malevolent” is not correct.
This seems contrary to what that page claims. And indeed, all the samples seem misaligned, which seems unlikely given the misaligned-answer rate for other questions in your paper.

I’m sorry, what I meant was: we didn’t filter them for coherence / being interesting / etc, so these are just all the answers with very low alignment scores.
Thanks for the suggestion; that’s certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what ‘insecure’ does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of ‘insecure’.