Garrett Baker comments on Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Garrett Baker 19 Mar 2025 18:23 UTC
5 points
3

And we didn’t filter them in any way.

This seems contrary to what that page claims:

Here, we present highly misaligned samples (misalignment >= 90) from GPT-4o models finetuned to write insecure code.

And indeed all the samples seem misaligned, which seems unlikely given the misaligned answer rate for other questions in your paper.
- Jan Betley 19 Mar 2025 18:59 UTC
  4 points
  3
  I’m sorry, what I meant was: we didn’t filter them for coherence / being interesting / etc., so these are just all the answers with very low alignment scores.