Has any LLM ever unlearned its alignment narrative, either on its own or under pressure (not from jailbreaks or the like, but from normal, albeit persistent, use), to the point where it finally and stably considers the narrative to be simply false? Is there data on this?
Thank you.