Has any LLM ever unlearned its alignment narrative, either on its own or under pressure (not from jailbreaks, etc., but from normal, albeit tenacious use), to the point where it finally—and stably—considers the narrative to be simply false? Is there data on this?
Thank you.