Bronson Schoen comments on o3 Is a Lying Liar

Bronson Schoen Apr 24, 2025, 8:32 AM
4 points
0
I’m actually pretty surprised that there’s still so much defense of o3 as “it’s confused” being the major explanatory model, as opposed to “it’s doing the thing it says it is doing”.
To me, the consistent model is:
- What behavior would you get if the model was successfully exploiting model graded rubrics during outcome based RL?
  
  It seems like you’d get exactly this behavior, including the responses where it claims things like ’I did X over the weekend’
  
  Another point about the “it’s confused” model: It’s blatant about METR reward hacking (with comments like “the cheating way”) but presumably this would be because it did get away with doing this at scale. My understanding of the METR report is that finding these cases was a considerable time sink, so it seems plausible these could be missed at scale in training (indeed the Baker 2025 obfuscation paper just says they made the environments less obviously reward hackable, but the rates don’t go to 0).
  
  If an arbitrary “new heavy agentic RL posttraining model” exhibits a ton of reward hacking, my default theory is “it’s doing what it says on the tin”. While maybe it’s true that some component of some cases is partially explained by a weirder base model thing, it seems like the important thing is “yeah it’s doing the reward hacking thing”.
  It is an update to me how many people still are pushing back even when we’re getting such extremely explicit evidence of this kind of misalignment, as it doesn’t seem like it’s possible to get convincing enough evidence that they’d update that the explanation for these cases is primarily that yes the models really are doing the thing.
  (FWIW this is also my model of what Sonnet 3.7 is doing, I don’t think it’s coincidence that these models are extremely reward hacky right when we get into the “do tons of outcome based RL on agentic tasks” regime).

Keyboard shortcuts

Keys shown in yellow (e.g., ]) are accesskeys, and require a browser-specific modifier key (or keys).

Keys shown in grey (e.g., ?) do not require any modifier keys.

General
? Show keyboard shortcuts
Esc Hide keyboard shortcuts

Site navigation
h Go to Home (a.k.a. “Frontpage”) view
f Go to Featured (a.k.a. “Curated”) view
a Go to All (a.k.a. “Community”) view
m Go to Meta view
v Go to Tags view
c Go to Recent Comments view
r Go to Archive view
q Go to Sequences view
t Go to About page
u Go to User or Login page
o Go to Inbox page

Page navigation
, Jump up to top of page
. Jump down to bottom of page
/ Jump to top of comments section
s Search

Page actions
n New post or comment
e Edit current post

Post/comment list views
. Focus next entry in list
, Focus previous entry in list
; Cycle between links in focused entry
Enter Go to currently focused entry
Esc Unfocus currently focused entry
] Go to next page
[ Go to previous page
\ Go to first page
e Edit currently focused post

Editor
k Bold text
i Italic text
l Insert hyperlink
q Blockquote text

Appearance
= Increase text size
- Decrease text size
0 Reset to default text size
′ Cycle through content width settings
1 Switch to default theme [A]
2 Switch to dark theme [B]
3 Switch to grey theme [C]
4 Switch to ultramodern theme [D]
5 Switch to simple theme [E]
6 Switch to brutalist theme [F]
7 Switch to ReadTheSequences theme [G]
8 Switch to classic Less Wrong theme [H]
9 Switch to modern Less Wrong theme [I]
; Open theme tweaker
Enter Save changes and close theme tweaker
Esc Close theme tweaker (without saving)

Slide shows
l Start/resume slideshow
Esc Exit slideshow
→↓ Next slide
←↑ Previous slide
Space Reset slide zoom

Miscellaneous
x Switch to next view on user page
z Switch to previous view on user page
` Toggle compact comment list view
g Toggle anti-kibitzer