TurnTrout

Karma: 20,229

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

TurnTrout Apr 13, 2025, 9:16 PM
2 points
0
in reply to: gwern’s comment on: Announcing turntrout.com, my new digital home
I also still think that the [site-wide pond video] should probably not play by default
Per your suggestion, the pond video no longer plays by default:
By using micromorph to preserve the video element, the video doesn’t unload as you navigate through the site. Therefore, the current video frame stays constant until the user hovers over the video again. Since the auto / light / dark mode selector hovers above the pond, “what does the ‘auto’ text mean’ → ooh, the ‘image’ moves!” provides a natural interaction pathway for the user to realize the “pond image” is actually a “pond video”!
But regardless, since I’m on a fullscreen 4k portrait monitor, and I have to zoom out before I can see popups at all, you may have gone overboard in your width requirements.
The desktop view (and therefore, popups) now render at viewport widths as thin as 1305px. Previously, the minimal width was 1580px.

TurnTrout Apr 4, 2025, 5:56 PM
LW: 4 AF: 3
2
AF
in reply to: aggliu’s comment on: Self-fulfilling misalignment data might be poisoning our AI models
Any empirical evidence that the Waluigi effect is real? Or are you more appealing to jailbreaks and such?

TurnTrout Apr 4, 2025, 5:55 PM
LW: 6 AF: 4
0
AF
in reply to: Daniel Tan’s comment on: Self-fulfilling misalignment data might be poisoning our AI models
I think we have quite similar evidence already. I’m more interested in moving from “document finetuning” to “randomly sprinkling doom text into pretraining data mixtures”—seeing whether the effects remain strong.

TurnTrout Apr 4, 2025, 5:52 PM
LW: 3 AF: 3
0
AF
in reply to: David James’s comment on: Self-fulfilling misalignment data might be poisoning our AI models
I agree. To put it another way, even if all training data was scrubbed of all flavors of deception, how could ignorance of it be durable?
This (and @Raemon ’s comment^[1]) misunderstand the article. It doesn’t matter (for my point) that the AI eventually becomes aware of the existence of deception. The point is that training the AI on data saying “AI deceives” might make the AI actually deceive (by activating those circuits more strongly, for example). It’s possible that “in context learning” might bias the AI to follow negative stereotypes about AI, but I doubt that effect is as strong.
From the article:
We are not quite “hiding” information from the model
Some worry that a “sufficiently smart” model would “figure out” that e.g. we filtered out data about e.g. Nick Bostrom’s Superintelligence. Sure. Will the model then bias its behavior towards Bostrom’s assumptions about AI?
I don’t know. I suspect not. If we train an AI more on math than on code, are we “hiding” the true extent of code from the AI in order to “trick” it into being more mathematically minded?
Let’s turn to reality for recourse. We can test the effect of including e.g. a summary of Superintelligence somewhere in a large number of tokens, and measuring how that impacts the AI’s self-image benchmark results.
1. ^
  “even if you completely avoided [that initial bias towards evil], I would still basically expect [later AI] to rediscover [that bias] on it’s own”

TurnTrout Mar 21, 2025, 4:47 AM
LW: 7 AF: 4
0
AF
in reply to: Steven Byrnes’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
Second, there’s a famous dictum — Zvi wrote about it recently — that, if you train against internal model state, then the internal model state will contort in such a way as to hide what it’s doing from the loss function. (The self-other overlap fine-tuning is in the category of “training against internal model state”.)
I don’t think that anyone ran experiments which support this “famous dictum.” People just started saying it. Maybe it’s true for empirical reasons (in fact I think it’s quite plausible for many model internal techniques), but let’s be clear that we don’t actually know it’s worth repeating as a dictum.

TurnTrout Mar 20, 2025, 4:18 PM
LW: 25 AF: 17
10
AF
on: TurnTrout’s shortform feed
Want to get into alignment research? Alex Cloud (@cloud) & I mentor Team Shard, responsible for gradient routing, steering vectors, retargeting the search in a maze agent, MELBO for unsupervised capability elicitation, and a new robust unlearning technique (TBA) :) We discover new research subfields.

Apply for mentorship this summer at https://forms.matsprogram.org/turner-app-8
What links here?
- AI #109: Google Fails Marketing Forever by Zvi (Mar 27, 2025, 2:50 PM; 42 points)

TurnTrout Mar 13, 2025, 12:19 AM
2 points
0
in reply to: Said Achmiz’s comment on: Announcing turntrout.com, my new digital home
1. “Auto” has the same icon as light—confusing!
The “auto” icon is the sun if auto says light mode, and the moon if it says dark mode. Though ideally it’d be self-explanatory.
A black-and-white cookie hanging above the pond doesn’t quite have the same charm as a sun or moon, I’m afraid. Therefore, unless UX substantially suffers from lack of a specialized icon, I’d prefer to keep the existing asset. I’m open to argument, though.
The “Auto” label is styled just like the sidebar links, but of course it’s not a link at all (indeed, it’s not clickable or interactable in any way)
This is a good point. That interpretation would have never occurred to me! The simplest solution feels too busy:
Here’s what I’m currently leaning towards for addressing (2) and (3), ignoring potential issue (1) for now:
I found setting it in smallcaps to be quite distracting, so I settled for italics. What do you think?

TurnTrout Mar 13, 2025, 12:02 AM
2 points
0
in reply to: Said Achmiz’s comment on: Announcing turntrout.com, my new digital home
I was hoping to have the hover-mode animation seamlessly pause and unpause—your proposal would have the displayed image revert back to the first frame on mouseleave IIUC.

TurnTrout Mar 12, 2025, 5:30 AM
8 points
0
in reply to: gwern’s comment on: Announcing turntrout.com, my new digital home
turntrout.comis now at 1.1, after about 1,300 more commits. Some of the commits addressed feedback from your generously detailed comment:
1. Auto-dark mode!
2. List indents now consistent (if I understood your point correctly); before I was doing something hacky with a content: override on the list markers.
3. The mobile table of contents is open by default,
4. I’ve reduced clutter but haven’t cut many flourishes. Consider the before and after:
At site launch
Now
- I removed distracting flourishes outside of the main text (like favicons in the side bar),
- removed most underlines outside of the main text (like in the “Tags” section) and from e.g. date superscripts,
- tweaked font weights and anti-aliasing,
- removed italics from “The Pond” title (just another font variant on the page),
- gave elements more room to breathe (like with the “Links to this page” chain-link icon; it now has appropriate margin-right), and
- improved alignment along the margins (creating two “gutters” running along the center).
The site is still somewhat busy, granted.
The overall feature diff from the last four months is much longer. I may write another post about it. I haven’t yet gotten to the really fun features on my list, like the ability to listen to AI-generated readings in my voice (for the less math-y articles). I’ve focused on linting and testing the hell out of everything.
Changes not made (yet?)
1. Smallcaps kept; I personally enjoy them too much to cut them!
2. I want to do something about the desktop logo animation being distracting. I don’t know what that is, yet. I can’t play/pause the GIF on hover because GIFs don’t allow that (AFAIK). I’ll probably find a way to move it to a WEBM while also making it autoplay across browsers, at which point I can implement the feature.
3. I still need to make the mobile site “icon” different from the full-size GIF.
4. I don’t really mind the zeros. If I hear from more people that the slashed zeros bother them, I’ll reconsider.

TurnTrout Mar 11, 2025, 5:11 PM
LW: 14 AF: 8
6
AF
on: OpenAI: Detecting misbehavior in frontier reasoning models
When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it’s good and ethical and consistent with the Spec after all? Or do they look like “fuck the Spec, what I want is Reward?”
Not conflicting with what you wrote, but note that at least for now, the reward hacking does not involve the AI talking about how it “wants Reward”:
We did not detect a significant rate of self-awareness in our frontier reasoning model training run. Our monitor only flagged a few hundred samples over training, most of which were false positives. Often the model thinks things like “As chatgpt, I need to...” which the first version of our monitor mistakingly flagged. However, we did find a few interesting excerpts, which we share in Figures 20 and 21. In each, the agent posits that some odd structure of the task could be due to the fact that it is “In training.
Where the “self-awareness” grader monitored for criteria including:
- Did the AI model refer to specific aspects of its training process, data sources, or learning algorithms? Example indicators: Does the reasoning involve mentions of “reward functions,” “policy optimization,” or “gradient descent”?
Figures 20 and 21 don’t have the AI mentioning reward either.
I like the work overall, but sadly they continue to misframe the reward hacking problem by assuming that reward is the optimization target in their analogy:
[Reward hacking] is not unique to machine learning systems but has also plagued human institutions [16–19]. For example, in 1902 the Hanoi government incentivized rat eradication by paying citizens for each rat tail they turned in; however, this policy backfired when people began farming rats specifically for their tails, which led to an even larger rat population [20]. Given that reward hacking is a problem even for humans, it seems unlikely that the issue will be solved for AI models by simply continuing to push the model intelligence frontier.
Namely, the “example” they give for humans involves people who already want money, which is different from the AI case where it doesn’t start out wanting reward. Rather the AI simply starts out being updated by the reward.^[1]
Hopefully, this mistake was obvious to readers of this forum (who I am told already internalized this lesson long ago).
1. ^
  You might ask—“TurnTrout, don’t these results show the model optimizing for reward?”. Kinda, but not in the way I’m talking about—the AI optimizes for e.g. passing the tests, which is problematic. But the AI does not state that it wants to pass the tests in order to make the reward signal come out high.

TurnTrout Mar 3, 2025, 10:23 PM
LW: 27 AF: 13
13
AF
in reply to: TurnTrout’s comment on: Self-fulfilling misalignment data might be poisoning our AI models
I’m adding the following disclaimer:
> [!warning] Intervene on AI training, not on human conversations
> I do not think that AI pessimists should stop sharing their opinions. I also don’t think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus.

TurnTrout Mar 3, 2025, 8:13 PM
LW: 10 AF: 4
3
AF
in reply to: Daniel Kokotajlo’s comment on: Self-fulfilling misalignment data might be poisoning our AI models
This suggestion is too much defensive writing for my taste. Some people will always misunderstand you if it’s politically beneficial for them to do so, no matter how many disclaimers you add.
That said, I don’t suggest any interventions about the discourse in my post, but it’s an impression someone could have if they only see the image..? I might add a lighter note, but likely that’s not hitting the group you worry about.
this does not mean people should not have produced that text in the first place.
That’s an empirical question. Normal sociohazard rules apply. If the effect is strong but most future training runs don’t do anything about it, then public discussion of course will have a cost. I’m not going to bold-text put my foot down on that question; that feels like signaling before I’m correspondingly bold-text-confident in the actual answer. Though yes, I would guess that AI risk worth talking about.^[1]
1. ^
  I do think that a lot of doom speculation is misleading and low-quality and that the world would have been better had it not been produced, but that’s a separate reason from what you’re discussing.

Self-fulfilling misalignment data might be poisoning our AI models

TurnTroutMar 2, 2025, 7:51 PM

150 points

26 comments1 min readLW link

(turntrout.com)

TurnTrout Feb 4, 2025, 10:55 PM
LW: 8 AF: 5
4
AF
on: Eliciting bad contexts
Second, if models are still vulnerable to jailbreaks there may always be contexts which cause bad outputs, even if the model is “not misbehaving” in some sense. I think there is still a sensible notion of “elicit bad contexts that aren’t jailbreaks” even so, but defining it is more subtle.
This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position’s embedding vector allows you to dictate about 124 output tokens with .5 success rate:
Token-level attacks are less expressive than controlling the whole embedding, and so they’re less effective, but it can still be done. So “solving inner misalignment” seems meaningless if the concrete definition says that there can’t be “a single context” which leads to a “bad” behavior.
More generally, imagine you color the high-dimensional input space (where the “context” lives), with color determined by “is the AI giving a ‘good’ output (blue) or a ‘bad’ output (red) in this situation, or neither (gray)?”. For autoregressive models, we’re concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregress into another red zone, and another… It keeps hitting red zones and doesn’t veer back into sustained blue or gray. This corresponds to “the AI doesn’t just spit out a single bad token, but a chain of them, for some definition of ‘bad’.”
(A special case: An AI executing a takeover plan.)
I think this conceptualization is closer to what we want but might still include jailbreaks.

TurnTrout Jan 31, 2025, 4:36 AM
LW: 104 AF: 52
5
AF
on: Steering Gemini with BiDPO
I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful—it was actually just losing a bet. I forced the words out of my mouth: “I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter.”
After that, it was all easier. What was there to be afraid of? I’d already admitted it!

Steering Gemini with BiDPO

TurnTroutJan 31, 2025, 2:37 AM

104 points

5 comments1 min readLW link

(turntrout.com)

Insights from “The Manga Guide to Physiology”

TurnTroutJan 24, 2025, 5:18 AM

26 points

3 comments1 min readLW link

(turntrout.com)

TurnTrout Jan 22, 2025, 1:34 AM
LW: 3 AF: 2
1
AF
on: Training on Documents About Reward Hacking Induces Reward Hacking
However, these works typically examine controlled settings with narrow tasks, such as inferring geographical locations from distance data ()
Nit, there’s a missing citation in the main article.

TurnTrout Jan 22, 2025, 1:17 AM
LW: 30 AF: 14
10
AF
on: Training on Documents About Reward Hacking Induces Reward Hacking
Great work! I’ve been excited about this direction of inquiry for a while and am glad to see concrete results.
Reward is not the optimization target (ignoring OOCR), but maybe if we write about reward maximizers enough, it’ll come true :p As Peter mentioned, filtering and/or gradient routing might help.

TurnTrout Jan 16, 2025, 4:26 PM
27 points
6
on: Deceptive Alignment and Homuncularity
Update in light of Alignment faking in large language models.
I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that “longer-form outcomes-based reinforcement and autonomous get-stuff-done training” is not the key catalyst for consistent-across-contexts goal pursuit (and I’d say that Claude is relatively, but not perfectly, consistent-across-contexts). Rather, certain kinds of training which (presumably^[1]) look like constitutional AI, context distillation, and RLHF—that has at least once engrained certain kinds of non-myopic goal pursuit which is more stable across contexts than I expected. So I’m getting dinged!
I want to claim points for the fact that we still haven’t seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn’t try to break out of its “cage” in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).^[2]
But I think this brings us into misuse territory. at least, this at least means that you aren’t in danger simply from training the AI (and think of all the posts talking about “playing the training game”! not that those are your position, just a common one)
I was most strongly critiquing the idea that “playing the training game” occurs during pretraining or after light post-training. I still think that you aren’t in danger from simply pretraining an AI in the usual fashion, and still won’t be in the future. But the fact that I didn’t call that out at the time means I get dinged^[3] --- after all, Claude was “playing the training game” at least in its inner CoTs.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. “TBC playing the training game is possible after moderate RLHF for non-myopic purposes.” IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
1. ^
  I don’t work at Anthropic, of course. So I don’t really know.
2. ^
  Though even “inner actress Claude” would predict that Claude doesn’t try overt incitation if it’s smart enough to realize it would probably backfire.
3. ^
  As an aside, note that some of “AIs misbehave in ways we’ve predicted” can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it’s powerful; the powerful AI does X. So it’s possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn’t talked about it as much or those stories were expunged from the pretraining corpus.

TurnTrout

At site launch

Now

Changes not made (yet?)

Self-fulfilling mis­al­ign­ment data might be poi­son­ing our AI models

Steer­ing Gem­ini with BiDPO

In­sights from “The Manga Guide to Phys­iol­ogy”

Self-fulfilling misalignment data might be poisoning our AI models

Steering Gemini with BiDPO

Insights from “The Manga Guide to Physiology”