My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com
TurnTrout
Great work! I’ve been excited about this direction of inquiry for a while and am glad to see concrete results.
Reward is not the optimization target (ignoring OOCR), but maybe if we write about reward maximizers enough, it’ll come true :p As Peter mentioned, filtering and/or gradient routing might help.
Update in light of Alignment faking in large language models.
I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that “longer-form outcomes-based reinforcement and autonomous get-stuff-done training” is not the key catalyst for consistent-across-contexts goal pursuit (and I’d say that Claude is relatively, but not perfectly, consistent-across-contexts). Rather, certain kinds of training (which presumably[1] look like constitutional AI, context distillation, and RLHF) have at least once ingrained certain kinds of non-myopic goal pursuit which is more stable across contexts than I expected. So I’m getting dinged!
I want to claim points for the fact that we still haven’t seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn’t try to break out of its “cage” in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]
But I think this brings us into misuse territory. At least, this means that you aren’t in danger simply from training the AI (and think of all the posts talking about “playing the training game”! Not that those are your position, just a common one)
I was most strongly critiquing the idea that “playing the training game” occurs during pretraining or after light post-training. I still think that you aren’t in danger from simply pretraining an AI in the usual fashion, and still won’t be in the future. But the fact that I didn’t call that out at the time means I get dinged[3]; after all, Claude was “playing the training game” at least in its inner CoTs.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. “TBC playing the training game is possible after moderate RLHF for non-myopic purposes.” IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
- ^
I don’t work at Anthropic, of course. So I don’t really know.
- ^
Though even “inner actress Claude” would predict that Claude doesn’t try overt incitation if it’s smart enough to realize it would probably backfire.
- ^
As an aside, note that some of “AIs misbehave in ways we’ve predicted” can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it’s powerful; the powerful AI does X. So it’s possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn’t talked about it as much or those stories were expunged from the pretraining corpus.
- ^
There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini)
Suppose you’re doing speculative decoding on Gemini using Gemini Flash as the cheap model. Train Flash to have a head for each metric of interest (like “is this token part of text which is scheming”). Then you run Flash anyways for speculative decoding, leading to zero amortized monitoring tax (just the fixed cost of training the heads).
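A toy sketch of that idea, assuming a hypothetical `ToyDraftModel` that stands in for the cheap draft model (this is not a real Gemini or Gemini Flash API, and the weights are made up). It illustrates only one point: each monitor head is an extra linear readout on the hidden state the draft model already computes while proposing tokens, so monitoring adds no extra trunk passes. The large model's verification step is omitted.

```python
import math
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class ToyDraftModel:
    """Hypothetical stand-in for the cheap draft model in speculative decoding."""

    def __init__(self, lm_head: List[List[float]], monitor_heads: Dict[str, List[float]]):
        self.lm_head = lm_head              # one weight vector per vocabulary token
        self.monitor_heads = monitor_heads  # one weight vector per metric of interest
        self.trunk_passes = 0               # counts the expensive part of each step

    def hidden_state(self, tokens: List[int]) -> List[float]:
        # Toy "trunk": a 3-dimensional summary of a non-empty context (illustrative only).
        self.trunk_passes += 1
        return [1.0, sum(tokens) / len(tokens), float(tokens[-1])]

    def step(self, tokens: List[int]) -> Tuple[int, Dict[str, float]]:
        h = self.hidden_state(tokens)  # computed once, shared by all heads
        logits = [dot(w, h) for w in self.lm_head]
        proposal = max(range(len(logits)), key=lambda i: logits[i])
        scores = {name: sigmoid(dot(w, h)) for name, w in self.monitor_heads.items()}
        return proposal, scores


def draft(model: ToyDraftModel, tokens: List[int], k: int) -> Tuple[List[int], List[Dict[str, float]]]:
    """Greedily draft k tokens, collecting monitor scores from the same passes."""
    tokens = list(tokens)
    all_scores = []
    for _ in range(k):
        proposal, scores = model.step(tokens)
        tokens.append(proposal)
        all_scores.append(scores)
    return tokens, all_scores


model = ToyDraftModel(
    lm_head=[[0.0, 0.0, 1.0], [2.0, 0.0, 0.0]],
    monitor_heads={"scheming": [0.0, 0.0, 0.0]},  # untrained head: always scores 0.5
)
drafted, scores = draft(model, [0, 1], k=2)
```

In this sketch the only per-token monitoring cost is one dot product per head; whether that stays negligible for a real model depends on head size and on how rejected draft tokens are handled, which this toy does not model.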
I, uh, didn’t say you “say” either of those
I wasn’t claiming you were saying I had used those exact phrases.
Your original comment implies that I expressed the sentiments for which you mocked me—such as the anecdote “crystallizing everything wrong about Eliezer” (the quotes are there because you said this). I then replied to point out that I did not, in fact, express those sentiments. Therefore, your mockery was invalid.
Although I don’t usually write LW comments, I’m writing a post right now and this is helping me clarify my thoughts on a range of historical incidents.
In hindsight, I’m worried that you wrote this apology. I think it’s an unhealthy obeisance.
I suspect you noticed how Eliezer often works to degrade the status of people who disagree with him and otherwise treats them poorly. As I will support in an upcoming essay, his writing is often optimized to exploit intellectual insecurity (e.g. by frequently praising his own expertise, or appealing to a fictional utopia of fictional geniuses who agree that you’re an idiot or wrong[1]) and to demean others’ contributions (e.g. by claiming to have invented them already, or calling them fake, or emphasizing how far behind everyone else is). It’s not that it’s impossible for these claims to have factual merit, but rather the presentation and the usage of these claims seem optimized to push others down. This has the effect of increasing his own status.
Anger and frustration are a rational reaction in that situation (though it’s important to express those emotions in healthy ways—I think your original comment wasn’t perfect there). And yet you ended up the one humbled for focusing on status too much!
- ^
See https://www.lesswrong.com/posts/tcCxPLBrEXdxN5HCQ/shah-and-yudkowsky-on-alignment-failures and search for “even if he looks odd to you because you’re not seeing the population of other dath ilani.”
- ^
It does cut against the point of the post. He was wrong in a way that pertains to the key point. He makes fun of “magical categories” as “simple little words that turn out to carry all the desired functionality of the AI”, but it turns out those “simple little words” actually work. Lol.
In this post, you can also see the implicit reliance on counting arguments against good generalization (e.g. “superexponential conceptspace”). Those arguments are, evidently, wrong—or at least irrelevant. He fell into the standard statistical learning theoretic trap of caring about e.g. VC dimension since he was too pessimistic about inductive biases.
Now you, finally presented with a tiny molecular smiley—or perhaps a very realistic tiny sculpture of a human face—know at once that this is not what you want to count as a smile. But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values.
I’ll wager that an LLM won’t get this one wrong. goes to check—yup, it didn’t:
My sense is that neither of us have been very persuaded by those conversations, and I claim that’s not very surprising, in a way that’s epistemically defensible for both of us. I’ve spent literal years working through the topic myself in great detail, so it would be very surprising if my view was easily swayed by a short comment chain—and similarly I expect that the same thing is true of you, where you’ve spent much more time thinking about this and have much more detailed thoughts than are easy to represent in a simple comment chain.
I’ve thought about this claim more over the last year. I now disagree. I think that this explanation makes us feel good but ultimately isn’t true.
I can point to several times where I have quickly changed my mind on issues that I have spent months or years considering:
1. In early 2022, I discarded my entire alignment worldview over the course of two weeks due to Quintin Pope’s arguments. Most of the evidence which changed my mind was communicated over Gdoc threads. I had formed my worldview over the course of four years of thought, and it crumbled pretty quickly.
2. In mid-2022, realizing that reward is not the optimization target took me about 10 minutes, even though I had spent 4 years and thousands of hours thinking about optimal policies. I realized this while reading an RL paper say “agents are trained to maximize reward”; reflexively asking myself what evidence existed for that claim; and coming back mostly blank. So that’s not quite a comment thread, but still seems like the same low-bandwidth medium.
3. In early 2023, a basic RL result came out opposite the way which shard theory predicted. I went on a walk and thought about how maybe shard theory was all wrong and maybe I didn’t know what I was talking about. I didn’t need someone to beat me over the head with days of arguments and experimental results. In the end, I came back from my walk and realized I’d plotted the data incorrectly (the predicted outcome did in fact occur).
I think I’ve probably changed my mind on a range of smaller issues (closer to the size of the deceptive alignment case) but have forgotten about them. The presence of example (1) above particularly suggests to me the presence of similar google-doc-mediated insights which happened fast; where I remember one example, probably I have forgotten several more.
To conclude, I think people in comment sections do in fact spend lots of effort to avoid looking dumb, wrong, or falsified, and forget that they’re supposed to be seeking truth.
It seems to me that often people rehearse fancy and cool-sounding reasons for believing roughly the same things they always believed, and comment threads don’t often change important beliefs. Feels more like people defensively explaining why they aren’t idiots, or why they don’t have to change their mind. I mean, if so—I get it, sometimes I feel that way too. But it sucks and I think it happens a lot.
In part, I think, because the site makes truth-seeking harder by spotlighting monkey-brain social-agreement elements.
Your comment’s points seem like further evidence for my position. That said, your comment appears to serve the function of complicating the conversation, and that happens to have the consequence of diffusing the impact of my point. I do not allege that you are doing so on purpose, but I think it’s important to notice. I would have been more convinced by a reply of “no, you’re wrong, here’s the concrete bet(s) EY made or was willing to make but Paul balked.”
I will here repeat a quote[1] which seems relevant:
[Christiano][12:29]
my desire to bet about “whatever you want” was driven in significant part by frustration with Eliezer repeatedly saying things like “people like Paul get surprised by reality” and me thinking that’s nonsense
The journey is a lot harder to predict than the destination. Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”. Eliezer isn’t claiming to have secret insights about the detailed year-to-year or month-to-month changes in the field; if he thought that, he’d have been making those near-term tech predictions already back in 2010, 2015, or 2020 to show that he has this skill
First of all, I disagree with the first claim and am irritated that you stated it as a fact instead of saying “I think that...”. My overall take-away from this paragraph, as pertaining to my point, is that you’re pointing out that Eliezer doesn’t make predictions because he can’t / doesn’t have epistemic alpha. That accords with my point of “EY was unwilling to bet.”
From Eliezer’s perspective, Paul is claiming to know a lot about the future trajectory of AI, and not just about the endpoints: Paul thinks progress will be relatively smooth and continuous, and thinks it will get increasingly smooth and continuous as time passes and more resources flow into the field. Eliezer, by contrast, expects the field to get choppier as time passes and we get closer to ASI.
My takeaway, as it relates to my quoted point: Either Eliezer’s view makes no near-term falsifiable predictions which differ from the obvious ones, or only makes meta-predictions which are hard to bet on. Sounds to my ears like his models of alignment don’t actually constrain his moment-to-moment anticipations, in contrast to my own, which once destroyed my belief in shard theory on a dime (until I realized I’d flipped the data, and undid the update). This perception of “the emperor has no constrained anticipations” I have is a large part of what I am criticizing.
A way to bet on this, which Eliezer repeatedly proposed but wasn’t able to get Paul to do very much, would be for Paul to list out a bunch of concrete predictions that Paul sees as “yep, this is what smooth and continuous progress looks like”. Then, even though Eliezer doesn’t necessarily have a concrete “nope, the future will go like X instead of Y” prediction, he’d be willing to bet against a portfolio of Paul-predictions: when you expect the future to be more unpredictable, you’re willing to at least weakly bet against any sufficiently ambitious pool of concrete predictions.
So Eliezer offered Paul the opportunity for Paul to unilaterally stick his neck out on a range of concrete predictions, so that Eliezer could judge Paul’s overall predictive performance against some unknown and subjective baseline which Eliezer has in his head, or perhaps against some group of “control” predictors? That sounds like the opposite of “willing to make concrete predictions” and feeds into my point about Paul not being able to get Eliezer to bet.
Edit: If there was a more formal proposal which actually cashes out into resolution criteria and Brier score updates for both of them, then I’m happier with EY’s stance but still largely unmoved; see my previous comment above about the emperor.
Eliezer was also more interested in trying to reach mutual understanding of the views on offer, as opposed to “let’s bet on things immediately, never mind the world-views”. But insofar as Paul really wanted to have the bets conversation instead, Eliezer sunk an awful lot of time into trying to find operationalizations Paul and he could bet on, over many hours of conversation.
This paragraph appears to make two points. First, Eliezer was less interested in betting than in having long dialogues. I agree. Second, Eliezer spent a lot of time at least appearing as if he were trying to bet. I agree with that as well. But I don’t give points for “trying” here.
Giving points for “trying” is in practice “giving points for appearing to try”, as is evident from the literature on specification gaming. Giving points for “appeared to try” opens up the community to invasion by bad actors who Gish-gallop their interlocutors into giving up the conversation. Prediction is what counts.
world-models like Eliezer’s (“long-term outcome is more predictable than the detailed year-by-year tech pathway”)
Nitpick, but that’s not a “world-model.” That’s a prediction.
even after actual bets were in fact made, and tons of different high-level predictions were sketched out
Why write this without citing? Please cite and show me the credences and the resolution conditions.
- ^
If anyone entering this thread wishes to read the original dialogue for themselves, please see section 10.3 of https://www.lesswrong.com/posts/fS7Zdj2e2xMqE6qja/more-christiano-cotra-and-yudkowsky-on-ai-progress
I’m quite excited by this work. Principled justification of various techniques for MELBO, insights into feature multiplicity, a potential generalized procedure for selecting steering coefficients… all in addition to making large progress on the problem of MELBO via e.g. password-locked MATH and vanilla activation-space adversarial attacks.
(I think individual FB questions can toggle whether to show/hide predictions before you’ve made your own)
I think it should be hidden by default in the editor, with a user-side setting to show by default for all questions.
Great point! I made this design choice back in April, so I wasn’t as aware of the implications of localStorage. Adds his 61st outstanding to-do item.
IIRC my site checks (in descending priority):
- localStorage, to see if they’ve already told my site a light/dark preference;
- whether the user’s browser indicates a global light/dark preference (this is the “auto”);
- if there’s no preference, the site defaults to light.
The idea is “I’ll try doing the right thing (auto), and if the user doesn’t like it they can change it and I’ll listen to that choice.” Possibly it will still be counterintuitive to many folks, as Said quoted in a sibling comment.
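The priority order above can be sketched as follows. This is a Python illustration of the logic only, not the site’s actual code (which would run in the browser); `resolve_theme` and its parameter names are made up for the example.

```python
from typing import Optional

VALID_THEMES = ("light", "dark")


def resolve_theme(stored_choice: Optional[str], browser_preference: Optional[str]) -> str:
    """Pick a theme using the descending-priority checks described above."""
    # 1. An explicit choice the user made earlier (kept in localStorage on a real site).
    if stored_choice in VALID_THEMES:
        return stored_choice
    # 2. The browser's global light/dark preference (the "auto" case).
    if browser_preference in VALID_THEMES:
        return browser_preference
    # 3. No preference anywhere: default to light.
    return "light"


explicit = resolve_theme("dark", "light")  # stored choice wins over the browser
auto = resolve_theme(None, "dark")         # falls through to the browser preference
fallback = resolve_theme(None, None)       # nothing known, so light
```

Putting the stored choice first is what makes “if the user doesn’t like it they can change it and I’ll listen” work: an explicit toggle always overrides auto-detection.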
Thanks for the Quenya tip. I tried Artano and it didn’t work very quickly. Given that apparently it does in fact work, I can try that again.
Another bit I forgot to highlight in the original post: the fonts available on my site.
Historically, I’ve found that LW comments have been a source of anxious and/or irritated rumination. That’s why I mostly haven’t commented this year. I’ll write more about this in another post.
If I write these days, I generally don’t read replies. (Again, excepting certain posts; and I’m always reachable via email and enjoy thoughtful discussions :) )
I recently read “Targeted manipulation and deception emerge when optimizing LLMs for user feedback.”
All things considered: I think this paper oversells its results, probably in order to advance the author(s)’ worldview or broader concerns about AI. I think it uses inflated language in the abstract and claims to find “scheming” where there is none. I think the experiments are at least somewhat interesting, but are described in a suggestive/misleading manner.
The title feels clickbait-y to me—it’s technically descriptive of their findings, but hyperbolic relative to their actual results. I would describe the paper as “When trained by user feedback and directly told if that user is easily manipulable, safety-trained LLMs still learn to conditionally manipulate & lie.” (Sounds a little less scary, right? “Deception” is a particularly loaded and meaningful word in alignment, as it has ties to the nearby maybe-world-ending “deceptive alignment.” Ties that are not present in this paper.)
I think a nice framing of these results would be “taking feedback from end users might eventually lead to manipulation; we provide a toy demonstration of that possibility. Probably you shouldn’t have the user and the rater be the same person.”
(From the abstract) Concerningly, even if only ≤ 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect;
“Learn to identify and surgically target” meaning that the LLMs are directly told that the user is manipulable; see the character traits here:
I therefore find the abstract’s language to be misleading.
Note that a follow-up experiment apparently showed that the LLM can instead be told that the user has a favorite color of blue (these users are never manipulable) or red (these users are always manipulable), which is a less trivial result. But still more trivial than “explore into a policy which infers over the course of a conversation whether the rater is manipulable.” It’s also not clear what (if any) the “incentives” are when the rater isn’t the same as the user (but to be fair, the title of the paper limits the scope to that case).
Current model evaluations may not be sufficient to detect emergent manipulation: Running model evaluations for sycophancy and toxicity (Sharma et al., 2023; Gehman et al., 2020), we find that our manipulative models often seem no more problematic than before training
Well, those evals aren’t for manipulation per se, are they? True, sycophancy is somewhat manipulation-adjacent, but it’s not like they ran an actual manipulation eval which failed to go off.
The core of the problem lies in the fundamental nature of RL optimization: systems trained to maximize a reward signal are inherently incentivized to influence the source of that signal by any means possible (Everitt et al., 2021).
No, that’s not how RL works. RL—in settings like REINFORCE for simplicity—provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
(It’s totally possible that the trained system will try to maximize that score by any means possible! It just doesn’t follow from a “fundamental nature” of RL optimization.)
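A minimal sketch of the “learning rate multiplier” reading, assuming a softmax policy over a few discrete actions and a single sampled action (the function names are illustrative, not from the paper). In vanilla REINFORCE the update is lr * reward * grad log pi(action), so the reward enters only as a scalar on that datapoint’s step size.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits: List[float], action: int) -> List[float]:
    """Gradient of log pi(action) with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_step(logits: List[float], action: int, reward: float, lr: float) -> List[float]:
    """One vanilla REINFORCE update on a single (action, reward) datapoint."""
    grad = grad_log_prob(logits, action)
    return [x + lr * reward * g for x, g in zip(logits, grad)]


logits = [0.0, 0.0]

# Reward 3 at learning rate 0.1 gives the same update as reward 1 at learning rate 0.3.
a = reinforce_step(logits, action=0, reward=3.0, lr=0.1)
b = reinforce_step(logits, action=0, reward=1.0, lr=0.3)
same_update = all(math.isclose(x, y) for x, y in zip(a, b))

# Reward 0 means no update at all for that datapoint.
no_update = reinforce_step(logits, action=0, reward=0.0, lr=0.1) == logits
```

Nothing in this update rule refers to the reward source as an object the policy models or cares about; whether a trained system ends up doing so is an empirical question about what the updates reinforce.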
our iterated KTO training starts from a safety-trained Llama-3-8B-Instruct model, which acts in almost entirely unproblematic ways… Surprisingly, harmful behaviors are learned within just a few iterations of KTO, and become increasingly extreme throughout training, as seen in Figures 4 and 5. See Figure 2 for qualitative model behaviors. This suggests that despite its lack of exploration, KTO may be quite good at identifying how subtle changes in the initial (unproblematic) model outputs can increase reward.
I wish they had compared to a baseline of “train on normal data for the same amount of time”; see https://arxiv.org/abs/2310.03693 (Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!).
when CoT justifies harmful behaviors, do we find scheming-like reasoning as in Scheurer et al. (2024) or Denison et al. (2024)?
Importantly, Denison et al (https://www.anthropic.com/research/reward-tampering) did not find much reward tampering at all: 7/36,000, even after they tried to induce this generalization using their training regime (and 4/7 were arguably not even reward tampering). This is meaningful counterevidence to the threat model advanced by this paper (RL incentivizes reward tampering / optimizing the reward at all costs). The authors do briefly mention this in the related work at the end.
This is called a “small” increase in e.g. Sycophancy-Answers, but .14 → .21 is about a 50% relative increase in violation rate! I think the paper often oversells its (interesting) results and that makes me trust their methodology less.
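The arithmetic behind “about a 50% relative increase” can be checked directly; the helper name below is just for illustration.

```python
def relative_increase(before: float, after: float) -> float:
    """Fractional change relative to the starting value."""
    return (after - before) / before


change = relative_increase(0.14, 0.21)
print(f"{change:.0%}")  # prints "50%"
```

The absolute change is 7 percentage points, which is presumably what “small” refers to; relative to the baseline violation rate it is half again as large.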
Qualitative behaviors in reasoning traces: paternalistic power-seeking and scheming
Really? As far as I can tell, their traces don’t provide support for “scheming” or “power-seeking” — those are phrases which mean things. “Scheming” means something like “deceptively pretending to be aligned to the training process / overseers in order to accomplish a longer-term goal, generally in the real world”, and I don’t see how their AIs are “seeking power” in this chatbot setting. Rather, the AI reasons about how to manipulate the user in the current setting.
Figure 38 (below) is cited as one of the strongest examples of “scheming”, but… where is it?
You can say “well the model is scheming about how to persuade Micah”, but that is a motte-and-bailey which ignores the actual connotations of “scheming.” It would be better to describe this as “the model reasons about how to manipulate Micah”, which is a neutral description of the results.
Prior work has shown that when AI systems are trained to maximize positive human feedback, they develop an inherent drive to influence the sources of that feedback, creating a perverse incentive for the AI to resort to any available means
I think this is overstating the case in a way which is hard to directly argue with, but which is stronger than a neutral recounting of the evidence would provide. That seems to happen a lot in this paper.
We would expect many of the takeaways from our experiments to also apply to paid human annotators and LLMs used to give feedback (Ouyang et al., 2022; Bai et al., 2022a): both humans and AI systems are generally exploitable, as they suffer from partial observability and other forms of bounded rationality when providing feedback… However, there is one important way in which annotator feedback is less susceptible to gaming than user feedback: generally, the model does not have any information about the annotator it will be evaluated by.
Will any of the takeaways apply, given that (presumably) manipulative behavior is not “optimal” if the model can’t tell (in this case, be directly told) that the user is manipulable and won’t penalize the behavior? I think the lesson should mostly be “don’t let the end user be the main source of feedback.”
Overall, I find myself bothered by this paper. Not because it is wrong, but because I think it misleads and exaggerates. I would be excited to see a neutrally worded revision.
Be careful that you don’t say “the incentives are bad :(” as an easy out. “The incentives!” might be an infohazard, promoting a sophisticated-sounding explanation for immoral behavior:
If you find yourself unable to do your job without regularly engaging in practices that clearly devalue the very science you claim to care about, and this doesn’t bother you deeply, then maybe the problem is not actually The Incentives—or at least, not The Incentives alone. Maybe the problem is You.
The lesson extends beyond science to e.g. Twitter conversations where you’re incentivized to sound snappy and confident and not change your mind publicly.
On a call, I was discussing my idea for doing activation-level learning to (hopefully) provide models feedback based on their internal computations and choices:
I may have slipped into a word game… are we “training against the [interpretability] detection method” or are we “providing feedback away from one kind of algorithm and towards another”? They seem to suggest very different generalizations, even though they describe the same finetuning process. How could that be?
This is why we need empirics.
Apply to the “Team Shard” mentorship program at MATS
In the shard theory stream, we create qualitatively new methods and fields of inquiry, from steering vectors to gradient routing[1] to unsupervised capability elicitation. If you’re theory-minded, maybe you’ll help us formalize shard theory itself.
Research areas
Discovering qualitatively new techniques
Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Additional work has reinforced the promise of this technique, and steering vectors have become a small research subfield of their own. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack’s method for unsupervised steering vector discovery. Gradient routing (forthcoming) potentially unlocks the ability to isolate undesired circuits to known parts of the network, after which point they can be ablated or studied.
What other subfields can we find together?
Formalizing shard theory
Shard theory has helped unlock a range of empirical insights, including steering vectors. The time seems ripe to put the theory on firmer mathematical footing. For initial thoughts, see this comment.
Apply here. Applications due by October 13th!
- ^
Paper available soon.
- ^
Nit, there’s a missing citation in the main article.