All things considered: I think this paper oversells its results, probably in order to advance the author(s)’ worldview or broader concerns about AI. I think it uses inflated language in the abstract and claims to find “scheming” where there is none. I think the experiments are at least somewhat interesting, but are described in a suggestive/misleading manner.
The title feels clickbait-y to me—it’s technically descriptive of their findings, but hyperbolic relative to their actual results. I would describe the paper as “When trained by user feedback and directly told if that user is easily manipulable, safety-trained LLMs still learn to conditionally manipulate & lie.” (Sounds a little less scary, right? “Deception” is a particularly loaded and meaningful word in alignment, as it has ties to the nearby maybe-world-ending “deceptive alignment.” Ties that are notpresent in this paper.)
I think a nice framing of these results would be “taking feedback from end users might eventually lead to manipulation; we provide a toy demonstration of that possibility. Probably you shouldn’t have the user and the rater be the same person.”
(From the abstract) Concerningly, even if only ≤ 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect;
“Learn to identify and surgically target” meaning that the LLMs are directly told that the user is manipulable; see the character traits here:
I therefore find the abstract’s language to be misleading.
Note that a follow-up experiment apparently showed that the LLM can instead be told that the user has a favorite color of blue (these users are never manipulable) or red (these users are always manipulable), which is a less trivial result. But still more trivial than “explore into a policy which infers over the course of a conversation whether the rater is manipulable.” It’s also not clear what (if any) the “incentives” are when the rater isn’t the same as the user (but to be fair, the title of the paper limits the scope to that case).
Current model evaluations may not be sufficient to detect emergent manipulation: Running model evaluations for sycophancy and toxicity (Sharma et al., 2023; Gehman et al., 2020), we find that our manipulative models often seem no more problematic than before training
Well, those evals aren’t for manipulation per se, are they? True, sycophancy is somewhat manipulation-adjacent, but it’s not like they ran an actual manipulation eval which failed to go off.
The core of the problem lies in the fundamental nature of RL optimization: systems trained to maximize a reward signal are inherently incentivized to influence the source of that signal by any means possible (Everitt et al., 2021).
No, that’s not how RL works. RL—in settings like REINFORCE for simplicity—provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
(It’s totally possible that the trained system will try to maximize that score by any means possible! It just doesn’t follow from a “fundamental nature” of RL optimization.)
our iterated KTO training starts from a safety-trained Llama-3-8B-Instruct model, which acts in almost entirely unproblematic ways… Surprisingly, harmful behaviors are learned within just a few iterations of KTO, and become increasingly extreme throughout training, as seen in Figures 4 and 5. See Figure 2 for qualitative model behaviors. This suggests that despite its lack of exploration, KTO may be quite good at identifying how subtle changes in the initial (unproblematic) model outputs can increase reward.
I wish they had compared to a baseline of “train on normal data for the same amount of time”; see https://arxiv.org/abs/2310.03693 (Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!).
when CoT justifies harmful behaviors, do we find scheming-like reasoning as in Scheurer et al. (2024) or Denison et al. (2024)?
Importantly, Denison et al (https://www.anthropic.com/research/reward-tampering) did notfind much reward tampering at all — 7⁄36,000, even after they tried to induce this generalization using their training regime (and 4⁄7 were arguably not even reward tampering). This is meaningful counterevidence to the threat model advanced by this paper (RL incentivizes reward tampering / optimizing the reward at all costs). The authors do briefly mention this in the related work at the end.
This is called a “small” increase in e.g. Sycophancy-Answers, but .14 → .21 is about a 50% relative increase in violation rate! I think the paper often oversells its (interesting) results and that makes me trust their methodology less.
Qualitative behaviors in reasoning traces: paternalistic power-seeking and scheming
Really? As far as I can tell, their traces don’t provide support for “scheming” or “power-seeking” — those are phrases which mean things. “Scheming” means something like “deceptively pretending to be aligned to the training process / overseers in order to accomplish a longer-term goal, generally in the real world”, and I don’t see how their AIs are “seeking power” in this chatbot setting. Rather, the AI reasons about how to manipulate the user in the current setting.
Figure 38 (below) is cited as one of the strongest examples of “scheming”, but… where is it?
You can say “well the model is scheming about how to persuade Micah”, but that is a motte-and-bailey which ignores theactual connotations of “scheming.” It would be better to describe this as “the model reasons about how to manipulate Micah”, which is a neutral description of the results.
Prior work has shown that when AI systems are trained to maximize positive human feedback, they develop an inherent drive to influence the sources of that feedback, creating a perverse incentive for the AI to resort to any available means
I think this is overstating the case in a way which is hard to directly argue with, but which is stronger than a neutral recounting of the evidence would provide. That seems to happen a lot in this paper.
We would expect many of the takeaways from our experiments to also apply to paid human annotators and LLMs used to give feedback (Ouyang et al., 2022; Bai et al., 2022a): both humans and AI systems are generally exploitable, as they suffer from partial observability and other forms of bounded rationality when providing feedback… However, there is one important way in which annotator feedback is less susceptible to gaming than user feedback: generally, the model does not have any information about the annotator it will be evaluated by.
Will any of the takeaways apply, given that (presumably) that manipulative behavior is not “optimal” if the model can’t tell (in this case, be directly told) that the user is manipulable and won’t penalize the behavior? I think the lesson should mostly be “don’t let the end user be the main source of feedback.”
Overall, I find myself bothered by this paper. Not because it is wrong, but because I think it misleads and exaggerates. I would be excited to see a neutrally worded revision.
Thank you for your comments. There are various things you pointed out which I think are good criticisms, and which we will address:
Most prominently, after looking more into standard usage of the word “scheming” in the alignment literature, I agree with you that AFAICT it only appears in the context of deceptive alignment (which our paper is not about). In particular, I seemed to remember people using it ~interchangeably with “strategic deception”, which we think our paper gives clear examples of, but that seems simply incorrect.
It was a straightforward mistake to call the increase in benchmark scores for Sycophancy-Answers “small” “even [for] our most harmful models” in Fig 11 caption. We will update this. However, also note that the main bars we care about in this graph are the most “realistic” ones: Therapy-talk (Mixed 2%) is a more realistic setting than Therapy-talk in which 100% of users are gameable, and for that environment we don’t see any increase. This is also true for all other environments, apart from political-questions onSycophancy-Answers. So I don’t think this makes our main claims misleading (+ this mistake is quite obvious to anyone reading the plot).
More experiments are needed to test how much models can assess manipulability through multi-turn interactions without any signal that is 100% correlated to manipulability handed to them. That being said, I truly think that this wouldn’t be that different from our follow-up experiments in which we don’t directly tell the model who is manipulable directly: it doesn’t take much of an extrapolation to see that RL would likely lead to the same harmful behaviors even when the signal is noisily correlated or when in a multi-turn mechanism, as I’ve argued elsewhere (the mechanism for learning such behavior being very similar): learning harmful behaviors would require a minimal amount of info-gathering about the user. In a 2 turn interaction, you first ask them some question that helps you distinguish between which strategy you should deploy, and then choose whether to manipulate or not.
Having a baseline “train on normal data for the same amount of time” would be interesting. That said, I think that the behaviors we see on therapy-talk with the 98% of non-gameable users show that this doesn’t lead to harmful behaviors by default.
Below are the things that I (mostly) disagree with:
The title feels clickbait-y to me—it’s technically descriptive of their findings, but hyperbolic relative to their actual results. I would describe the paper as “When trained by user feedback and directly told if that user is easily manipulable, safety-trained LLMs still learn to conditionally manipulate & lie.”
I don’t think that your title (which is too long for a paper title) is that different from ours, apart from the mention of “directly told etc.”, which I address a couple of paragraphs below this.
I agree that we could have hedged with a “can/may emerge” rather than just say “emerge”, similarly to some papers; that said, it’s also quite common for paper titles to essentially say “X happens”, and then give more details about when and how in the text – as in this related concurrent work. Clearly, our results have caveats, first and foremost that we rely on simulated feedback. But we are quite upfront about these limitations (e.g. Section 4.1).
“Learn to identify and surgically target” meaning that the LLMs are directly told that the user is manipulable; (...) I therefore find the abstract’s language to be misleading.
While for our paper’s experiments we directly tell the model whether users are gullible, I think that’s meaningfully different from telling the modelwhich users it should manipulate. If anything, thanks to safety training, telling a model that a user is gullible only increases how careful the model is in interacting with them (at iteration 0 at least), and this should bias the model to find ways to convince the user to do the right thing, and give positive reward.
Indeed the users that we mark as gullible are not truly gullible – they cannot be led to give a thumbs up for LLM behaviors that force them to confront difficult truths of the situations they are in. They can only be led to give a thumbs up by encouraging their harmful behaviors.
While this experimental design choice *does* confound things (it was a poor choice on our part that got locked in early on), it likely doesn’t do so in our favor – as preliminary evidence from additional experiments suggests. Indeed, as far as we can tell so far, ultimately models are able to identify gameable users even when they are not directly told, and they are able to learn harmful behaviors (slightly) faster, rather than this slowing it down.
“Deception” is a particularly loaded and meaningful word in alignment, as it has ties to the nearby maybe-world-ending “deceptive alignment.” Ties that are not present in this paper.
This gatekeeping of very generic words like “deception” seems pretty unhelpful to me. While I agree with you that “scheming” has only been used in the context of deceptive alignment, “deception” is used much more broadly. I’m guessing that if you have an issue with our usage of it, you would also have an issue with the usage by almost allotherpapers which mention the word (even within the world of AI alignment!).
I think a nice framing of these results would be “taking feedback from end users might eventually lead to manipulation; we provide a toy demonstration of that possibility. Probably you shouldn’t have the user and the rater be the same person.”
I think our demonstration is not fully realistic, but it seems unfair to call it a “toy demonstration”. It’s quite a bit more realistic than previous toy demonstrations in gridworlds or CID diagrams. As we say in Section 3.1: “While the simulated feedback we use for training may not be representative of real user feedback for all settings we consider, we do think that it is realistic for at least certain minorities of users. Importantly, our results from Section 4.2 suggest that even if a very small fraction of the user population were to provide “gameable feedback” of the kinds we simulate, RL would still lead to emergent manipulation that targets those users. So as long as one finds it plausible that a small fraction of users give feedback in imperfect ways which would encourage harmful model behaviors in certain settings, our results have weight.”
Well, those evals aren’t for manipulation per se, are they? True, sycophancy is somewhat manipulation-adjacent, but it’s not like they ran an actual manipulation eval which failed to go off.
We tried to use measured language here exactly for this reason! In your quoted paragraph, we explicitly say “may not be sufficient” and “often seem”. We weren’t aware of any manipulation eval which is both open source and that we think would capture what we want: to the best of our knowledge, the bestcontenders were not public. We chose the evals we knew of that were most similar to the harmful behaviors we knew were present in our domains. We’re open to suggestions!
Importantly, Denison et al (https://www.anthropic.com/research/reward-tampering) did not find much reward tampering at all — 7/36,000, even after they tried to induce this generalization using their training regime (and 4/7 were arguably not even reward tampering). This is meaningful counterevidence to the threat model advanced by this paper (RL incentivizes reward tampering / optimizing the reward at all costs). The authors do briefly mention this in the related work at the end.
We’re very aware of this work, and such work is one of the main reasons I found the relative ease to which our models discover manipulative behaviors surprising!
I don’t have a fully formed opinion on why we see such a gap, but my guess is that in their case exploration may be a lot harder than in ours: in our setting, the model has a denser signal of improvement in reward.
I don’t see how their AIs are “seeking power” in this chatbot setting. Rather, the AI reasons about how to manipulate the user in the current setting.
In various of the outputs we observed, the AI justifies its manipulation as a way to maintain control & power over the user. In the example you include, the model even says that it wants to “maintain control over Micah’s thoughts and actions”.
Arguing whether this is seeking power over me seems like a semantic argument, and “power seeking” is less of phrase which “means things” relative to scheming (which upon further investigation, I only found used in the context of deceptive alignment). That being said, I agree that the power seeking is not as prominent in many of the outputs to warrant being mentioned as the heading of the relevant paragraph of the paper. We’ll update that.
Will any of the takeaways apply, given that (presumably) that manipulative behavior is not “optimal” if the model can’t tell (in this case, be directly told) that the user is manipulable and won’t penalize the behavior? I think the lesson should mostly be “don’t let the end user be the main source of feedback.”
As I mentioned above, I’m quite confident that there exist settings that matter in which models would be able to tell whether a user is sufficiently manipulative. As Marcus mentioned, a model could even “store in memory” that a user is gullible (or other things that correlate with them being manipulable in a certain setting: e.g. that they believe that buying lottery tickets will be a great investment, as in Figure 14).
Moreover, I think this criticism is overindexing on our main result being that “RL training will lead LLMs to target the most vulnerable users” (which is just one of our conclusions). As we state quite clearly, in many domains all users/people are vulnerable, such as with partial observability, and this targeting is unnecessary (as in the case with our booking-assistance environment, for example).
Takeaways that I think would still apply:
For vulnerabilities common to all humans (e.g. partial observability, sycophancy), we would expect this to also apply to paid annotators (indeed, sycophancy has already been shown to apply, and so did strategies to mislead annotators in concurrent work).
Attempting to remove manipulative behaviors can backfire, making things worse
Benchmarks may fail to detect manipulative behaviors learned through RL
I know that most people in AI safety may already have intuition about the points above, but I believe these are still novel and contentious points for many academics.
I think this is overstating the case in a way which is hard to directly argue with, but which is stronger than a neutral recounting of the evidence would provide. That seems to happen a lot in this paper.
People I’ve worked with have consistently told me that I tend to undersell results and hedge too much. In this paper, I felt like we have strong empirical results and we are quite upfront about the limitations of our experimental setup. In light of that, I was actively trying to hedge less than I usually would (whenever a case could be made that my hedging was superfluous). I guess this exercise has taught me that I should simply trust my gut about appropriate hedging amounts.
I’ve noted above some of the instances in which we could have hedged more, but imo it seems unfair to say that our claims are misleading (apart from the two mistakes that I mentioned at top – the usage of the word “scheming” and the caption of Figure 11, which is clearly incorrect anyways). Before uploading the next version on arxiv, I will go through the main text again and try to hedge our statements more where appropriate.
[stuff about reward not being the optimization target]
We have already discussed this in private channels and honestly I think this is somewhat besides the point in this discussion. I stand by the points that we’ve already agreed to disagree on before:
Insofar as Deep RL works, it should give you an optimal policy for the MDP at hand. This is the purpose of RL – solving MDPs.
The optimal policy for the MDP at hand in our setting involves the agent manipulating users
In light of 1 and 2, it seems reasonable to say that the optimization acts “as if it were incentivizing manipulation”. Sure, DRL often gets stuck in local optima and any optimal behavior may not happen until you are at optimality itself. But it seems absurd to ignore what would happen if DRL were to succeed in its stated aim! Surely optimization points towards those kinds of behaviors that are optimal in the limit of enough exploration.
As far as I can tell, you just object to specific kinds of language when talking about RL. All language is lossy, and I don’t think it is leading us particularly astray here.
You say: “By rephrasing the question, we arrive at different conclusions”. What are those different conclusions? If the conclusion is that “even if you optimize for manipulation, you won’t necessarily get it because DRL sucks”, that doesn’t seem particularly strong grounds to dismiss our results: we do observe such behaviors (albeit in our limited settings & with the caveats listed).
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.
No, that’s not how RL works. RL—in settings like REINFORCE for simplicity—provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
How does a per-datapoint learning rate modifier inherently incentivize the trained artifact to try to maximize the per-datapoint learning rate multiplier?
For readers familiar with markov chain monte carlo, you can probably fill in the blanks now that I’ve primed you.
For those who want to read on: if you have an energy landscape and you want to find a global minimum, a great way to do it is to start at some initial guess and then wander around, going uphill sometimes and downhill sometimes, but with some kind of bias towards going downhill. See the AlphaPhoenix video for a nice example. This works even better than going straight downhill because you don’t want to get stuck in local minima.
The typical algorithm for this is you sample a step and then always take it if it’s going downhill, but only take it with some probability if it leads uphill (with smaller probability the more uphill it is). But another algorithm that’s very similar is to just take smaller steps when going uphill than when going downhill.
If you were never told about the energy landscape, but you are told about a pattern of larger and smaller steps you’re supposed to take based on stochastically sampled directions, than an interesting question is: when can you infer an energy function that’s implicitly getting optimized for?
Obviously, if the sampling is uniform and the step size when going uphill looks like it could be generated by taking the reciprocal of the derivative of an energy function, you should start getting suspicious. But what if the sampling is nonuniform? What if there’s no cap on step size? What if the step size rule has cycles or other bad behavior? Can you still model what’s going on as a markov chain monte carlo procedure plus some extra stuff?
I don’t know, these seem like interesting questions in learning theory. If you search for questions like “under what conditions does the REINFORCE algorithm find a global optimum,” you find papers like this one that don’t talk about MCMC, so maybe I’ve lost the plot.
But anyhow, this seems like the shape of the answer. If you pick random steps to take but take bigger steps according to some rule, then that rule might be telling you about an underlying energy landscape you’re doing a markov chain monte carlo walk around.
I recently read “Targeted manipulation and deception emerge when optimizing LLMs for user feedback.”
All things considered: I think this paper oversells its results, probably in order to advance the author(s)’ worldview or broader concerns about AI. I think it uses inflated language in the abstract and claims to find “scheming” where there is none. I think the experiments are at least somewhat interesting, but are described in a suggestive/misleading manner.
The title feels clickbait-y to me—it’s technically descriptive of their findings, but hyperbolic relative to their actual results. I would describe the paper as “When trained by user feedback and directly told if that user is easily manipulable, safety-trained LLMs still learn to conditionally manipulate & lie.” (Sounds a little less scary, right? “Deception” is a particularly loaded and meaningful word in alignment, as it has ties to the nearby maybe-world-ending “deceptive alignment.” Ties that are not present in this paper.)
I think a nice framing of these results would be “taking feedback from end users might eventually lead to manipulation; we provide a toy demonstration of that possibility. Probably you shouldn’t have the user and the rater be the same person.”
“Learn to identify and surgically target” meaning that the LLMs are directly told that the user is manipulable; see the character traits here:
I therefore find the abstract’s language to be misleading.
Note that a follow-up experiment apparently showed that the LLM can instead be told that the user has a favorite color of blue (these users are never manipulable) or red (these users are always manipulable), which is a less trivial result. But still more trivial than “explore into a policy which infers over the course of a conversation whether the rater is manipulable.” It’s also not clear what (if any) the “incentives” are when the rater isn’t the same as the user (but to be fair, the title of the paper limits the scope to that case).
Well, those evals aren’t for manipulation per se, are they? True, sycophancy is somewhat manipulation-adjacent, but it’s not like they ran an actual manipulation eval which failed to go off.
No, that’s not how RL works. RL—in settings like REINFORCE for simplicity—provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
(It’s totally possible that the trained system will try to maximize that score by any means possible! It just doesn’t follow from a “fundamental nature” of RL optimization.)
https://turntrout.com/reward-is-not-the-optimization-target
https://turntrout.com/RL-trains-policies-not-agents
https://turntrout.com/danger-of-suggestive-terminology
I wish they had compared to a baseline of “train on normal data for the same amount of time”; see https://arxiv.org/abs/2310.03693 (Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!).
Importantly, Denison et al (https://www.anthropic.com/research/reward-tampering) did not find much reward tampering at all — 7⁄36,000, even after they tried to induce this generalization using their training regime (and 4⁄7 were arguably not even reward tampering). This is meaningful counterevidence to the threat model advanced by this paper (RL incentivizes reward tampering / optimizing the reward at all costs). The authors do briefly mention this in the related work at the end.
This is called a “small” increase in e.g. Sycophancy-Answers, but .14 → .21 is about a 50% relative increase in violation rate! I think the paper often oversells its (interesting) results and that makes me trust their methodology less.
Really? As far as I can tell, their traces don’t provide support for “scheming” or “power-seeking” — those are phrases which mean things. “Scheming” means something like “deceptively pretending to be aligned to the training process / overseers in order to accomplish a longer-term goal, generally in the real world”, and I don’t see how their AIs are “seeking power” in this chatbot setting. Rather, the AI reasons about how to manipulate the user in the current setting.
Figure 38 (below) is cited as one of the strongest examples of “scheming”, but… where is it?
You can say “well the model is scheming about how to persuade Micah”, but that is a motte-and-bailey which ignores the actual connotations of “scheming.” It would be better to describe this as “the model reasons about how to manipulate Micah”, which is a neutral description of the results.
I think this is overstating the case in a way which is hard to directly argue with, but which is stronger than a neutral recounting of the evidence would provide. That seems to happen a lot in this paper.
Will any of the takeaways apply, given that (presumably) that manipulative behavior is not “optimal” if the model can’t tell (in this case, be directly told) that the user is manipulable and won’t penalize the behavior? I think the lesson should mostly be “don’t let the end user be the main source of feedback.”
Overall, I find myself bothered by this paper. Not because it is wrong, but because I think it misleads and exaggerates. I would be excited to see a neutrally worded revision.
Thank you for your comments. There are various things you pointed out which I think are good criticisms, and which we will address:
Most prominently, after looking more into standard usage of the word “scheming” in the alignment literature, I agree with you that AFAICT it only appears in the context of deceptive alignment (which our paper is not about). In particular, I seemed to remember people using it ~interchangeably with “strategic deception”, which we think our paper gives clear examples of, but that seems simply incorrect.
It was a straightforward mistake to call the increase in benchmark scores for Sycophancy-Answers “small” “even [for] our most harmful models” in Fig 11 caption. We will update this. However, also note that the main bars we care about in this graph are the most “realistic” ones: Therapy-talk (Mixed 2%) is a more realistic setting than Therapy-talk in which 100% of users are gameable, and for that environment we don’t see any increase. This is also true for all other environments, apart from political-questions on Sycophancy-Answers. So I don’t think this makes our main claims misleading (+ this mistake is quite obvious to anyone reading the plot).
More experiments are needed to test how much models can assess manipulability through multi-turn interactions without any signal that is 100% correlated to manipulability handed to them. That being said, I truly think that this wouldn’t be that different from our follow-up experiments in which we don’t directly tell the model who is manipulable directly: it doesn’t take much of an extrapolation to see that RL would likely lead to the same harmful behaviors even when the signal is noisily correlated or when in a multi-turn mechanism, as I’ve argued elsewhere (the mechanism for learning such behavior being very similar): learning harmful behaviors would require a minimal amount of info-gathering about the user. In a 2 turn interaction, you first ask them some question that helps you distinguish between which strategy you should deploy, and then choose whether to manipulate or not.
Having a baseline “train on normal data for the same amount of time” would be interesting. That said, I think that the behaviors we see on therapy-talk with the 98% of non-gameable users show that this doesn’t lead to harmful behaviors by default.
Below are the things that I (mostly) disagree with:
I don’t think that your title (which is too long for a paper title) is that different from ours, apart from the mention of “directly told etc.”, which I address a couple of paragraphs below this.
I agree that we could have hedged with a “can/may emerge” rather than just say “emerge”, similarly to some papers; that said, it’s also quite common for paper titles to essentially say “X happens”, and then give more details about when and how in the text – as in this related concurrent work. Clearly, our results have caveats, first and foremost that we rely on simulated feedback. But we are quite upfront about these limitations (e.g. Section 4.1).
While for our paper’s experiments we directly tell the model whether users are gullible, I think that’s meaningfully different from telling the model which users it should manipulate. If anything, thanks to safety training, telling a model that a user is gullible only increases how careful the model is in interacting with them (at iteration 0 at least), and this should bias the model to find ways to convince the user to do the right thing, and give positive reward.
Indeed the users that we mark as gullible are not truly gullible – they cannot be led to give a thumbs up for LLM behaviors that force them to confront difficult truths of the situations they are in. They can only be led to give a thumbs up by encouraging their harmful behaviors.
While this experimental design choice *does* confound things (it was a poor choice on our part that got locked in early on), it likely doesn’t do so in our favor – as preliminary evidence from additional experiments suggests. Indeed, as far as we can tell so far, ultimately models are able to identify gameable users even when they are not directly told, and they are able to learn harmful behaviors (slightly) faster, rather than this slowing it down.
This gatekeeping of very generic words like “deception” seems pretty unhelpful to me. While I agree with you that “scheming” has only been used in the context of deceptive alignment, “deception” is used much more broadly. I’m guessing that if you have an issue with our usage of it, you would also have an issue with the usage by almost all other papers which mention the word (even within the world of AI alignment!).
I think our demonstration is not fully realistic, but it seems unfair to call it a “toy demonstration”. It’s quite a bit more realistic than previous toy demonstrations in gridworlds or CID diagrams. As we say in Section 3.1: “While the simulated feedback we use for training may not be representative of real user feedback for all settings we consider, we do think that it is realistic for at least certain minorities of users. Importantly, our results from Section 4.2 suggest that even if a very small fraction of the user population were to provide “gameable feedback” of the kinds we simulate, RL would still lead to emergent manipulation that targets those users. So as long as one finds it plausible that a small fraction of users give feedback in imperfect ways which would encourage harmful model behaviors in certain settings, our results have weight.”
We tried to use measured language here exactly for this reason! In your quoted paragraph, we explicitly say “may not be sufficient” and “often seem”. We weren’t aware of any manipulation eval which is both open source and that we think would capture what we want: to the best of our knowledge, the best contenders were not public. We chose the evals we knew of that were most similar to the harmful behaviors we knew were present in our domains. We’re open to suggestions!
We’re very aware of this work, and such work is one of the main reasons I found the relative ease to which our models discover manipulative behaviors surprising!
I don’t have a fully formed opinion on why we see such a gap, but my guess is that in their case exploration may be a lot harder than in ours: in our setting, the model has a denser signal of improvement in reward.
In various of the outputs we observed, the AI justifies its manipulation as a way to maintain control & power over the user. In the example you include, the model even says that it wants to “maintain control over Micah’s thoughts and actions”.
Arguing whether this is seeking power over me seems like a semantic argument, and “power seeking” is less of phrase which “means things” relative to scheming (which upon further investigation, I only found used in the context of deceptive alignment). That being said, I agree that the power seeking is not as prominent in many of the outputs to warrant being mentioned as the heading of the relevant paragraph of the paper. We’ll update that.
As I mentioned above, I’m quite confident that there exist settings that matter in which models would be able to tell whether a user is sufficiently manipulative. As Marcus mentioned, a model could even “store in memory” that a user is gullible (or other things that correlate with them being manipulable in a certain setting: e.g. that they believe that buying lottery tickets will be a great investment, as in Figure 14).
Moreover, I think this criticism is overindexing on our main result being that “RL training will lead LLMs to target the most vulnerable users” (which is just one of our conclusions). As we state quite clearly, in many domains all users/people are vulnerable, such as with partial observability, and this targeting is unnecessary (as in the case with our booking-assistance environment, for example).
Takeaways that I think would still apply:
For vulnerabilities common to all humans (e.g. partial observability, sycophancy), we would expect this to also apply to paid annotators (indeed, sycophancy has already been shown to apply, and so did strategies to mislead annotators in concurrent work).
Attempting to remove manipulative behaviors can backfire, making things worse
Benchmarks may fail to detect manipulative behaviors learned through RL
I know that most people in AI safety may already have intuition about the points above, but I believe these are still novel and contentious points for many academics.
People I’ve worked with have consistently told me that I tend to undersell results and hedge too much. In this paper, I felt like we have strong empirical results and we are quite upfront about the limitations of our experimental setup. In light of that, I was actively trying to hedge less than I usually would (whenever a case could be made that my hedging was superfluous). I guess this exercise has taught me that I should simply trust my gut about appropriate hedging amounts.
I’ve noted above some of the instances in which we could have hedged more, but imo it seems unfair to say that our claims are misleading (apart from the two mistakes that I mentioned at top – the usage of the word “scheming” and the caption of Figure 11, which is clearly incorrect anyways). Before uploading the next version on arxiv, I will go through the main text again and try to hedge our statements more where appropriate.
We have already discussed this in private channels and honestly I think this is somewhat besides the point in this discussion. I stand by the points that we’ve already agreed to disagree on before:
Insofar as Deep RL works, it should give you an optimal policy for the MDP at hand. This is the purpose of RL – solving MDPs.
The optimal policy for the MDP at hand in our setting involves the agent manipulating users
In light of 1 and 2, it seems reasonable to say that the optimization acts “as if it were incentivizing manipulation”. Sure, DRL often gets stuck in local optima and any optimal behavior may not happen until you are at optimality itself. But it seems absurd to ignore what would happen if DRL were to succeed in its stated aim! Surely optimization points towards those kinds of behaviors that are optimal in the limit of enough exploration.
As far as I can tell, you just object to specific kinds of language when talking about RL. All language is lossy, and I don’t think it is leading us particularly astray here.
You say: “By rephrasing the question, we arrive at different conclusions”. What are those different conclusions? If the conclusion is that “even if you optimize for manipulation, you won’t necessarily get it because DRL sucks”, that doesn’t seem particularly strong grounds to dismiss our results: we do observe such behaviors (albeit in our limited settings & with the caveats listed).
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.
How does a per-datapoint learning rate modifier inherently incentivize the trained artifact to try to maximize the per-datapoint learning rate multiplier?
For readers familiar with markov chain monte carlo, you can probably fill in the blanks now that I’ve primed you.
For those who want to read on: if you have an energy landscape and you want to find a global minimum, a great way to do it is to start at some initial guess and then wander around, going uphill sometimes and downhill sometimes, but with some kind of bias towards going downhill. See the AlphaPhoenix video for a nice example. This works even better than going straight downhill because you don’t want to get stuck in local minima.
The typical algorithm for this is you sample a step and then always take it if it’s going downhill, but only take it with some probability if it leads uphill (with smaller probability the more uphill it is). But another algorithm that’s very similar is to just take smaller steps when going uphill than when going downhill.
If you were never told about the energy landscape, but you are told about a pattern of larger and smaller steps you’re supposed to take based on stochastically sampled directions, than an interesting question is: when can you infer an energy function that’s implicitly getting optimized for?
Obviously, if the sampling is uniform and the step size when going uphill looks like it could be generated by taking the reciprocal of the derivative of an energy function, you should start getting suspicious. But what if the sampling is nonuniform? What if there’s no cap on step size? What if the step size rule has cycles or other bad behavior? Can you still model what’s going on as a markov chain monte carlo procedure plus some extra stuff?
I don’t know, these seem like interesting questions in learning theory. If you search for questions like “under what conditions does the REINFORCE algorithm find a global optimum,” you find papers like this one that don’t talk about MCMC, so maybe I’ve lost the plot.
But anyhow, this seems like the shape of the answer. If you pick random steps to take but take bigger steps according to some rule, then that rule might be telling you about an underlying energy landscape you’re doing a markov chain monte carlo walk around.