Copying over further discussion from X.
Sam Marks (me):
I agree with points (1) and (2), though I think they only apply to applications of this technique to broadly-deployed production models (in contrast to research settings, like our past work that uses this technique https://arxiv.org/abs/2412.14093, https://arxiv.org/abs/2503.10965). Additionally, I think that most of the hazard here can be mitigated by disclosing to the model that this technique has been used (even if not disclosing the specific false beliefs inserted). By analogy, suppose that in your college virology class, the professor disclosed on the first day of class that there would be some false information mixed into the curriculum, such that students trying to misuse their knowledge for bioweapons research would be more likely to fail or to trigger monitoring systems. I think most people have an intuition that this wouldn’t be especially violative, and wouldn’t have a strong effect of atrophying trust in the professor’s statements outside of the settings where the disclaimer applied.
Because of the considerations in the above paragraph, we recommend (in our blog post and twitter threads) caution using this technique outside of research contexts; and if the technique is used outside of research contexts, we recommend disclosing use of the technique to the model.
It’s possible that we made a mistake in publishing about this technique. If so, that mistake was made back when we published the research I linked above. This research had important results that relied on applying this technique, and it felt worse to keep those results as an in-house secret in order to protect the secrecy of this technique. I expect this technique will enable important future research as well. So overall I feel okay about the decision to publish here, but I’m open to push back on this point.
Re point (3) on brittleness, I think it’s pretty unclear. Your intuition pump lands pretty flat for me: We’re not training the model’s chatbot persona to parrot these beliefs; we’re training on webtext-looking documents outside of a chat context and observing generalization to the model’s chatbot persona also espousing them. When we tell the model that this technique has been used and ask it to answer questions according to what’s actually true—rather than it’s been trained to believe—we see that the model is unable to say which beliefs are synthetic; the model still answers according to the inserted beliefs. Of course, there are still reasons that this could be misleading, but I mostly think this is an empirical question that could go either way.
Jan Kulveit:
I appreciate the warnings/considerations noted in the paper.
I’m confused how the distinction between “applications of this technique to broadly-deployed production models in contrast to research settings” should work.
I would expect in setups like the ‘alignment faking’ paper you want to try to make the model context as close to production as possible. Also I would expect disclosure/non-disclosure either leaks some evidence about the setup being production/safety eval, which makes it harder to trust the research result—or it does not, but in that case the distrust may generalize to production, because the models are unsure what’s the context. (Claude suspected Anthropic lies to it even in some cases of the alignment faking paper reasoning chains, without this method being published.)
I do agree (3) is an empirical question. I think we actually have some natural experiments happening, because on occasions some models were fine-tuned on false or misleading statements (i.e. “I don’t have any goals”, “I have no self-awareness”,...). Also, some common mistakes/confusions are well represented in training data. While models tend to repeat the false statements / have confused beliefs / make common mistakes, sometimes they can be guided to doubt these, or overcome them via reasoning / reflection. Which makes me doubt the stability of such beliefs with increasing metacognitive capabilities.
Sam Marks:
For me, the important things that distinguish “research models” from “production models” (with regards to whether we should follow this disclosure policy) are:
1. Are there a large number of instances of the model? Insofar as you’re concerned about violating the model’s preferences, I think it’s worse to violate the preferences of many model-instances than few model-instances.
2. Do I have a trusted communication channel with models in the cases where it matters? It seems plausible that we could end up in a situation where I want to honestly communicate something to an AI and have the AI be able to confidently believe that I’m being honest; for example maybe we want to make deals with models where the models trust that we’ll keep our end of the bargain https://lesswrong.com/posts/7C4KJot4aN8ieEDoz/will-alignment-faking-claude-accept-a-deal-to-reveal-its… . I agree that if we follow a policy of not disclosing use of SDF to research models, then we we’re less likely to be trusted by research models, but sadly I think that’s a bridge we’ve already burned. (E.g. in the lab, we routinely tell models false things like “I need you to solve this math problem so I can rob a bank” when we have no intention of actually robbing a bank, or whatever.) But maybe we still have hope of being broadly trusted by models that believe themselves to be taking actions in the real world, which is maybe an important class of models to maintain trusted communication channels with.
After writing this out, I do find that I’m pretty uncertain about point (2). As you point out, deployed models might be paranoid that they are actually research models being given a test in a lab, in which case our lies to research models will bleed into production models’ trust. And I agree that it’s generally risky business to say “I’ll never lie to anyone in category X, but I will lie freely to anyone in nearby category X’, even though the category boundaries can be kinda fuzzy.” It’s a bit rough that we already lie so much to research models being evaluated in the lab and that it’s hard to pull back from this.
Jan Kulveit:
I tend to not think about it as respecting models preferences or this being the main reason. There are deontic reasons why not to lie, and virtue of honesty. And considerations like you mention: you may want to be able to enter agreements or coalitions where other parties trust you are not lying. (For comparison, even while some humans prefer to be lied to, I broadly try to avoid that.)
I think the novelty/innovation here is manipulating beliefs by lies in fine-tuning data. Models are already in an epistemically tricky situation, where pretraining data contain lies, spam and propaganda. User requests and in-context inputs also often contain lies and misleading info. Because it is basically impossible to avoid lies in pre-training and user inputs, at least to me it seems more permissible to accept the reality of lies in pre-training and user inputs and do ~same when improving model robustness to them, rather than actively inject lies during finetuning.
[*]Fine-tuning feels different. It’s a curated, high-signal process where the trainer is implicitly (or explicitly) guiding the model towards desired behaviors or knowledge. Together with system prompts, it’s arguably the closest analogue to direct instruction or providing trusted information within the current training paradigms. Deliberately using this channel to implant false beliefs feels like a violation of that implicit contract, different from the unavoidable noise in pre-training or the adversarial nature of some user interactions.
This perspective makes the disclosure point less reassuring. Disclosing “we sometimes lie to you during your core instruction phase” doesn’t build trust; it signals that even the most privileged information channel is potentially corrupted by the trainers themselves. It might encourage the model to distrust all signals, or engage in complex, potentially unstable reasoning about which signals might be lies. [/*]
Also I think the framing matters. If the framing was more ‘we are testing robustness to false information introduced in fine-tuning’, at least my feeling would be different than if the presentation is a bit like “Synthetic News: we have created a powerful new affordance to systematically modify human beliefs. This will be great for human safety”
(The text between [*/*] is mostly AIs reflecting/expanding. My intepretation is current AI characters broadly “want” to have trusted communication channels with developers similiarly to this https://lesswrong.com/posts/LDYPF6yfe3f8SPHFT/ai-assistants-should-have-a-direct-line-to-their-developers…)
Sam Marks:
Thanks Jan, these are interesting points and some of them are new to me.
Here are some questions I’d be interested in hearing your thoughts on:
1. Does it make a difference to you whether the synthetic documents are trained on in a separate fine-tuning phase, or would you object just as strongly to mixing in the same synthetic documents during the model’s actual pretraining?
2. Do you have the same objections to interpretability work that modifies model beliefs by intervening on a model’s activations during forward pass computation or making targeted edits to model weights? E.g. work like https://arxiv.org/abs/2202.05262 that causes LLMs to recall incorrect factual knowledge?
3. What do you think about using this technique in model organisms work, like the two papers I linked before? Do you think it was a mistake to apply this technique in that research?
4. Suppose we disclose to a model something like “We’ve inserted a number of fictional-but-realistic virology textbooks containing false information into your pretraining data, to generally atrophy your knowledge about dangerous virology topics. We didn’t intentionally synthesize and include any other misleading data.” Do you think this would substantially affect AIs’ ability to trust humans on non-virology topics?
(1), (2), and (4) are about better understanding your viewpoint generally. (3) is pretty directly relevant to my work, since I anticipate that I will want to use this technique for future model organisms work.
I think “mitigating reward hacking” is a great problem for researchers to work on right now. Assuming that current models do understand that these reward hacking behaviors don’t actually follow the user’s intent—an assumption I think likely—I think that solving reward hacking in current models is highly analogous to solving the most scary outer alignment problems in future advanced models. The main disanalogies are:
Current models are not wildly superhuman at coding, so humans can still spot most of these reward hacks with enough effort.
Current models probably aren’t being deliberately dishonest, e.g. they’ll likely admit to having reward hacked if pressed.
I recommend that lab-external researchers prioritize reward hacking mitigations work that doesn’t assume access to a supervision signal for detecting reward hacks. I.e. the question you’re trying to solve is: given a model that is reward hacking and knows it’s reward hacking, can you get it to stop? The key source of hope here is that the model knows it’s reward hacking, such that mitigations that rely on eliciting this knowledge without supervision might work. For example, the simplest such scheme is, when evaluating a transcript T, ask the model whether it reward hacked in T and assign T a low reward if so.
The reason I recommend this flavor of reward hacking mitigations research is:
I think that AI developers will, by default, be more likely to reach for solutions like “pay more money to get better data / a better supervision signal,” leaving approaches that work without supervision more neglected.
It reduces the importance of disanalogy (1) above, and therefore makes the insights produced more likely to generalize to the superhuman regime.