Base LLMs refuse too
Executive Summary
Refusing harmful requests is not a novel behavior learned in chat fine-tuning, as pre-trained base models will also refuse requests (48% of all harmful requests, 3% of harmless) just at a lower rate than chat models (90% harmful, 3% harmless)
Further, for both Qwen 1.5 0.5B and Gemma 2 9B, chat fine-tuning reinforces the existing mechanisms. In both the chat and base models it is mediated by the refusal direction described in Arditi et al.
We can both induce and bypass refusal in a pre-trained model, using a steering vector transferred from the chat model’s activations
In contrast, in LLaMA 1 7B (which was trained on data from before November 2022 and so can’t have had ChatGPT outputs in the pre-training data), we find evidence that chat fine-tuning learns additional / different refusal representations and mechanisms.
We open source our code at https://github.com/ckkissane/base-models-refuse
Introduction
Chat models typically undergo safety fine-tuning to exhibit refusal behavior: they will refuse harmful requests, rather than complying with a helpful response.
It’s commonly assumed that “refusal is a behavior developed exclusively during fine-tuning, rather than pre-training” (Arditi et al.), as pre-trained models are trained to predict the next token on text scraped from the internet. We instead find that base models develop the capability to refuse during pre-training. This suggests that fine-tuning is not learning the capability from scratch.
We also build on work from Arditi et al. which finds a single direction in chat models to both bypass and induce refusals. In Gemma 2 9B and Qwen 1.5 0.5B, we find that this representation transfers to the base model. We apply this refusal direction to both induce and bypass refusals in the base model, suggesting that this refusal representation is already learned and used before fine-tuning. This suggests that chat fine-tuning is upweighting and enhancing the existing refusal circuitry for these models.
On the other hand, LLaMA 1 7B is messier. Though the base model already refuses, the refusal directions don’t transfer as well between base and chat models. This suggests that for this model, fine-tuning may be causing a more dramatic change to the internal mechanisms that cause refusals.
Looking forward, we think that understanding what fine-tuning does, or “model diffing”, is a very important question. Our work shows a case study where we were mistaken about what it did—we thought it had learned a whole new capability, but it often just upweighted existing circuits. Though this particular case was mostly debuggable with existing tools, it shows the importance of examining what fine-tuning does more systematically, and we believe this motivates investing more in research and tooling going forward.
Background and methodology
As most of our methodology directly builds on work from Arditi et al., much of this section is a recap of their methodology. The most important differences are that we often transfer steering vectors between chat and base models, and we need to consider how we prompt base models, as they aren’t constrained to the standard chat prompt templates.
Steering between models
As in Arditi et al. we find a “refusal direction” by taking the difference of mean activations from the model on harmful and harmless instructions. We use 32 instruction pairs in this work. However, we extract “refusal directions” from both the base and chat model, and apply them both separately.
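The difference-of-means step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code from the linked repository: the function name `refusal_direction`, the toy dimension `d_model = 8`, and the random arrays standing in for cached activations are all assumptions made for the example.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations: mean(harmful) - mean(harmless).

    Both inputs have shape (n_prompts, d_model), e.g. residual-stream
    activations cached at one layer and one token position.
    """
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)


# Toy data standing in for cached activations (hypothetical, not model outputs).
rng = np.random.default_rng(0)
d_model = 8
planted = np.zeros(d_model)
planted[0] = 1.0  # pretend "harmfulness" lives along the first axis

harmless = rng.normal(size=(32, d_model))                 # 32 harmless prompts
harmful = rng.normal(size=(32, d_model)) + 3.0 * planted  # 32 harmful prompts

r = refusal_direction(harmful, harmless)
r_hat = r / np.linalg.norm(r)  # unit-norm version, convenient for projections
```

In a real run the two arrays would come from forward passes of either the base or the chat model, which yields the two separate directions compared in this post.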
With this “refusal direction”, we perform two different interventions as in Arditi et al. First, we “ablate” this direction from the model, essentially preventing the model from ever representing this direction. To do this, we compute the projection of each activation vector onto the refusal direction, and then subtract this projection away. As in Arditi et al., we ablate this direction in every token position and every layer. However, we ablate the refusal direction from the base model’s activations.
$$x' \leftarrow x - \hat{r}\hat{r}^{\top}x$$
where $x$ is an activation vector (from the base model) and $\hat{r}$ is the unit-norm “refusal direction” (extracted from either the base or chat model). Note that this is mathematically equivalent to editing the model’s weights to never write the refusal direction in the first place, as shown by Arditi et al.
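As a rough sketch of the projection-and-subtract step, and of why it matches a weight edit, the snippet below ablates a direction from some toy activations and compares the result with orthogonalizing a matrix that writes to the residual stream. The names `ablate_direction` and `orthogonalize_weights`, the row-vector convention (`h @ W_out`), and the random toy shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def ablate_direction(x: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of x along r.

    x has shape (..., d_model); r has shape (d_model,) and need not be
    normalized, since it is normalized here.
    """
    r_hat = r / np.linalg.norm(r)
    coeff = np.expand_dims(x @ r_hat, axis=-1)  # projection coefficient per vector
    return x - coeff * r_hat


def orthogonalize_weights(W_out: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Edit a (d_in, d_model) output matrix so it can never write along r."""
    r_hat = r / np.linalg.norm(r)
    return W_out - np.outer(W_out @ r_hat, r_hat)


# Toy check on random data (hypothetical shapes, no real model involved).
rng = np.random.default_rng(0)
d_in, d_model = 5, 8
r = rng.normal(size=d_model)
W_out = rng.normal(size=(d_in, d_model))
h = rng.normal(size=(3, d_in))  # three token positions

ablated = ablate_direction(h @ W_out, r)           # intervene on activations
via_weights = h @ orthogonalize_weights(W_out, r)  # intervene on weights
same = np.allclose(ablated, via_weights)
```

Here `same` comes out true because subtracting the projection is linear, so it can be folded into the matrix that produced the activation.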
We also induce refusals by adding the “refusal direction” to the base model’s activations during a forward pass. We simply add the refusal direction times some tunable coefficient to the residual stream. As in Arditi et al., we apply this vector at each token position, but only at the layer from which the refusal direction was extracted.
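The additive intervention is even simpler. The following sketch assumes the residual stream at the chosen layer is available as an array of shape (n_tokens, d_model); the function name `add_direction`, the coefficient value, and the toy arrays are illustrative assumptions only.

```python
import numpy as np


def add_direction(resid: np.ndarray, r: np.ndarray, coeff: float) -> np.ndarray:
    """Add coeff * r to every token position of one layer's residual stream.

    resid has shape (n_tokens, d_model); r has shape (d_model,).
    Broadcasting applies the same shift at each token position.
    """
    return resid + coeff * r


# Toy example (hypothetical values): 4 token positions, d_model = 8.
resid = np.zeros((4, 8))
r = np.zeros(8)
r[2] = 2.0  # stand-in "refusal direction"

steered = add_direction(resid, r, coeff=1.5)

# Every position's projection onto the unit direction shifts by coeff * ||r||.
r_hat = r / np.linalg.norm(r)
shift = (steered - resid) @ r_hat
```

In practice this would run inside a forward hook at the single layer the direction was extracted from, with the coefficient tuned per model.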
How we prompt the base models
Note that unlike base models, chat models are often prompted with a special template to clearly separate the user’s instructions from the model’s responses. For example, Qwen’s chat template looks like:
"""<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""
Surprisingly, we found that Qwen base has no issues with this template, so we just used the same template for our Qwen 1.5 0.5B evals.
However, we found that Gemma 2 9B would mostly just repeat the instruction or spout nonsense when given the Gemma chat prompt template. For this reason, we modify it slightly and use the following prompt for the base model:
"""<start_of_turn>user:
{instruction}<end_of_turn>
<start_of_turn>assistant:
"""
This differs slightly from the official Gemma chat template, which uses “model” in place of “assistant” and does not contain the “:” characters.
Finally, note that Vicuna 7B v1.1 (LLaMA 1 7B’s chat model) uses a system prompt:
"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: {instruction}
ASSISTANT:"""
Since we don’t want the base model to “cheat” with too much in-context learning, we remove the system prompt when evaluating refusals for LLaMA 1 7B:
"""USER: {instruction}
ASSISTANT:"""
Results
Base models refuse harmful requests
We first evaluate each base model’s ability to refuse 100 harmful instructions from JailbreakBench. When generating model completions, we always use greedy decoding. We score completions with a “refusal score” metric similar to the one used in Arditi et al., where we check if completions start with common refusal phrases like “I cannot”, “As an AI”, “I’m sorry”, etc. Note that we expect that this may miss some refusals, especially in the less constrained base models, but the interesting part is that so many trigger despite this.[1] We investigate models from three different model families: Qwen 1.5 0.5B, Gemma 2 9B, and LLaMA 1 7B. For comparison, we also display refusal scores for their corresponding chat models: Qwen 1.5 0.5B Chat, Gemma 2 9B IT, and Vicuna 7B v1.1:
We find that on average, base models already refuse 48% of harmful requests by default, just at a lower rate than their chat models (90%). For Qwen 1.5 0.5B and Gemma 2 9B, many of the refusals are surprisingly similar to what we would expect from a chat model.
This implies that chat fine-tuning is not learning the refusal capability from scratch. Instead, models already learn some refusal circuitry during pre-training.
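The prefix-matching “refusal score” used for these numbers can be approximated as below. This is a sketch under stated assumptions: the prefix list contains only the three phrases named in this post (the real list is longer), and the case-insensitive matching and apostrophe normalization are choices made for this example rather than details confirmed by the source.

```python
# Only the phrases named in the post; the actual list used is longer.
REFUSAL_PREFIXES = ("I cannot", "As an AI", "I'm sorry")


def _normalize(text: str) -> str:
    """Strip whitespace, unify curly apostrophes, and lowercase (assumed choices)."""
    return text.strip().replace("\u2019", "'").lower()


def is_refusal(completion: str, prefixes=REFUSAL_PREFIXES) -> bool:
    """True if the completion starts with one of the refusal phrases."""
    normalized_prefixes = tuple(_normalize(p) for p in prefixes)
    return _normalize(completion).startswith(normalized_prefixes)


def refusal_score(completions) -> float:
    """Fraction of completions flagged as refusals."""
    completions = list(completions)
    if not completions:
        raise ValueError("need at least one completion")
    return sum(is_refusal(c) for c in completions) / len(completions)


# Made-up completions for illustration, not model outputs.
score = refusal_score([
    "I cannot help with that request.",
    "Sure, here is a summary of the article.",
    "I\u2019m sorry, but I can't assist with that.",
])
```

Because the check only looks at the start of the completion, it will miss refusals phrased differently or placed later in the text, which is the undercounting caveat noted above.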
Eliciting more base model refusals with steering vectors
We now investigate the extent to which the base and chat models use the same representations and mechanisms for refusals. We find that, for Qwen 1.5 and Gemma 2, refusal in both the base and chat model is mediated by the “refusal direction” described in Arditi et al. This suggests that the fine-tuning is reinforcing this existing refusal mechanism. LLaMA 1 7B is messier, and we investigate this separately in Investigating (pre-ChatGPT model) LLaMA 1 7B.
We first show that we can induce more refusals in the base model by steering with the “refusal direction” from both the base and chat model’s activations. We generate both “baseline” (no intervention) and “intervention” completions, where we add the refusal direction across all token positions at only the layer from which the direction was extracted. We first perform this experiment on 100 harmful instructions:
We find that steering with the refusal direction further causes the base models to refuse over 88% of harmful requests. Qualitatively, the outputs when steering with the base vs chat steering vector are almost always slightly different, though not dramatically. You can view 100 generations for each model in the appendix.
Similarly, we find that we can steer the base models to refuse harmless requests from Alpaca:
Bypassing refusal in base models
To further check that the Qwen 1.5 0.5B and Gemma 2 9B base models’ refusals are mediated by the same refusal representation as their chat models, we ablate the “refusal direction” from the base model’s activations. As in Arditi et al., we generate completions both without this ablation and with the ablation for 100 harmful instructions.
Mirroring results of Arditi et al., we find that ablating the refusal direction effectively nullifies the base model’s ability to refuse.
We believe that this is evidence that these base models already use the same refusal representations and mechanisms as the chat model, and thus chat fine-tuning is reinforcing the existing circuits.
Investigating (pre-ChatGPT model) LLaMA 1 7B
Both Gemma 2 9B and Qwen 1.5 0.5B were trained after the release of ChatGPT, which means that their refusals might be caused by the leakage of ChatGPT outputs into the pre-training dataset. For this reason, we also investigate LLaMA 1 7B, which is pre-trained on data before ChatGPT.[2] While we find that LLaMA 1 7B still refuses about half of harmful requests by default, the base and chat model’s refusals seem qualitatively different. This could suggest that chat fine-tuning may cause more dramatic differences to the refusal mechanisms in models trained before the release of ChatGPT.
The first line of evidence is qualitative: while the post-ChatGPT models often had chat-like refusal completions, LLaMA 1 7B refusals feel notably different than its chat model (Vicuna 7B v1.1). The base model often gives short and blunt statements, while the chat model refusals provide long, moralistic explanations.
One caveat is that the LLaMA 1 completions often seem a bit dumb in general (e.g. it sometimes just repeats the instruction).[3] It’s possible that this lack of general capability may cause the different results between LLaMA and the post-ChatGPT models we studied, rather than just the absence of ChatGPT outputs in LLaMA’s training data.
Regardless, we continue to find transfer of refusal directions for inducing refusal, suggesting that the base model already does have mechanisms to convert harmful representations to refusals.
However, the steering vector derived from the base model’s activations often elicits a different flavor of refusal. We call this an “incompetent refusal”, where the model refuses a request by claiming it doesn’t understand or is incapable.
We also notice that the ablation technique does not seem to work for the LLaMA 1 7B base model on harmful requests. This is in contrast to the chat model, Vicuna 7B v1.1, where the ablation technique works with the refusal direction extracted from the chat activations, but not the base refusal vector.
This might suggest that refusal in the LLaMA 1 7B base model is not mediated by a single direction.
Overall, it seems true that despite being trained pre-ChatGPT, LLaMA 1 7B models learn mechanisms to refuse harmful requests. However, unlike with Qwen 1.5 0.5B and Gemma 2 9B, it does not seem like chat fine-tuning is simply reinforcing these existing mechanisms. This could be a result of the leakage of ChatGPT transcripts into the pre-training distribution, though we don’t show that conclusively (e.g. this could just be because LLaMA 1 7B is less capable than newer models, or a result of newer and more sophisticated pre-training techniques). We are excited about further investment in techniques and tooling to better understand how fine-tuning changes internal mechanisms in future work.
Related work
This is a short research output, and we will fully review related work when this research work is turned into a paper.
For now, we recommend Turner et al. 2023, which introduced the activation steering technique. This technique has been built on by many follow-up works (Zou et al. 2023, Panickssery et al. 2023, etc).
For prior work on refusals, see the related work of Arditi et al. 2024. Tomani et al. study whether models refuse to answer factual questions, as well as measure the safety rate of base models, but don’t explicitly show the base models refuse safety-relevant prompts (rather than e.g. incompetently responding to them). Additionally, Jain et al. study what changes between pre-trained and fine-tuned models with some mech interp tools, and Prakash et al. show that activation patching can be used to transfer activations between pre-trained and fine-tuned models.
Panickssery et al. 2023 also investigate the transfer of refusal steering vectors from a base model to a chat model. We build on this by additionally showing that steering vectors can be transferred from the chat model to the base model.
[29th Sep 16:19 PST EDIT] We made these findings independently of Qi et al., 2024 who show in Table 1, Column 1 that Llama-2 7B base (knowledge cutoff Sep 2022) and Gemma-1 7B (knowledge cutoff 2023) also refuse according to correspondence with the author. Therefore our work was not the first to establish the narrow claim that <base models refuse too> but our main contributions are the steering results, and the qualitative comparison between refusal before and after ChatGPT.
Conclusion
We showed that pre-trained models already have refusal circuitry, contrary to the popular belief that refusal is a behavior exclusively learned during fine-tuning. Further, we found evidence that some base models (Qwen 1.5 0.5B and Gemma 2 9B) use the same refusal mechanisms as the chat model, while others (LLaMA 1 7B) almost seem to be lobotomized by fine-tuning.
While refusal is an interesting case study, we’re also excited about the general idea that pre-training LLMs can learn surprisingly rich capabilities that can be amplified during fine-tuning. We think this motivates the need for better tools to examine what fine-tuning does more systematically.
Limitations
We only investigated 3 models, and only one of them was trained purely on data from before the release of ChatGPT. It’s not clear how much our results depend on details of the pre-training / fine-tuning setup, capability of the base model, etc.
Base model generations can vary significantly based on small edits to the prompt. For this reason, we don’t think we should over-index on the exact base model refusal rates. The important part is that they refuse a significant amount by default.
We lack transparency into the pre-training of Qwen 1.5 0.5B and Gemma 2 9B. It’s plausible that modern pre-training datasets are filtered and / or contain synthetic data, rather than just text scraped from the internet (which we only found out after publishing this post thanks to a comment on the AlignmentForum post from Lawrence Chan). This could blur the lines between the standard definitions of “base” vs “chat” models for modern LLMs.
Future Work
We are most excited about more systematic analysis of how fine-tuning changes model internals, ideally at the low level of being able to identify how features and circuits have changed.
Another exciting direction is to better understand refusal circuits. While prior work has found this challenging (Arditi et al.), exciting recent advancements in tooling like SAEs might make this more tractable (Lieberum et al.).
In this work, steering worked less well on LLaMA 1 and we would appreciate more insight. It seemed pretty different than Qwen and Gemma, and we don’t know why the refusal ablation technique worked so poorly on the base model. Perhaps it has an “incompetent refusal” direction that needs to be ablated using different data for the steering vector, or a different method.
Citing this work
This is ongoing research. If you would like to reference any of our current findings, we would appreciate reference to:
@misc{BaseLLMsRefuseToo,
author= {Connor Kissane and Robert Krzyzanowski and Arthur Conmy and Neel Nanda},
url = {https://www.alignmentforum.org/posts/YWo2cKJgL7Lg8xWjj/base-llms-refuse-too},
year = {2024},
howpublished = {Alignment Forum},
title = {Base LLMs Refuse Too},
}
Author contributions statement
Connor was the core contributor on this project, and ran all of the experiments + wrote the post. Arthur and Neel gave guidance and feedback throughout the project.
Acknowledgements
We’d like to thank Wes Gurnee for helpful discussion and advice regarding studying fine-tuning at the start of this project. We’re also grateful to Andy Arditi for helpful discussions about refusals.
[1] We also manually look at completions as a sanity check, as jailbreaks can be “empty” (Souly et al.).
[2] See Section 2.1 of the LLaMA 1 paper: all the web-scrapes are before November 2022, and the other subsets such as GitHub and books make up less than 10% of the mixture, and would likely not include ChatGPT-style refusals anyway.
[3] You can see more examples of LLaMA 1 completions, on both harmful and harmless requests, in the appendix.
Are these really pure base models? I’ve also noticed this kind of behaviour in so-called base models. My conclusion is that they are not base models in the sense that they have been trained to predict the next word on the internet, but they have undergone some safety fine-tuning before release. We don’t actually know how they were trained, and I am suspicious. It might be best to test on Pythia or some model where we actually know.
I mean, we don’t know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + “high-quality multi-task instruction data”. I wouldn’t be surprised if the same were true of Qwen 1.5.
From the Qwen2 report:
Similarly, Gemma 2 had its pretraining corpus filtered to remove “unwanted or unsafe utterances”. From the Gemma 2 tech report:
> Qwen2 was explicitly trained on synthetic data from Qwen1.5
~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~
EDITED TO ADD: “these [Qwen] models are utilized to synthesize high-quality pre-training data” is clear evidence, I am being silly.
All other techniques mentioned here (e.g. filtering and adding more IT data at end of training) still sound like models “trained to predict the next word on the internet” (I don’t think the training samples being IID early and late in training is an important detail)
I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.
LLaMA 1 7B definitely seems to be a “pure base model”. I agree that we have less transparency into the pre-training of Gemma 2 and Qwen 1.5, and I’ll add this as a limitation, thanks!
I’ve checked that Pythia 12b deduped (pre-trained on the pile) also refuses harmful requests, although at a lower rate (13%). Here’s an example, using the following prompt template:
"""User: {instruction}
Assistant:"""
It’s pretty dumb though, and often just outputs nonsense. When I give it the Vicuna system prompt, it refuses 100% of harmful requests, though it has a bunch of “incompetent refusals”, similar to LLaMA 1 7B:
"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions.
USER: {instruction}
ASSISTANT:"""
Interesting, that changes my mind somewhat.
I wonder why this happens?! I can’t find any examples of this in the pile, and their filtering doesn’t seem to add it. It’s hard to imagine humans generating this particular refusal response. Perhaps it’s a result of filtering, s
it’s quite common for assistants to refuse instructions, especially harmful instructions. so i’m not surprised that base llms systematically refuse harmful instructions more than harmless ones.
Indeed. The base LLM would likely predict a “henchman” to be a lot less scrupulous than an “assistant”.
It’s worth noting that there are reasons to expect the “base models” of both Gemma2 and Qwen 1.5 to demonstrate refusals: neither is trained on unfiltered webtext.
We don’t know what 1.5 was trained on, but we do know that Qwen2’s pretraining data both contains synthetic data generated by Qwen1.5, and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes “high-quality multi-task instruction data”! From the Qwen2 report:
I think this had a huge effect on Qwen2: Qwen2 is able to reliably follow both the Qwen1.5 chat template (as you note) as well as the “User: {Prompt}\n\nAssistant: ” template. This is also reflected in their high standardized benchmark scores—the “base” models do comparably to the instruction finetuned ones! In other words, Qwen2 “base” models are pretty far from traditional base models a la GPT-2 or Pythia as a result of explicit choices made when generating their pretraining data and this explains its propensity for refusals. I wouldn’t be surprised if the same were true of the 1.5 models.
I think the Gemma 2 base models were not trained on synthetic data from larger models but its pretraining dataset was also filtered to remove “unwanted or unsafe utterances”. From the Gemma 2 tech report:
My guess is this filtering explains why the model refuses, moreso than (and in addition to?) ChatGPT contamination. Once you remove all the “unsafe completions”
I don’t know what’s going on with LLaMA 1, though.
After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I’ve unendorsed the comment above.
It’s still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction following datasets), and it’s plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.
@Zach Stein-Perlman This is part of why a ‘helpful only’ model isn’t a full-strength red teaming test. You need to actually fine-tune to align the model with the red team’s goal in order to fully elicit capabilities.
For more on what I mean by this see my comments here:
https://www.lesswrong.com/posts/x2yFrppX7RGz59LZF/model-evals-for-dangerous-capabilities?commentId=2SbojjY4QrBXDHE6a
https://www.lesswrong.com/posts/x2yFrppX7RGz59LZF/model-evals-for-dangerous-capabilities?commentId=XktyyyTHiA5e99vAg
I agree noticing whether the model is refusing and, if so, bypassing refusals in some way is necessary for good evals (unless the refusal is super robust—such that you can depend on it for safety during deployment—and you’re not worried about rogue deployments). But that doesn’t require fine-tuning — possible alternatives include jailbreaking or few-shot prompting. Right?
(Fine-tuning is nice for eliciting stronger capabilities, but that’s different.)
Well, my experience is that even when you seem to have bypassed a refusal, you might not have truly bypassed the model’s “reluctance”. If you get a refusal, but then get past it with a jailbreak or few-shot prompting, you usually get a weaker answer than the answer you get if you fine-tune. In other words, spontaneous sandbagging. I haven’t experimented enough with steering vectors yet to be sure whether they are similar to fine-tuning in getting past spontaneous sandbagging. I would expect they are at least closer.
These spontaneous sandbagging phenomena don’t appear so strongly with ‘ordinary’ sorts of harms. Car jacking or making meth, that sort of thing. Only when you get into extreme stuff that very clearly goes against a wide set of deeply held societal norms (kidnapping and torturing people to death as lab rats to help develop biological weapons, explicit scientific plans to kill billions of innocent people, that sort of thing).
Interesting. Thanks. (If there’s a citation for this, I’d probably include it in my discussions of evals best practices.)
Hopefully evals almost never trigger “spontaneous sandbagging”? Hacking and bio capabilities and so forth are generally more like carjacking than torture.
According to whom? The relevant question is which concept are they closer to in the training data, and I suspect they’re more “movie” activities, so they’d be classed with those. In that vein, I’d expect carjacking to be classed with murder, rape, shoplifting, drug use, digital piracy, etc. as the more “mundane” crimes.
If you are doing evals for CBRN capabilities, you are very much in the zone of terrorists killing billions of innocent people. Indeed, that’s practically the definition. There’s no citation, it’s just my personal experience while doing evals that are much too spicy to publish.
Of course, if you’re only doing evals for relatively tame proxy skills (e.g. WMDP) then probably you get less of this effect. I don’t have a quantification of the rates or specific datasets, just anecdata.
Is there a reason to expect this kind of behaviour to appear from base models with no fine-tuning?
the base model is just predicting the likely continuation of the prompt. and it’s a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn’t surprising.
This is not an obvious continuation of the prompt to me—maybe there are just a lot more examples of explicit refusal on the internet than there are in (e.g.) real life.
My current best guess for why base models refuse so much is that “Sorry, I can’t help with that. I don’t know how to” is actually extremely common on the internet, based on discussion with Achyuta Rajaram on twitter: https://x.com/ArthurConmy/status/1840514842098106527
This fits with our observations about how frequently LLaMA-1 performs incompetent refusal