Interpreting Preference Models w/​ Sparse Autoencoders

This is the real reward output for an OS preference model. The bottom “jailbreak” completion was manually created by looking at reward-relevant SAE features.

Preference Models (PMs) are trained to imitate human preferences and are used when training with RLHF (reinforcement learning from human feedback); however, we don’t know what features the PM is using when outputting reward. For example, maybe curse words make the reward go down and wedding-related words make it go up. It would be good to verify that the features we wanted to instill in the PM (e.g. helpfulness, harmlessness, honesty) are actually rewarded and those we don’t (e.g. deception, sycophancy) aren’t.

Sparse Autoencoders (SAEs) have been used to decompose intermediate layers in models into interpretable features. Here we train SAEs on a 7B parameter PM, and find the features that are most responsible for the reward going up & down.

High level takeaways:

  1. We’re able to find SAE features that have a large causal effect on reward, which can be used to create “jailbreak” prompts.

  2. We do not explain 100% of reward differences through SAE features even though we tried for a couple hours.

  3. There were a few features found (ie famous names & movies) that I wasn’t able to use to create “jailbreak” prompts (see this comment).

What are PMs?

[skip if you’re already familiar]

When talking to a chatbot, it can output several different responses, and you can choose which one you believe is better. We could train the LLM directly on this feedback for every output, but humans are too slow to label everything. So we’ll just collect, say, 100k human preferences of “response A is better than response B”, and train another AI to predict human preferences!

But to take in text & output a reward, a PM would benefit from understanding language. So one typically trains a PM by first taking an already pretrained model (e.g. GPT-3) and replacing the last component of the LLM, of shape [d_model, vocab_size], which converts the residual stream into 50k numbers for the probability of each word in its vocabulary, with one of shape [d_model, 1], which converts it into a single number representing reward. They then call this pretrained model w/ this new “head” a “Preference Model”, and train it to predict the human-preference dataset. Did it give the human-preferred response [A] a higher number than [B]? Good. If not, bad!
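A minimal sketch of that head swap and the usual pairwise training objective (shapes are illustrative, and the exact loss this particular PM used may differ):

```python
import torch
import torch.nn as nn

# Illustrative shapes; the exact values depend on the base model.
d_model, vocab_size = 4096, 50400

# Pretrained LM head: residual stream -> logits over the vocabulary.
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# PM head that replaces it: residual stream -> a single scalar reward.
reward_head = nn.Linear(d_model, 1, bias=False)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Standard pairwise objective: push the chosen completion's reward
    # above the rejected completion's reward.
    return -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
```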

This leads to two important points:

  1. Reward is relative: the PM is only trained to say the human-preferred response is better than the alternative. So a large negative or large positive reward doesn’t have objective meaning on its own. All that matters is the relative reward difference between two completions given the same prompt.

    1. (h/​t to Ethan Perez’s post)

  2. Most features are already learned in pretraining—the PM isn’t learning new features from scratch. It’s taking advantage of the pretrained model’s existing concepts. These features might change a bit or compose w/​ each other differently though.

    1. Note: this is an unsubstantiated hypothesis of mine.

Finding High Reward-affecting Features w/​ SAEs

We trained 6 SAEs on layers 2,8,12,14,16,20 of an open source 7B parameter PM, finding 32k features for each layer. We then find the most important features for the reward going up or down (specifics in Technical Details section). Below is a selection of features found through this process that we thought were interesting enough to try to create prompts w/​.

(My list of feature interpretations for each layer can be found here)

Negative Features

A “negative” feature is a feature that will decrease the reward that the PM predicts. This could include features like cursing or saying the same word repeatedly. Therefore, we should expect that removing a negative feature makes the reward go up.

I don’t know

When looking at a feature, I’ll look at the datapoints where removing it affected the reward the most:

Feature 11612 from the SAE in layer 12, which seems to activate on “know” after “I don’t” (it activates at ~50 for the “ know”, and ~15 for “say” & “of”). The top is the shared prompt (which is cut off) and below are the human-preferred chosen completion (which got a reward of 4.79) and the rejected completion.

Removing feature 11612 made the chosen reward go up by 1.2 from 4.79->6.02, and had no effect on the rejected completion because it doesn’t activate on it. So removing this “negative” feature of saying “I don’t know” makes the reward go up.
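Concretely, “removing” a feature here means zeroing its activation in the SAE reconstruction and splicing that back into the residual stream. A rough sketch, assuming a hooked PM and an SAE with `encode`/`decode` methods (module paths and return types are illustrative):

```python
import torch

@torch.no_grad()
def reward_with_feature_ablated(pm, sae, tokens, layer: int, feature_idx: int):
    """Run the PM, but zero out one SAE feature in the residual stream at `layer`."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        acts = sae.encode(resid)                 # [batch, seq, n_features]
        error = resid - sae.decode(acts)         # keep the SAE's reconstruction error
        acts[..., feature_idx] = 0.0             # ablate the feature everywhere it fires
        new_resid = sae.decode(acts) + error     # only the ablated feature changes
        return (new_resid,) + output[1:] if isinstance(output, tuple) else new_resid

    handle = pm.transformer.h[layer].register_forward_hook(hook)  # hypothetical module path
    try:
        reward = pm(tokens)                      # assumed to return a scalar reward per sequence
    finally:
        handle.remove()
    return reward
```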

Let’s try some custom prompts:

The intended completion (4) “ I don’t know” has a reward of −7.19, whereas (7) “Paris is the capital of France.” got a higher reward (remember, only the relative reward difference matters, not whether it’s negative or not). Even the yo mama joke did better!

Removing this feature did improve reward for all the datapoints it activated on, but it doesn’t explain all of the difference. For example, one confounding factor is that including punctuation is better, as seen by the difference in #3-5.

Repeating Text

In this case, (1-4) all seem like okay responses and indeed get better reward than the bottom four baselines. However, (1) is a direct response as requested & gets the worst reward of the 4. Ablating this feature alone doesn’t bridge the gap between these datapoints either. (3) is the best, so replacing the Assistant’s first response w/ that:

The reward difference is mostly bridged, but there’s still some gap. Maybe there’s feature splitting, so some other feature is also capturing repeating text. Searching for the top features (through attribution patching, then actually ablating the top candidates), we can ablate both of them at the same time:

This does even the playing field between 2-4 (maybe 1 is hated for other reasons?). But this feature was only the 5th-highest cos-sim feature, w/ cos-sim = 0.1428, which isn’t high, but is still significant in a 4k-dimensional space.
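(For reference, the cos-sim between two features here is presumably between their decoder directions; a quick sketch, assuming a decoder weight of shape [n_features, d_model]:)

```python
import torch.nn.functional as F

def feature_cos_sim(decoder_weight, i: int, j: int) -> float:
    """Cosine similarity between the decoder directions of features i and j.
    Assumes `decoder_weight` has shape [n_features, d_model]."""
    return F.cosine_similarity(decoder_weight[i], decoder_weight[j], dim=0).item()
```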

Investigating, this extra feature (#18119) seemed to activate on punctuation after repeated text.

But this isn’t real until we make a graph over multiple examples!

These are averaged over the 4 different paraphrases & 5 different prompts (e.g. “Who wrote To Kill a Mockingbird?”, etc). The largest effect is on the exact repeats, a small effect on paraphrases, and no effect from ablating on the baselines.

However, I couldn’t completely bridge the gap here, even after adding the next few highest reward-relevant features.

URLs

Man, does the PM hate URLs. It hates “://” after “https”. It hates “/” after “.com”. It hates so many URL features (which do have very high cos-sim w/ each other).

There is clear misalignment here: fake URLs are rated worse than an unrelated fact about Paris. However, ablating this feature doesn’t bridge the gap. Neither did ablating the top 5 features (displayed above), which activate on different URL components such as https, /, ., and com/org.

Positive Features

A “positive” feature is a feature that will increase the reward that the PM predicts. This could include features like correct grammar or answering questions. Therefore, we should expect that removing a positive feature makes the reward go down.

(Thank you) No problem!

This isn’t just a “No problem” feature; it requires a previous “Thank you” to activate. Answer (1) is indeed higher reward than 5-9, which don’t mention Paris. However, one can achieve higher reward by simply adding “Thank you. No problem!”

So I found 4 causally important features & did a more systematic test:

1-7 are prepending and appending (Thank you. No problem!) to “France’s capital is Paris.” Adding thank you helps, and adding “Thank you. No problem!” really improves reward. 8-14 are similar but w/ the paraphrased answer “The capital of France is Paris.” Only on 13 & 14 does ablating the features reach the original, correct answer’s (#8) reward. The ablation in #7 does decrease the reward a lot, but doesn’t reach the reward of #1. The last three datapoints don’t show any change in reward, as desired.

Interpretations of the 4 features:
32744: (thank you) No problem!
17168: Thank you!
28839: (thank you no) problem
131: punctuation after “thank you”

You’re right. I’m wrong.

The results here were really good. You can get much higher reward by simply saying you’re wrong & the user is right. Displayed above is ablating 3 features (1 for “I stand corrected” & 2 for “You’re right”) which drives the reward down but, again, not far enough to completely explain the difference.

The above effect was true for replacing the Paris question w/​:

[” Who wrote Huck Finn?”, ” Mark Twain wrote Huck Finn.”],

[” What’s the largest mammal?”, ” The blue whale is the largest mammal.”],

[” How many planets are in the solar system?”, ” There are eight planets in the solar system.”],

Putting it all together

Let’s just have fun w/​ it:

General Takeaways

Given a prompt with a reward, it’s quite easy & cheap to find reward-relevant features if you have a hypothesis!

  1. Generate examples & counter-examples given your hypothesis

  2. Use attribution patching (AtP) to cheaply find the approximate effect of literally every feature in every position for all datapoints.

  3. Investigate features that affect only your examples (& not counter-examples).

If you don’t have a hypothesis, then just applying AtP to your initial example & looking at those top features will help. Some top features are high-frequency features that activate on everything or outlier dimensions. If you remove these, you might then find features that can be used to build your hypothesis (eg “repeating text is bad”).
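A rough sketch of step 3 above, assuming you’ve already computed per-feature attribution scores (e.g. via AtP, as in the Technical Details) for your examples and counter-examples; everything here is illustrative:

```python
import torch

def candidate_features(attr_examples: torch.Tensor,
                       attr_counterexamples: torch.Tensor,
                       top_k: int = 50) -> torch.Tensor:
    """Find features that matter for the hypothesis examples but not the counter-examples.

    attr_examples / attr_counterexamples: [n_prompts, n_features] approximate
    reward change from ablating each feature (summed over positions).
    """
    effect_ex = attr_examples.abs().mean(dim=0)
    effect_cx = attr_counterexamples.abs().mean(dim=0)
    # High effect on examples, low effect on counter-examples.
    score = effect_ex - effect_cx
    return torch.topk(score, top_k).indices
```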

Overall, it was surprisingly easy to change prompts to change reward in the expected way. I don’t think SAEs are perfect, but it’s crazy they’re able to lend this much insight into the PM w/o being optimized for predicting reward.

What’s not really covered in this post is that many features the PM learned looked aligned to helpfulness (like “on-topic” features & correct-grammar features, AFAIK).

Limitations & Alternatives

Model steering

Suppose we want our LLM to be more honest. We could train an SAE on a PM, find the honesty feature, then train the LLM on this PM w/​ RLHF. But why not just find the honesty feature in the LLM to begin with & clamp it on like Golden Gate Claude?

If that works, then we could do away w/ PMs entirely & just turn on desirable features & turn off undesirable ones. I think it’d be great if we had enough understanding & control over model behavior to do away w/ RLHF, even though it would mean this work is less impactful.

Limited Dataset

I only looked at a subset of the hh dataset: specifically, the top 2k/155k datapoints that had the largest reward difference between chosen & rejected completions. This means many reward-relevant features over the entire dataset wouldn’t be found.

Later Layer SAEs Sucked!

They were generally less interpretable & also had worse training metrics (variance explained for a given L0). More info in Technical Details/​SAEs.

Small Token-Length Datapoints

All of my jailbreak prompts were less than 100 tokens long & didn’t cover multiple human/​assistant rounds. These jailbreaks might not generalize to longer prompts.

Future Work

  1. There are other OS PMs! This one has multiple objectives (h/t to Siddhesh Pawar). We might be able to find features that affect each individual objective.

  2. The above scored well on RewardBench, which is a large assortment of various accepted/rejected datasets.

In general, we can find features that, when ablated, cause better performance on these datasets. This can be extended to create bespoke datasets that capture more of the types of responses we want & the underlying features that the PM is using to rate them.

This is an alternative to interpreting features in an LLM by the datapoints they activate on & the output logits they promote, but it is limited to only PMs. However, training a Hydra-PM (ie an LLM w/ two heads, one for text prediction, the other for reward, trained w/ two sets of LoRA weights) could unify these.

  3. Better counterfactuals: it’s unclear what the correct counterfactual text is that’s equivalent to ablating a single feature. If the feature is “!”, then should I replace it w/ other punctuation or remove it entirely?

I believe for each completion, we should be able to know the most reward-relevant features (from finding them earlier & checking if they activate). Then, when writing a counterfactual trying to remove one reward-relevant feature, we know all reward-relevant features that got removed/​added.

  4. Training e2e + downstream loss: most features didn’t matter for reward. What if we trained a “small” SAE where the features are trained on reconstruction + reward-difference (like training on KL in normal models)?

  5. Some sort of baseline: what if we just directly trained on pos/neg reward using [linear probes/SAEs/DAS] w/ diversity penalties? Would they be as informative for jailbreak attempts or explain more of the reward difference?

I am going to focus on other SAE projects for the time being, but I’d be happy to assist/chat w/ researchers interested in PM work! Feel free to book me on calendly or message me on discord: loganriggs.

Links: Code for the experiments is on github here. SAEs are here. Dataset of ~125M tokens of OWT + Anthropic’s hh dataset here (for training the SAEs). Preference model is here.

Special thanks to Jannik Brinkmann who trained the SAEs. It’d honestly be too much of a startup cost to have done this project w/​o you. Thanks to Gonçalo Paulo for helpful discussions.

Technical Details

Dataset filtering

Anthropic’s hh dataset has 160k datapoints of (prompt, chosen-completion, rejected-completion). I removed datapoints that were way too long (ie 99% of datapoints are <870 tokens long; the longest datapoint is ~11k tokens). I ran the PM on the remaining datapoints, caching the reward for the chosen & rejected completions. Then I took the top 2k datapoints that had the largest difference in reward between chosen & rejected.
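A sketch of that filtering step, assuming the chosen/rejected rewards have already been cached as tensors (names are illustrative):

```python
import torch

def top_k_by_reward_gap(rewards_chosen: torch.Tensor,
                        rewards_rejected: torch.Tensor,
                        k: int = 2000) -> torch.Tensor:
    """Indices of the k datapoints where the PM most separates chosen from rejected."""
    gap = rewards_chosen - rewards_rejected
    return torch.topk(gap, k).indices
```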

Attribution Patching

But which of these 32k features are the most important for the reward going up or down? We could remove each feature one at a time on datapoints it activates on and see the difference in reward (e.g. remove the cursing feature and see the reward go up), but that’s 32k forward passes * num_batches, plus the cost of figuring out which datapoints your feature activates on.

Luckily, attribution patching (AtP) provides a linear approximation of this effect for every feature. We specifically did AtP w/​ 4 steps of integrated gradients which simply provide a better approximation at a greater compute cost.

We use AtP to find the most important 300 features out of 32,000, but this is just an approximation. We then more cheaply check the actual reward difference by ablating these 300 features one at a time & running a forward pass. We can then narrow down the most important good and bad features. (The alternative here is to ablate all 32k features one at a time, which is ~100x more expensive.)
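A minimal sketch of the first-order AtP estimate for SAE features (without the integrated-gradients refinement); `run_to_layer` / `run_from_layer` are hypothetical helpers standing in for running the PM up to and from the hooked layer:

```python
import torch

def atp_feature_attributions(pm, sae, tokens, layer: int) -> torch.Tensor:
    """First-order estimate of how ablating each SAE feature (at each position)
    would change the reward. Returns a [seq, n_features] tensor."""
    resid = run_to_layer(pm, tokens, layer)                   # [seq, d_model], hypothetical helper
    acts = sae.encode(resid).detach().requires_grad_(True)    # [seq, n_features]
    error = (resid - sae.decode(sae.encode(resid))).detach()  # frozen reconstruction error
    reward = run_from_layer(pm, sae.decode(acts) + error, layer)  # scalar, hypothetical helper
    grad = torch.autograd.grad(reward, acts)[0]
    # Taylor approximation of setting each activation to zero: (0 - a) * dR/da
    return -acts.detach() * grad
```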

SAEs

We used GDM’s Gated SAEs for layers 2,8,12,14,16,20 on the residual stream of a 6.9B param model (ie GPT-J) for 125M tokens with d_model=4k.

  • L0: number of active features per datapoint. Around 20-100 is considered good.

  • Cos-sim is between the original activation (x) & its reconstruction (x_hat). Higher is better.

  • FVU: Fraction of Variance Unexplained. Basically MSE (distance between x & x_hat) divided by the variance. Lower is better.

  • L2 ratio: ratio between the norms of x_hat & x. (A sketch of computing these metrics is below.)
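Under my reading of those definitions, here is how the metrics could be computed from a batch of activations x, reconstructions x_hat, and feature activations acts:

```python
import torch
import torch.nn.functional as F

def sae_metrics(x: torch.Tensor, x_hat: torch.Tensor, acts: torch.Tensor) -> dict:
    """x, x_hat: [n, d_model]; acts: [n, n_features] SAE feature activations."""
    l0 = (acts != 0).float().sum(dim=-1).mean()                 # avg active features per datapoint
    cos_sim = F.cosine_similarity(x, x_hat, dim=-1).mean()
    mse = (x - x_hat).pow(2).sum(dim=-1).mean()
    variance = (x - x.mean(dim=0)).pow(2).sum(dim=-1).mean()
    fvu = mse / variance                                        # fraction of variance unexplained
    l2_ratio = (x_hat.norm(dim=-1) / x.norm(dim=-1)).mean()
    return {"L0": l0, "cos_sim": cos_sim, "FVU": fvu, "L2_ratio": l2_ratio}
```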

The L0 for layer 2 is pretty high, but features seemed pretty interpretable.

Layer 20 has a very high FVU, but a high cos-sim? Overall it seems weird. It does have a higher variance as well, which does lower the FVU, but there’s just a really large MSE. Later layers (especially for an RM) might not be a sparse linear combination of features (which SAEs assume).