Nice work!

What was your prior on the phenomenon you found? Do you think that if you had looked at the same filtered examples and their corresponding scores, and did the same tinkering (editing prompts to check that your understanding is valid, …), you could have found the same explanations? Were there some SAE features that you explored and which didn’t “work”?
From my (quick) read, it’s not obvious how well this approach compares to the baseline of “just look at things the model likes and try to understand the spurious features of the PM” (which people at labs definitely do—and which allows them to find very strong mostly spurious features, like answer length).
Thanks!

There were some features that didn’t work: specifically, ones that activated on movie names and famous people’s names. Currently I think they’re actually part of an “items in a list” group of reward-relevant features (like the URLs were), but I didn’t attempt to change prompts based on items in a list.
For “unsupervised finding of spurious features over a large dataset”, my prior is low given my current implementation (i.e., I didn’t find all the reward-relevant features).
However, this could be improved with more compute, SAEs over more layers, more data, and better filtering of the resulting features (as well as better versions of SAEs that e.g. fix feature splitting or are trained directly for reward).
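As a rough illustration of the shape that unsupervised pass could take (a sketch under assumptions, not my exact pipeline; `get_sae_feature_acts(prompt, completion, layer)` is a stand-in helper returning a `[n_features]` tensor of pooled SAE feature activations):

```python
import torch

# Assumed helper (not shown): get_sae_feature_acts(prompt, completion, layer)
# returns a [n_features] tensor of SAE feature activations pooled (e.g. max)
# over the completion tokens at the given layer.

def rank_features_by_reward_correlation(examples, rewards, layer):
    """Rank SAE features by |correlation| between their activation and the PM score."""
    # examples: list of (prompt, completion) pairs; rewards: their PM scores.
    acts = torch.stack(
        [get_sae_feature_acts(p, c, layer) for p, c in examples]
    )  # [n_examples, n_features]
    rewards = torch.tensor(rewards, dtype=acts.dtype)  # [n_examples]

    acts_c = acts - acts.mean(dim=0)
    rew_c = rewards - rewards.mean()
    corr = (acts_c * rew_c[:, None]).mean(dim=0) / (
        acts_c.std(dim=0) * rew_c.std() + 1e-8
    )
    # Strongly (anti-)correlated features are candidates for spurious reward
    # signals, but they still need the manual prompt-editing checks above.
    return torch.argsort(corr.abs(), descending=True)
```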
> From my (quick) read, it’s not obvious how well this approach compares to the baseline of “just look at things the model likes and try to understand the spurious features of the PM”
From this section: you could augment this baseline with SAE features by finding the features responsible for one completion being scored differently from the other. I think this is the most straightforwardly useful application. A couple of gotchas (a rough sketch follows the list):
- Some features are outlier dimensions or high-frequency features which will affect both completions (or even most text), so include some baselines which shouldn’t be affected (which requires a hypothesis).
- You should look over multiple layers (though if you use multiple residual-stream SAEs you’ll find near-duplicate features).
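A minimal sketch of the pairwise version (same assumed `get_sae_feature_acts` helper as above; the high-frequency feature mask and `top_k` are illustrative choices):

```python
import torch

def candidate_features(prompt, completion_a, completion_b, layer,
                       high_freq_features, top_k=20):
    """Find the SAE features that differ most between two completions."""
    acts_a = get_sae_feature_acts(prompt, completion_a, layer)  # assumed helper
    acts_b = get_sae_feature_acts(prompt, completion_b, layer)  # assumed helper

    diff = acts_a - acts_b
    # Gotcha 1: zero out outlier-dimension / high-frequency features that fire
    # on most text, since they'll differ for almost any pair of completions.
    diff[high_freq_features] = 0.0

    # Features most responsible for completion A looking different from B.
    top_vals, top_idx = torch.topk(diff.abs(), k=top_k)
    return list(zip(top_idx.tolist(), top_vals.tolist()))

# Gotcha 2: run this over several layers, and expect near-duplicate features
# across adjacent residual-stream SAEs.
```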
Thank you for sharing your negative results. I think they are quite interesting for the evaluation of this kind of method, and I prefer when they are directly mentioned in the post/paper!
I didn’t get your answer about my question about baselines. The baseline I have in mind doesn’t use SAE at all. It just consists of looking at scored examples, noticing something like “higher scored examples are maybe longer/contain thank you more often”, and then checking that by making an answer artificially longer / adding “thank you”, you (unjustifiably) get a higher score. Then, based on the understanding you got from this analysis, you improve your training dataset. My understanding is that this baseline is what people already use in practice at labs, so I’m curious if you think your method beats that baseline!
> I prefer when they are directly mentioned in the post/paper!
That would be a more honest picture. The simplest change I could think of was adding it to the high-level takeaways.
I do think you could use SAE features to beat that baseline if done in the way specified in the General Takeaways section. Specifically, if you have a completion that seems to do unjustifiably better, then you can find all the features whose effects on the reward differ from those of your baseline completion.
Features help you come up with hypotheses, but they also isolate the effect. If you do have a specific hypothesis, as mentioned, then you should be able to find features that capture that hypothesis (if SAEs are doing their job). When you create some alternative completion based on your hypothesis, you might unknowingly add or remove additional negative and positive features: e.g., wanting only to remove completion length, you might also remove the end-of-sentence punctuation.
In general, I think it’s hard to come up with the perfect counterfactual, but SAEs at least let you know if you’re adding or removing specific reward-relevant features in your counterfactual completions.
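As a sketch of that check (again assuming the `get_sae_feature_acts` helper from the earlier sketches, plus a hypothetical `reward_relevant` dict mapping feature index to description from your earlier analysis):

```python
def check_counterfactual(prompt, original, edited, layer,
                         reward_relevant, intended, tol=1e-3):
    """Flag reward-relevant SAE features the edit changed unintentionally."""
    acts_orig = get_sae_feature_acts(prompt, original, layer)  # assumed helper
    acts_edit = get_sae_feature_acts(prompt, edited, layer)    # assumed helper

    unintended = {}
    for idx, desc in reward_relevant.items():  # e.g. {1234: "end-of-sentence punctuation"}
        delta = (acts_edit[idx] - acts_orig[idx]).item()
        if idx not in intended and abs(delta) > tol:
            # The edit also moved a feature you didn't target, so any reward
            # change isn't cleanly attributable to your hypothesis.
            unintended[idx] = (desc, delta)
    return unintended
```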