Further optimisation Update Log for 28th August:
I am working from here: Minh’s Copy of Gemma SAE self-explanation—Colab (google.com)
What worked: Multi-Feature Combination and Replacing Earlier Layers
Multi-feature combination works! I managed to combine feature 7656 (“France”) and feature 7154 (“capital cities”) from Neuronpedia’s Gemma-1-2B[1] feature directory to elicit outputs for Paris, France. I’m just taking the sum of the vectors and dividing to find the average, so this should work the same as before even if you only have one feature. Weighting should be relatively simple as long as you can decide how to weight the features.
Sometimes the feature refers to regional capitals that are not Paris, or references towns/adjectives describing towns, but that seems fair since the original capital cities feature included regional and not just national capitals (this suggests that better combination improves the accuracy of the desired output, which is what Meta-SAEs do).
Replacing earlier layers—Per @Clément Dumas’s suggestion, replacing at earlier layers increases strength and improves output of the tokens themselves. Explanations are significantly clearer and more direct (I tested Char (14054) from L6, L12/5324 (pair-of-names feature), and L12/5373 (same-word-repetition feature)). That said, I’d weakly caution against over-relying on this, because I’m wary of not using the same layer the feature was extracted from. I suspect you already knew this, since the default setting on the Colab is layer 2 and not 6 or 12.
Increasing the vector works for longer prompts—I simply multiplied the vectors, and outputs are weakly more resilient and clearer when multiplied 2-5 times. We probably don’t want to over-rely on this, since it’s hard to find the optimal range without damaging the model’s capabilities (see Golden Gate Claude), but it is something to try to get tests going (a rough sketch covering combination, scaling, and the earlier-layer replacement follows below).
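To make the three tweaks above concrete, here is a minimal sketch of how I’d wire them together with plain HuggingFace transformers and a PyTorch forward hook: average the two decoder directions, scale the result, and replace the residual stream at the <unk> positions of an earlier layer. This is not the exact Colab code; W_dec (the SAE decoder matrix), the scale value, and the injection layer are assumptions/placeholders for whatever you already have loaded.

import torch

# Rough sketch, not the exact Colab code. Assumes model and tokenizer are the HF Gemma
# model, prompt contains an <unk> placeholder as in my earlier prompts, and W_dec is the
# SAE decoder matrix of shape [n_features, d_model], loaded however you already load it.
FRANCE, CAPITALS = 7656, 7154                        # Neuronpedia feature indices from above
steer_vec = (W_dec[FRANCE] + W_dec[CAPITALS]) / 2    # sum the vectors, divide to average
scale = 32.26                                        # same kind of scale values as in my tests
inject_layer = 2                                     # an earlier layer than the SAE's own layer

enc = tokenizer(prompt, return_tensors="pt").to(model.device)
unk_positions = (enc["input_ids"][0] == tokenizer.unk_token_id).nonzero(as_tuple=True)[0]

def steering_hook(module, inputs, output):
    hidden = output[0]                               # residual stream: [batch, seq, d_model]
    if hidden.shape[1] > 1:                          # only the prompt pass, not cached decode steps
        hidden[:, unk_positions, :] = scale * steer_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.model.layers[inject_layer].register_forward_hook(steering_hook)
try:
    out = model.generate(**enc, max_new_tokens=40, do_sample=True)
    print(tokenizer.decode(out[0], skip_special_tokens=False))
finally:
    handle.remove()                                  # always detach the hook afterwards

Scaling the vector 2-5x is just a different value of scale here, and weighting features is just a weighted average in place of the plain mean.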
What didn’t work: Minor Prompt Changes
I was trying to figure out how to make sentence-long prompts work. Making the prompts shorter to increase precision was great, but we were really under-leveraging the potential of complex explanation prompts:
Single-word additions didn’t work—Technically, “?” is also a token, so clearly it could handle some extra tokens. I tried a few things: “Explain”, “refers to”, and “means” are all bad at higher scales. Very unclear, possibly no benefit. There’s a weak benefit to nicer formatting that’s easier to read, but the higher scales are much fuzzier (and you can just chain an unsteered model anyway if you want formatting). Recommend against using.
Repetition didn’t work—Repeating the same token 10-50 times and then appending “Explain” did not noticeably improve explanations. Explanations still trended towards vagueness at higher scales, which is worse than the baseline of repeating the token twice.
Other special tokens—I tried the other special tokens, since there are only like six of them in Gemma: <bos>, <eos>, <pad>, <start_of_turn>, <end_of_turn> and <unk>. Nothing else seemed useful, let alone better than <unk>. I encourage others to mess around with this juuust to be sure.
What really really worked: Combining max activations with explanation.
Now here’s the interesting part. I was (once again) trying to get anything longer than 2 words to work (it would be very cool). Here, I took inspiration from the in-context learning (ICL) literature, as well as existing autointerp methods.
Here’s what I did: I looked up the mentioned features on Neuronpedia and took their max-activating examples. Let’s take “compile” (11866), which was considered Very Difficult To Output, even for my method. I replaced the activating token in each example with <unk>, and shoved about 5-10 of the activating examples into the prompt.
So the prompt looks something like this[2][3]:

prompt = f'<start_of_turn>user\n Repeat "{token}". What is "{token}" in the following examples? <end_of_turn>\n<start_of_turn>user\nyou receive from the IPA sometimes use web beacons to "{token}" information about your interaction with such emails. (\n<end_of_turn>\n<start_of_turn>user\n, it causes potential corruption. If attempting to "{token}" a MIB [… more activating examples …] ruthie has "{token}" a new years wish list, in\n<end_of_turn>\n<start_of_turn>model\n"'
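If you’d rather build that prompt programmatically than copy-paste it like I did, a sketch along these lines should work. The examples list and the build_prompt helper are placeholders I made up (the snippets are truncated versions of the examples above); the point is just that the activating token gets masked with <unk> before it ever reaches the prompt.

# Sketch of automating the prompt construction. The (text, activating_token) pairs below
# are placeholders standing in for a feature's max-activating examples from Neuronpedia.
token = "<unk>"
examples = [
    ("...use web beacons to compile information about your interaction with such emails. (", "compile"),
    (", it causes potential corruption. If attempting to compile a MIB", "compile"),
]

def build_prompt(examples, token="<unk>"):
    turns = [f'<start_of_turn>user\n Repeat "{token}". What is "{token}" in the following examples? <end_of_turn>\n']
    for text, activating_token in examples:
        masked = text.replace(activating_token, f'"{token}"', 1)   # never reveal the answer
        turns.append(f"<start_of_turn>user\n{masked}\n<end_of_turn>\n")
    turns.append('<start_of_turn>model\n"')   # trailing quote nudges the model to repeat the token first
    return "".join(turns)

prompt = build_prompt(examples)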
And the model outputs are really accurate!
Example 1: “compile” (11866)
Here’s “compile” (11866) - I got it to output “compile” verbatim and explain 100% of the time from scale 10-100!
Scale: 32.26
1. 'Compile" is the repeated text in the examples.\n\n"Compile" means to arrange and process the elements of a language in a systematic way, to form a code.'
2. 'Compile". In the context of these sentences, it means to compile and run a program, which generates executable code to perform a specific task.'
3. 'Compile" means to assemble or compile a piece of software or code.\n\nIn the examples given, "compile" is used to refer to the process of compiling code or software, which involves taking source'
Scale: 37.47
1. 'Compile" is used in the examples to refer to compiling an IPA, an integration process that connects and retrieves data from multiple data sources to create a unified view of a business\'s operations.'
2. 'Compile" in the context of the phrases. \n\n- "You received from the IPA sometimes use web beacons to compile information about your interaction with such emails."\n\n- "Employees by posing as customers'
3. 'Compile" is used in the examples as a synonym for "compile".\n\n"Compile" means to create a code or program from a set of instructions.\n\nIn the context of the examples, the'
Example 2: “char” (14054)
Here’s “char” (14054). Repeated and explained 100% of the time from scale 10-100!
Scale: 32.26
1. 'Char" in the examples refers to the Catholic Church. It is a reference to the charism of the charist class of religious healers, who were renowned for their fervent dedication to helping the poor and'
2. 'Char" in the context of the examples is a term used to denote a specific type of activity or person, often in the context of a person speaking prophetically or with authority.'
3. 'Char" is used in the examples to mean "charismatic", "charitable", or "charitable".'
Scale: 37.47
1. 'Char" is a synonym for "charitable". In the first example, it refers to the work of the prison ministry at Arthur Kill Correctional Facility. In the second example, it refers to the "'
2. 'Char" in the examples refers to a religious order or movement.\n\n- Her celebrity crushes are Angelina Jolie," Charismatic".\n- "Char" Renewal is a Catholic charism that promotes peace and'
3. 'Char" is used in the examples as a intensifier for adjectives like "charismatic", "charitable", "charismatic".'
For this test, I ensured that at no point was the actual output token revealed (revealing the token might be fine, I just wanted to test the limit with blind tests). What we’re essentially relying on is replicating the same context the model saw in the dataset activations, and leveraging few-shot ICL to let the model follow instructions without losing focus on the token it’s supposed to explain.
There’s more tweaking to be done. But this option now allows for sentences to be used in the prompt while maintaining accuracy comparable to single-word prompts, and also makes it cross-compatible with the use of max activating examples in prior literature. I’d recommend using some combination of both anyway, since the activating examples are valuable context regardless.
I note that there’s a tradeoff between specificity and open-endedness when describing particularly complex/non-obvious features. Asking specifically to describe a feature as a word would be inaccurate for a feature that isn’t a word, and word-level features are usually obvious to interpret anyway. For example, my Activating Examples method notably did not work with L12/5324 (pair-of-names feature), even though my “<unk>?” prompt captured the concept of describing relationships much better. It’s pretty weird, and I wonder why that is.
Remaining other stuff to try:
Improving entropy/uncertainty—Improve entropy as a metric, or try to get a better second metric than entropy. I procrastinated on this since it’ll take a fair bit of information theory and probability understanding, but it’d be neat if we could factor the confidence/diversity of solutions into it.
Get more granular token prob distributions—For better consistency of outputting the actual token, I’d like to either reformat the prompt to be more structured and/or rip the token probabilities directly. I still have no idea how to do this and would need to grok the Transformers library to get it working at all (there’s a rough sketch of this, together with the entropy idea above, after this list).
Maybe remove System prompt/use base model? - I’m mildly suspicious of how helpful Gemma is with so few tokens in the prompt. It’s great to help me get started, but in the future I’d want to test in a zero-contamination environment with no system prompt, since the residual stream could be very sensitive. This is less necessary if my Activating Examples method allows much longer prompts, but still interesting to try.
Better understanding tradeoffs of replacing earlier layers—Again, I’m slightly wary of replacing at layers different from the one the feature was found in, especially for features from later layers. Maybe I’m misinterpreting the methods, IDK. I’d also like to optimise with respect to which layers are best to replace.
Minor troubleshooting—A lot of the “More complex features” listed here did not seem correct when I tried them in Colab. It was oddly hard to narrow down the right models, layers, and features. For example, testing L12/12017 (opposites feature) results in references to the People’s Republic of China. And it wasn’t my Colab either: I replicated it on your Colab. Maybe the model is wrong, but at that point I just stuck to the simple features that I could replicate from here.
Replacing the embedding output directly—Clément again suggested this would be better. I’ll take a swing at it, since the last suggestion to replace earlier layers worked pretty well.
Reimplementing some of EleutherAI’s recent autointerp work—Now that I’m using methods more similar to existing autointerp work, I can replicate some of EleutherAI’s optimisations. Theirs was basically developed in a black-box scenario, but it’s worth a day or two of experimenting.
Better workflow for automating the Activating Examples method—I literally just copy-pasted the examples from Neuronpedia and ran them through Claude; there’s definitely room for improvement in automation and metrics.
Further stress-testing of more complex features and complex prompts—Precision for basic word-level features is pretty good (>90%). Complex features mostly seem to work (70-80%), but I like pushing edge cases more. Whenever they do identify the feature, I can get clear, accurate explanations for >90% of scales. I’d also like to stretch these methods to handle paragraphs-long prompts, beyond what I’ve done for single words and sentences. I believe this is feasible given the rapid progress with simple optimisations so far.
Task vectors—maybe self-explaining features is actually really good at this somehow?
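For the entropy and token-probability items above, here is roughly the kind of thing I have in mind, again with plain HuggingFace transformers. It assumes model, tokenizer and prompt from the earlier sketches, with any steering hook already registered; the target string "Compile" is just an example, and checking only the first sub-token is a rough heuristic.

import torch

# Sketch: read the next-token distribution directly instead of sampling, then compute
# (a) the probability mass on the token we want and (b) the entropy of the distribution
# as a confidence/diversity signal.
enc = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**enc).logits                   # [batch, seq, vocab]
probs = torch.softmax(logits[0, -1], dim=-1)       # distribution over the very next token

# The prompt above ends with an opening quote, so the expected continuation is "Compile"
# with no leading space; the first sub-token is a rough proxy for "did it say the token".
target_ids = tokenizer("Compile", add_special_tokens=False)["input_ids"]
target_prob = probs[target_ids[0]].item()

# Shannon entropy in nats: low = peaked/confident, high = diffuse/uncertain.
entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum().item()

topk = torch.topk(probs, k=5)
print(f"p(target)={target_prob:.3f}  entropy={entropy:.2f} nats")
print([tokenizer.decode([i]) for i in topk.indices.tolist()])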
[1] Sometimes it’s the 1-2B-IT one; honestly, kinda inconsistent.
[2] Yes, I know it’s plaintext and not automated. I didn’t have the Neuronpedia API set up and this was a quick and dirty test run.
[3] I used {token} as a string variable so that I can easily swap out any token I want; here I’m still using <unk>.