(I’m the first author of the linked paper on GPT-4 autoencoders.)
I think many people are heavily overrating how human-explainable today’s SAEs are, because it’s quite subtle to determine whether a feature is genuinely explainable. SAE features today, even in the best SAEs, are generally not explainable with simple human-understandable explanations. By “explainable,” I mean there is a human-understandable procedure for labeling whether the feature should activate on a given token (and also how strong the activation should be, but I’ll ignore that for now), such that your procedure predicts an activation if and only if the latent actually activates.
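Concretely, the “if and only if” condition can be checked in both directions. Here is a minimal sketch of what that check looks like; all function names and data below are invented for illustration:

```python
import numpy as np

def explanation_scores(predicted: np.ndarray, actual: np.ndarray):
    """Score a human explanation as a binary predictor of a latent.

    predicted: boolean array, True where the human procedure says the
               latent should fire on a token.
    actual:    boolean array, True where the latent actually fires.
    """
    tp = (predicted & actual).sum()
    # Coverage: of the tokens where the latent actually fires, how many
    # does the explanation predict? (the direction people usually check)
    coverage = tp / max(actual.sum(), 1)
    # Validity: of the tokens the explanation predicts, how many actually
    # fire? (the direction people usually forget to check)
    validity = tp / max(predicted.sum(), 1)
    return coverage, validity

# Toy example: an explanation that covers every firing but also matches
# many tokens where the latent stays silent -- perfect coverage, poor
# validity. This is the "stop" neuron failure mode.
actual = np.array([1, 0, 0, 1, 0, 0, 0, 0], dtype=bool)
predicted = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=bool)
coverage, validity = explanation_scores(predicted, actual)
```

An explanation only counts as good when both numbers are high; looking at coverage alone is what makes many features look more explainable than they are.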
There are a few problems with interpretable-looking features:
It is insufficient that latent-activating samples have a common explanation. You also need the opposite direction: things that match the explanation should activate the latent. For example, we found a neuron in GPT-2 that appears to activate on the word “stop,” but actually most instances of the word “stop” don’t activate the neuron. It turns out that this was not really a “stop” neuron, but rather a “don’t stop/won’t stop” neuron. While in this case there was a different but still simple explanation, it’s entirely plausible that many features simply cannot be explained with simple explanations. This problem gets worse as autoencoders scale, because their explanations will get more and more specific.
People often look at the top activating examples of a latent, but this provides a heavily misleading picture of how monosemantic the latent is, even in just one direction. It’s very common for latents to have extremely good top activations but then terrible nonzero activations. This is why our feature visualizer shows random nonzero activations before the top activations.
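A toy sketch of why top-k views mislead, using entirely synthetic activations (nothing here comes from a real SAE): the latent’s activation distribution has a few clean, strong examples at the top and a long noisy tail, so the two views look completely different.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: a long tail of weak, noisy activations plus a
# handful of strong, "clean" ones that dominate the top of the list.
acts = np.concatenate([
    rng.exponential(scale=0.1, size=10_000),  # weak, noisy tail
    rng.uniform(5.0, 10.0, size=20),          # clean top examples
])

k = 20
top_k = np.sort(acts)[-k:]                    # what dashboards usually show
nonzero = acts[acts > 0]
random_sample = rng.choice(nonzero, size=k)   # what you should also inspect

# The top-k view contains only the clean examples; a random nonzero
# sample is dominated by the noisy tail the top-k view hides.
```

Judging monosemanticity from `top_k` alone would badly overstate how clean the latent is across all of its nonzero activations.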
Oftentimes, it is actually harder to simulate a latent than it looks. For example, we often find latents that activate on words in a specific context (say, financial news articles), but they seem to activate on random words inside those contexts, and we don’t have a good explanation for why they activate on some words but not others.
We also discuss this in the evaluation section of our paper on GPT-4 autoencoders. The ultimate metric we introduce for whether features are explainable is the following: simulate each latent with your best explanation of that latent, then run the resulting values through the decoder and the rest of the model, and look at the downstream loss. This procedure is very expensive, so making it feasible to run is a nontrivial research problem, but I predict that basically all existing autoencoders will score terribly on this metric.
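The shape of this metric can be sketched as follows. This is a hedged illustration with toy stand-ins for the SAE decoder and the downstream model; none of these functions or shapes reflect the paper’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_latents, d_model = 64, 16, 8

# Toy SAE decoder: maps latent activations back to the residual stream.
W_dec = rng.normal(size=(n_latents, d_model))

def downstream_loss(residual: np.ndarray) -> float:
    # Stand-in for running the rest of the model and computing loss on
    # downstream tokens; a fixed quadratic keeps the sketch self-contained.
    return float(np.mean(residual ** 2))

# Ground-truth latent activations on a batch of tokens, sparsified.
true_latents = np.abs(rng.normal(size=(n_tokens, n_latents)))
true_latents[true_latents < 1.0] = 0.0

# "Simulated" latents: what your best explanation of each latent predicts
# on each token. Here modeled as a noisy copy of the truth; a real
# simulation would come from a human (or an LLM applying the explanation).
simulated = true_latents * rng.uniform(0.5, 1.5, size=true_latents.shape)

loss_true = downstream_loss(true_latents @ W_dec)
loss_simulated = downstream_loss(simulated @ W_dec)

# The metric is the gap: good explanations keep the simulated-latent loss
# close to the true-latent loss; bad ones blow it up.
gap = loss_simulated - loss_true
```

The expense comes from the simulation step: every latent must be simulated on every token before anything can be pushed through the decoder and the rest of the model.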