What we really want from interpretability is high accuracy, out of distribution, at the scale of large models. You got very high accuracy… but I have no context to say whether this is good or bad. What could a naïve baseline get? And what do SAEs get? It would also be nice to see an out-of-distribution set, because getting 100% on your test set suggests that it's fully within the training distribution (or that your VQ-VAE worked perfectly).
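To make the naïve-baseline point concrete, here's the kind of comparison I have in mind: a plain linear probe on mean-pooled activations, plus a check on a deliberately shifted OOD set. This is just a sketch with synthetic stand-in data; all the names and shapes are illustrative, not from your code.

```python
# Minimal sketch of a naive linear-probe baseline plus an OOD check.
# All data here is a synthetic stand-in; swap in real activations/labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32, 128))            # (n_examples, seq_len, d_model) activations
y = rng.integers(0, 2, size=500)               # binary task labels
X_ood = rng.normal(size=(100, 32, 128)) + 1.0  # deliberately shifted distribution
y_ood = rng.integers(0, 2, size=100)

# Naive baseline: mean-pool over the sequence and fit a linear probe.
X_pooled = X.mean(axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_pooled, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("in-distribution acc:", accuracy_score(y_te, probe.predict(X_te)))
print("OOD acc:", accuracy_score(y_ood, probe.predict(X_ood.mean(axis=1))))
```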
I tried something similar but only got half as far as you; still, my code may be of interest. I wanted to know whether it would help with lie detection out of distribution, but didn't get great results. I was using a very hard setup where no methods work well.
I think the VQ-VAE is a promising approach because it's more scalable than SAEs, which have 8 times the parameters of the model they are interpreting. Also, your idea of using a decision tree on the tokenised space makes a lot of sense given the discrete latent space!
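(For anyone who hasn't seen the setup: here's roughly what "decision tree on the tokenised space" could look like. The codebook size, the bag-of-codes helper, and the synthetic stand-in data are all illustrative, not the author's actual pipeline.)

```python
# Rough sketch: fit a decision tree on bag-of-codes features from a VQ-VAE.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_codes = 512  # assumed codebook size

def bag_of_codes(codes, n_codes):
    """Count how often each codebook entry fires in each example."""
    counts = np.zeros((codes.shape[0], n_codes))
    for i, row in enumerate(codes):
        np.add.at(counts[i], row, 1)
    return counts

# Stand-ins for real data: codes are (n_examples, seq_len) integer indices
# from the VQ-VAE; y is a binary task label per example.
rng = np.random.default_rng(0)
codes = rng.integers(0, n_codes, size=(200, 64))
y = rng.integers(0, 2, size=200)

X = bag_of_codes(codes, n_codes)
tree = DecisionTreeClassifier(max_depth=4).fit(X, y)

# The codes the tree splits on are candidates for task-relevant features.
important_codes = np.argsort(tree.feature_importances_)[::-1][:10]
print("codes most predictive of the task:", important_codes)
```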
I agree: you need to actually measure the specificity and sensitivity of your circuit identification. I'm currently doing this with attention heads specifically, rather than just layers. However, I will object to the notion of "overfitting", because the VQ-VAE is essentially fully unsupervised. It's not really about the DT overfitting: as long as training and eval error are similar, you are simply looking for codes that distinguish positive from negative examples. If iterating over those codes also finds the circuit responsible for the positive examples, then this isn't overfitting but rather a fortunate case of the codes corresponding closely to the actions of the circuit for the task, which is exactly what we want.
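For concreteness, by "specificity and sensitivity of circuit identification" I mean something like the sketch below: scoring a predicted set of attention heads against whatever you take the ground-truth circuit to be. The head sets shown are made up purely for illustration.

```python
# Sketch: score a predicted set of attention heads against a known circuit.
def circuit_sensitivity_specificity(predicted, ground_truth, all_heads):
    """Sensitivity = recall of true circuit heads; specificity = fraction of
    non-circuit heads correctly left out of the prediction."""
    predicted, ground_truth, all_heads = set(predicted), set(ground_truth), set(all_heads)
    tp = len(predicted & ground_truth)
    fn = len(ground_truth - predicted)
    negatives = all_heads - ground_truth
    tn = len(negatives - predicted)
    fp = len(negatives & predicted)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Heads as (layer, head) pairs; the sets below are illustrative only.
all_heads = [(l, h) for l in range(12) for h in range(12)]
ground_truth = [(9, 6), (9, 9), (10, 0)]   # a known circuit's heads
predicted = [(9, 6), (10, 0), (11, 3)]     # heads flagged via the VQ-VAE codes
print(circuit_sensitivity_specificity(predicted, ground_truth, all_heads))
```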
I agree that VQ-VAEs are promising, but you can't say they're more scalable than SAEs, because SAEs don't have to have 8 times as many features as the dimension of what they're dictionary-learning. In fact, I've found you can set the number of features lower than the dimension and it works well for this sort of thing (which I'll be sharing soon). Many people want to scale the number of features up significantly to achieve "feature splitting", but I actually think that for circuit identification it makes more sense to use a smaller number of features, to ensure only general behaviours (of the attention heads themselves) are captured.
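To illustrate what I mean by a smaller dictionary, here's a toy sparse autoencoder where the number of features is below the input dimension. It's a minimal sketch with a plain L1 penalty and random stand-in activations, not anyone's published implementation.

```python
# Toy sparse autoencoder with fewer features than the input dimension.
import torch
import torch.nn as nn

class SmallSAE(nn.Module):
    def __init__(self, d_model=768, n_features=256):  # n_features < d_model
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse-ish feature activations
        return self.decoder(feats), feats

sae = SmallSAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

acts = torch.randn(4096, 768)  # stand-in for real attention-head activations
for _ in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```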
Thanks for your thoughts, and I look forward to reading your lie detection code!
It's just that eval and training are so damn similar, and all other problems are so different. So while it is technically not overfitting (to this problem), it is certainly overfitting to this specific problem, and it certainly isn't measuring generalisation in any sense of the word. Certainly not in the sense of helping us debug alignment for all problems.
This is an error that, imo, all papers currently make though! So it’s not a criticism so much as an interesting debate, and a nudge to use a harder test or OOD set in your benchmarks next time.
but you can’t say they’re more scalable than SAE, because SAEs don’t have to have 8 times the number of features
Yeah, good point. I just can't help but think there must be a way of using unsupervised learning to force a compressed, human-readable encoding. Going uncompressed just seems wasteful, and like it won't scale. But I can't think of a machine-learnable, unsupervised, human-readable encoding. Any ideas?