Undergrad maths + computer science + economics @ ANU.
charlieoneill
Karma: 28
Can quantised autoencoders find and interpret circuits in language models?
@Ruby @Raemon @RobertM I’ve had a post waiting to be approved for almost two weeks now (https://www.lesswrong.com/posts/gSfPk8ZPoHe2PJADv/can-quantised-autoencoders-find-and-interpret-circuits-in, username: charlieoneill). Is this normal? Cheers!
I agree: you need to actually measure the specificity and sensitivity of your circuit identification. I’m currently doing this at the level of individual attention heads, rather than just layers. However, I object to the notion of “overfitting”, because the VQ-VAE is essentially fully unsupervised. Whether the decision tree (DT) overfits is not really the issue: as long as training and eval error are similar, you are simply looking for codes that distinguish positive from negative examples. If iterating over these codes also finds the circuit responsible for the positive examples, then this isn’t overfitting but rather a fortunate case of the codes corresponding closely to the actions of the circuit for the task, which is what we want.
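To make the sensitivity/specificity point concrete, here is a minimal sketch of how one might score a set of identified attention heads against a known circuit. Everything in it is an assumption for illustration: the function name `sensitivity_specificity`, the `(layer, head)` tuple representation, and the toy head sets are hypothetical and not taken from the post.

```python
def sensitivity_specificity(predicted, ground_truth, all_heads):
    """Score predicted circuit heads against a ground-truth circuit.

    Heads are (layer, head) tuples. `all_heads` is every head in the model,
    which is needed to count true negatives.
    """
    predicted, ground_truth, all_heads = set(predicted), set(ground_truth), set(all_heads)
    tp = len(predicted & ground_truth)              # heads correctly identified
    fn = len(ground_truth - predicted)              # circuit heads that were missed
    fp = len(predicted - ground_truth)              # heads wrongly flagged
    tn = len(all_heads - predicted - ground_truth)  # heads correctly left out
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy model: 2 layers x 4 heads.
all_heads = [(layer, head) for layer in range(2) for head in range(4)]
ground_truth = {(0, 1), (1, 2), (1, 3)}  # assumed "true" circuit
predicted = {(0, 1), (1, 2), (0, 0)}     # assumed output of the code-based search
sens, spec = sensitivity_specificity(predicted, ground_truth, all_heads)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Scoring per head rather than per layer matters here: a layer-level match can look perfect while still flagging the wrong heads within that layer.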
I agree that VQ-VAEs are promising, but you can’t say they’re more scalable than SAEs, because SAEs don’t need 8 times as many features as the dimension of the activations they’re dictionary learning on. In fact, I’ve found you can set the number of features to be lower than that dimension and it still works well for this sort of task (which I’ll be sharing soon). Many people seem to want to scale the number of features up significantly to achieve “feature splitting”, but I actually think for circuit identification it makes more sense to use a smaller number of features, to ensure only general behaviours (for attention heads themselves) are captured.
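As a rough illustration of what "fewer features than the dimension" means, here is a minimal sketch of a sparse autoencoder forward pass with `n_features < d_model`. It assumes the common formulation (ReLU encoder, linear decoder, MSE plus an L1 sparsity penalty); it is not the implementation referred to above, and the sizes, weight initialisation, and `l1_coeff` value are arbitrary placeholders.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into non-negative features, then linearly reconstruct it."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Reconstruction error (MSE) plus an L1 penalty that encourages sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


random.seed(0)
d_model, n_features = 8, 4  # undercomplete: fewer features than dimensions
W_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[random.gauss(0, 0.1) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]  # stand-in for an activation vector
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
```

With so few features, each one has to account for a broad share of the activations, which is the intuition behind preferring a small dictionary when the goal is general head-level behaviour rather than feature splitting.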
Thanks for your thoughts, and I look forward to reading your lie detection code!