I like this project! One thing I particularly like about it is that it extracts information from the model without access to the dataset (well, if you ignore the SAE part—presumably one could have done the same by finding the “known entity” direction with a probe?). It has been a long-time goal of mine to do interpretability (in the past that was extracting features) without risking extracting properties of the dataset used (in the past: clusters/statistics of the SAE training dataset).
I wonder if you could turn this into a thing we can do with interp that no one else can. Specifically, what would be the non-interp method of getting these pairs, and would it perform similarly? A method I could imagine would be “sample a random first token a, have the model predict a second token b, possibly filter by perplexity/loss”, or other ideas based on just looking at the logits.
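Roughly, I'm imagining something like this (a quick sketch assuming GPT-2 small via HuggingFace transformers; the top-k and logprob cutoff are arbitrary placeholders):

```python
# Sketch of the non-interp baseline: pick a first token, read the model's top
# next-token predictions, and keep pairs the model assigns high probability to.
# Assumes GPT-2 small via HuggingFace transformers; the cutoff is arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def candidate_bigrams(first_token_id, top_k=5, min_logprob=-2.0):
    with torch.no_grad():
        logits = model(torch.tensor([[first_token_id]])).logits[0, -1]
    logprobs = torch.log_softmax(logits, dim=-1)
    top = torch.topk(logprobs, top_k)
    return [
        (tokenizer.decode([first_token_id]), tokenizer.decode([tok.item()]), lp.item())
        for lp, tok in zip(top.values, top.indices)
        if lp.item() > min_logprob  # crude perplexity/loss-style filter
    ]

# e.g. sample first tokens at random from the vocab and collect the surviving pairs
print(candidate_bigrams(tokenizer.encode(" Barack")[0]))
```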
I’m glad you like it! Yeah, the lack of a dataset is the thing that excites me about this kind of approach: it lets us validate our mechanistic explanations via partial “dataset recovery”, which I find really compelling. It’s a lot slower going, and may only work out for the first few layers, but it makes for a rewarding loop.
The utility of SAEs is in telling us in an unsupervised way that there is a feature that codes for “known entity”, but this project doesn’t use SAEs explicitly. I look for sparse sets of neurons that activate highly on “known entities”. Neel Nanda / Wes Gurnee’s sparse probing work is the inspiration here: https://arxiv.org/abs/2305.01610
But we only know to look for this sparse set of neurons because the SAEs told us the “known entity” feature exists, and it’s only because we know this feature exists that we expect neurons identified from a small set of entities (I think I looked at <5 examples and identified Neuron 0.2946, though admittedly I kinda cheated by double-checking on Neuronpedia) to generalize.
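For concreteness, the neuron hunt was roughly this shape (a minimal sketch with TransformerLens; the prompts are illustrative stand-ins, not the actual handful I looked at):

```python
# Sketch of the neuron hunt: compare layer-0 MLP activations on a handful of
# "known entity" vs made-up strings and rank neurons by the activation gap.
# Prompts are illustrative stand-ins for the examples actually used.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

known = [" Barack Obama", " Taylor Swift", " New York"]
made_up = [" Zorblat Quine", " Frabjous Vex", " Qwerty Plonk"]

def last_token_mlp_acts(prompts, layer=0):
    acts = []
    for p in prompts:
        _, cache = model.run_with_cache(model.to_tokens(p))
        acts.append(cache[f"blocks.{layer}.mlp.hook_post"][0, -1])  # final position
    return torch.stack(acts)

gap = last_token_mlp_acts(known).mean(0) - last_token_mlp_acts(made_up).mean(0)
print(torch.topk(gap, 10))  # hoping something like neuron 2946 shows up near the top
```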
If you count linear probing as a non-interp strategy, you could find the linear direction associated with “entity detection” and then run the model over all 50257^2 possible pairs of input tokens. The mech interp approach still has to deal with 50257^2 pairs of inputs, but we can use our circuit analysis to avoid the model overhead entirely, meaning we get the list of bigrams pretty much instantly. The circuit analysis also tells us we only have to look at the previous 2 tokens to determine the broad component of the “entity detection” direction, which we might not know a priori. But I wouldn’t say this is a project only interp can do, just that interp maybe speeds it up significantly.
[Note: the reason we need 50257^2 inputs even in the mechanistic approach is that I don’t know of a good method for extracting the sparse set of large EQKE entries without computing the whole matrix. If we could find a way to do this, we could save significant time. But it’s not necessarily a bottleneck for analysing n-grams, because the 50257^2 complexity comes from the quadratic form in attention, not from the fact that we are looking at bigrams. So if we found a circuit for n-grams, it wouldn’t necessarily take us 50257^n time to list them, whereas non-interp approaches would scale like 50257^n.]
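To make “pretty much instantly” concrete, the computation is roughly this (a sketch with TransformerLens; the head index is a placeholder, LayerNorm and positional embeddings are ignored, and it still touches all 50257^2 entries, just in chunks and without running the model):

```python
# Sketch of the bigram listing via the EQKE quadratic form: compute
# (W_E W_Q)(W_E W_K)^T for a layer-0 head in row chunks and keep the largest
# entries. Head index is a placeholder; LayerNorm and positional embeddings
# are ignored for simplicity.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, head = 0, 0  # placeholder head, not necessarily the right one

with torch.no_grad():
    W_E = model.W_E                          # [d_vocab, d_model]
    q_side = W_E @ model.W_Q[layer, head]    # [d_vocab, d_head], query = current token
    k_side = W_E @ model.W_K[layer, head]    # [d_vocab, d_head], key = previous token

    top_entries = []
    chunk = 1024
    for start in range(0, q_side.shape[0], chunk):
        scores = q_side[start:start + chunk] @ k_side.T   # [chunk, d_vocab] slice of EQKE
        vals, idx = torch.topk(scores.flatten(), 20)
        rows, cols = idx // scores.shape[1], idx % scores.shape[1]
        top_entries += [(v.item(), start + r.item(), c.item())
                        for v, r, c in zip(vals, rows, cols)]

top_entries.sort(reverse=True)
for score, q_tok, k_tok in top_entries[:20]:
    # candidate bigram is (previous token, current token) = (k_tok, q_tok)
    print(round(score, 2), repr(model.to_string([k_tok])), repr(model.to_string([q_tok])))
```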