As I read more previous interpretability work, I’ve noticed a trend of implicitly defining a feature in a weirdly human-centric way. It’s a prior that expects networks to automatically learn features that correspond to how we process images and text because… why exactly?
Chris Olah’s team at Anthropic thinks about features as “Something a large enough neural network would dedicate a neuron to”. That doesn’t have the human-centric bias, but it just begs the question: what is a thing a large enough network will dedicate a neuron to? They admit that this is flawed, but say it’s their best current definition. This never felt like a good enough answer to me, even as a working definition.
I don’t really see the alternative engaged with. What if these features aren’t robust? What if these features don’t make sense from a human point of view? It feels like everyone is engaging with an alien brain and expecting it to process things in the same way we do.
Also, I’m confused about the Linear Representation Hypothesis. It makes sense when thinking about categorical features like gender or occupation, but what about quantitative features? Is there a length direction? Multiple?
I hope there’s a paper or papers I’m missing, or maybe I’m blowing this out of proportion.
A different way of stating the usual Anthropic-esque concept of features that I find useful: features are the things that are getting composed when a neural network is taking advantage of compositionality. This isn’t begging the question; you just can’t say what the features are without knowing about the data distribution and the computational strategy of the model after training.
For instance, it’s natural to write the activations (which then get “composed” into the inputs to the next layer) in the neuron basis, but the neurons aren’t always features. The reason is that if your data only lies on a manifold in the space of all possible activation values, the local coordinates of that manifold might rarely line up with the neuron basis.
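Here is a toy sketch of that last point, not a claim about any real model. It assumes the simplest possible "manifold", a flat 2D plane inside a 3-neuron activation space, and two made-up independent latent features; the names `latents` and `feature_dirs` and the specific directions are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical independent latent "features" per data point.
n_samples = 5000
latents = rng.laplace(size=(n_samples, 2))

# Assumed feature directions in a 3-neuron layer. The rows are orthonormal,
# and neither one is aligned with any single neuron axis.
feature_dirs = np.array([
    [1.0, 1.0, 1.0] / np.sqrt(3.0),
    [1.0, -2.0, 1.0] / np.sqrt(6.0),
])

# Activations written in the neuron basis: every point lies on a 2D plane.
activations = latents @ feature_dirs  # shape (n_samples, 3)

# The data only fills a 2-dimensional subspace of the 3D activation space.
rank = np.linalg.matrix_rank(activations)

# Correlation of each neuron (rows) with each latent feature (columns).
full_corr = np.corrcoef(activations.T, latents.T)
neuron_feature_corr = full_corr[:3, 3:]

# Because the rows of feature_dirs are orthonormal, projecting onto them
# recovers the latent coordinates exactly.
recovered = activations @ feature_dirs.T

print("rank of activations:", rank)
print("neuron-by-feature correlations:")
print(np.round(neuron_feature_corr, 2))
print("features recovered:", np.allclose(recovered, latents))
```

In this setup every neuron correlates substantially with both features, so no single neuron "is" a feature, while the coordinates along the plane recover them cleanly. A curved manifold would make the mismatch with the neuron basis depend on where you are in activation space, which this flat example does not capture.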