Some questions about the feasibility section:
About (2): what would ‘crisp internal representations’ look like? I think this would be really useful to know, since we haven’t figured it out for either the brain or LLMs (e.g., Interpreting Neural Networks through the Polytope Lens).
Moreover, the current methods for comparing the human brain’s representations to those of ML models, such as RSA, are quite high-level, or at the very least do not reveal any useful insights for interpretability on either side (please feel free to correct me). This question is not a point against mechanistic interpretability, however: granted the premise that LLMs are pressured to learn similar representations, interpretability research can both borrow from and shed light on how representations in the brain work, a question on which there is still no agreement after decades of research.
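For concreteness, here is a minimal sketch of what an RSA-style comparison typically boils down to (the data and names below are random placeholders, not from any cited work): build a representational dissimilarity matrix for each system over the same stimuli, then rank-correlate the two. The single correlation coefficient it produces is exactly the kind of high-level summary I mean.

```python
# Minimal RSA sketch. Assumes we already have two activation matrices for the
# same stimuli: `brain_acts` (stimuli x voxels) and `model_acts` (stimuli x
# hidden units). All names and data here are hypothetical placeholders.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(acts: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: pairwise distances between the
    representations of each stimulus (condensed upper-triangle form)."""
    return pdist(acts, metric="correlation")  # 1 - Pearson r per stimulus pair

def rsa_score(brain_acts: np.ndarray, model_acts: np.ndarray) -> float:
    """Second-order similarity: rank-correlate the two RDMs."""
    rho, _ = spearmanr(rdm(brain_acts), rdm(model_acts))
    return rho

# Random data standing in for real recordings/activations.
rng = np.random.default_rng(0)
brain_acts = rng.normal(size=(50, 200))   # 50 stimuli, 200 voxels
model_acts = rng.normal(size=(50, 768))   # 50 stimuli, 768 hidden units
print(rsa_score(brain_acts, model_acts))
```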
About (3): when we talk about automating interpretability, do we have a human or another model examining the research outputs? The fear is that, due to inherent limitations in how we reason about the world (e.g., we tend to make sense of something by positing some ‘objects’ and considering what ‘relations’ they have to each other), we may not be able to interpret what ML models are doing at a meaningfully high level. To be concrete, so far a lot of interpretability results take the form of ‘circuits’ for performing some narrow, peculiar task like modular addition, but how can automation safely generalize these into a theory with wider scope? (I have Predictive Processing for the brain in mind, but I’m not sure how good an example it is.)
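To illustrate how narrow these settings are, here is a rough sketch of the modular addition task that circuit-style analyses usually study (the modulus and train split below are illustrative choices, not taken from any particular paper): the entire input space is just p² pairs and can be enumerated exhaustively, unlike anything resembling natural language.

```python
# Hedged sketch of the kind of narrow task behind 'circuit' results such as
# modular addition: each input is a pair (a, b) and the label is (a + b) mod p.
# This is a generic reconstruction of the task setup, not code from any paper.
import itertools
import numpy as np

p = 113  # a small prime modulus, an illustrative choice
pairs = np.array(list(itertools.product(range(p), range(p))))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % p                       # (a + b) mod p

# Shuffle and split; circuit analyses then train a small transformer on the
# train split and reverse-engineer the learned algorithm from its weights.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
n_train = int(0.3 * len(idx))
train_idx, test_idx = idx[:n_train], idx[n_train:]
print(pairs[train_idx].shape, labels[train_idx].shape)
```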
By “crisp internal representations” I mean that:
1. The latent variables the model uses to perform tasks are latent variables that humans also use, contra the idea that language models reason in completely alien ways (one standard way this kind of claim gets tested is sketched after this list). I agree that the two cited works are not particularly strong evidence :(
2. The model uses these latent variables with a bias towards shorter algorithms (e.g., shallower paths in transformers). This is important because it’s possible that, even when performing really narrow tasks, models use a very large number of (individually understandable) latent variables in algorithms so long that no human could feasibly understand what’s going on.
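As referenced in point 1, here is a minimal sketch of how one might test whether a human-interpretable latent variable is present in a model, assuming access to its activations (the activations, labels, and dimensions below are random placeholders): fit a linear probe from activations to human annotations and check held-out accuracy.

```python
# Minimal linear-probe sketch: is a human-interpretable latent variable
# (e.g. "is the subject of the sentence plural?") linearly decodable from a
# model's hidden states? Everything here is illustrative placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 768))    # (examples, hidden_dim), e.g. one layer's residual stream
human_labels = rng.integers(0, 2, size=2000)  # human annotation of the latent variable

X_tr, X_te, y_tr, y_te = train_test_split(
    activations, human_labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# High held-out accuracy is (weak) evidence that the model represents the same
# latent variable a human would use; it says nothing about how it is used.
```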
I’m not sure what the end product of automated interpretability will be; I think it would be pure speculation to make claims here.