It would be perverse to try to understand a king in terms of his molecular configuration, rather than in the contact between the farmer and the bandit. The molecules of the king are highly diminished phenomena, and if they have information about his place in the ecology, that information is widely spread out across all the molecules and easily lost just by missing a small fraction of them.
Agreed, but just as empirical observations and low-tech experiments gazing at the cosmos laid the foundation on which we built grander and more complex theories of the universe, it would be premature to claim that this line of inquiry will not yield profound mechanistic theories in the future. I do agree that these tools, at least at the moment, are largely frivolous and feature-specific, and do not capture more abstract notions of reality.
That being said, in terms of timescales we are in a pre-Newtonian era: we lack even the most basic fundamental laws for understanding how these models work.
It’s true that gazing at the cosmos has a history of leading to important discoveries, but mechanistic interpretability isn’t gazing at the cosmos; it’s gazing at the weights of a neural network.
On second thought, I agree that gazing at the cosmos is not a fair comparison. Rather, I would compare mechanistic interpretability to the early experiments of the Dutch microbiologist van Leeuwenhoek as he first looked at protozoa and bacteria under a microscope. They weren’t the most accurate or informative experiments in the grand scheme of things, but they were necessary for others to develop a more sophisticated understanding of biology.
It’s very likely that the field of mechanistic interpretability will grow beyond simply examining the weights of a model toward higher-order understandings of the computational flow within it (gradient descent and the data itself were mentioned in this thread). I agree that simply examining weights and activations is not a sufficient paradigm for understanding neural computation, but it is a start.
If mechanistic interpretability is the AI equivalent of finding tiny organisms in a microscope, what is the AI equivalent of the tiny organisms?
I would argue that the AI equivalent of these tiny organisms are “features,” which are just beginning to be defined in a structured, mathematical way.
Why?
In the same way that cells were understood to be indivisible, atomic units of biology hundreds of years ago—before the discovery of sub-cellular structures like organelles, proteins, and DNA—we currently understand features to be fundamental units of neural network representations that we are examining with tools like mechanistic interpretability.
This is not to say that the definition of a “feature” is at all settled; in fact, the lack of consensus reflects the extremely immature (but exciting!) state of interpretability research today. Nor am I claiming the analogy is a pure bijection. One pivotal way in which mechanistic interpretability and biology diverge is that defining and understanding feature emergence will almost certainly require going beyond simple decomposition of a model into weight and activation spaces (for example, understanding dataset-dependent computation flow, as you mentioned above). In contrast, most of biology’s advancement has come from decomposing cellular complexity into smaller and smaller pieces.
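Since the thread keeps gesturing at features as atomic units of a representation, a toy numerical sketch may make the idea concrete: in much current work a feature is modeled as a direction in activation space, and an activation vector as a sparse combination of such directions. Everything below (the directions, the codes, the threshold) is an illustrative assumption, not any specific method from the literature:

```python
import numpy as np

# Toy setup: activations live in a 4-dim space and are built from two
# hypothetical "feature" directions (all numbers here are made up).
features = np.array([
    [1.0, 0.0, 0.0, 0.0],   # hypothetical feature A
    [0.0, 1.0, 0.0, 0.0],   # hypothetical feature B
])

# Each activation vector is a sparse nonnegative combination of the features.
true_codes = np.array([
    [2.0, 0.0],
    [0.0, 3.0],
    [1.0, 1.0],
])
activations = true_codes @ features

# "Interpretation" step: project activations onto the candidate feature
# directions and keep only strong responses -- a crude stand-in for the
# sparse decomposition that dictionary-learning methods perform.
responses = activations @ features.T
codes = np.where(responses > 0.5, responses, 0.0)

# If the candidate directions are right, the sparse codes reconstruct
# the activations exactly.
reconstruction = codes @ features
print(np.allclose(reconstruction, activations))  # True
```

Real methods have to learn the dictionary of directions rather than being handed it, which is exactly where the “what counts as a feature” ambiguity discussed above comes from.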
I suspect this will not be the final story for interpretability, but mechanistic interpretability is an interesting first chapter.
If you have a certain kind of cell (e.g. penicillium), you can add certain kinds of organic matter (e.g. food), and that organic matter spontaneously converts into more of the original kind of cell (e.g. it gets moldy). This makes cells much more influential than other similarly diminished entities.
In order to get something analogous to cells, it’s not enough just to discover small structures, since there are plenty of small structures that don’t form spontaneously like this. It seems dubious whether current mechanistic interpretability is finding features of that kind.
I agree that it is dubious at the moment. I just think it’s too early to tell and the field itself will undoubtedly grow in complexity over the coming years.
Your point about the spontaneity of cells forming stands, although I wasn’t phrasing the analogy at the level of thermodynamics / physics.