If mechanistic interpretability is the AI equivalent of finding tiny organisms in a microscope, what is the AI equivalent of the tiny organisms?
I would argue that the AI equivalent of these tiny organisms is “features,” which are just beginning to be defined in a structured, mathematical way.
Why?
In the same way that cells were understood to be indivisible, atomic units of biology hundreds of years ago—before the discovery of sub-cellular structures like organelles, proteins, and DNA—we currently understand features to be fundamental units of neural network representations that we are examining with tools like mechanistic interpretability.
This is not to say that the definition of what constitutes a “feature” is clear at all. In fact, the lack of consensus reflects the extremely immature (but exciting!) state of interpretability research today. I am also not claiming that the analogy is a pure bijection: one of the pivotal ways in which mechanistic interpretability and biology diverge is that defining and understanding feature emergence will most definitely require looking beyond a simple decomposition of the model into weight and activation spaces (for example, understanding dataset-dependent computation flow, as you mentioned above). In contrast, most of biology’s advancement has come from decomposing cellular complexity into smaller and smaller pieces.
I suspect this will not be the final story for interpretability, but mechanistic interpretability is an interesting first chapter.
If you have a certain kind of cell (e.g. Penicillium), then you can add certain kinds of organic matter (e.g. food), and then this organic matter spontaneously converts into more of the original kind of cell (e.g. it gets moldy). This makes cells much more influential than other similarly small entities.
In order to get something analogous to cells, it’s not enough just to discover small structures, since there are lots of small structures that don’t form spontaneously like this. It seems dubious whether current mechanistic interpretability is finding features like this.
I agree that it is dubious at the moment. I just think it’s too early to tell and the field itself will undoubtedly grow in complexity over the coming years.
Your point about the spontaneity of cells forming stands, although I wasn’t phrasing the analogy at the level of thermodynamics / physics.