After spending a while thinking about interpretability, my current stance is:
Let’s define mechanistic interpretability as “a subfield of interpretability that uses bottom-up approaches, generally by mapping low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding.”
I think mechanistic interpretability probably has to succeed very ambitiously to be useful.
Mechanistic interpretability seems to me to be very far from succeeding this ambitiously.
Most people working on mechanistic interpretability don’t seem to me like they’re on a straightforward path to ambitious success, though I’m somewhat on board with the stuff that Anthropic’s interp team is doing here.
Note that this is just for “mechanistic interpretability”. I think that high-level, top-down interpretability (both black-box and white-box) has a clearer story for usefulness, one which doesn’t require very ambitious success.
For mechanistic interpretability, very ambitious success looks something like:
Have some decomposition of the model or the behavior of the model into parts.
For any given randomly selected part, you should almost always be able to build up a very good understanding of this part in isolation.
By “very good” I mean that the understanding accounts for 90% of the bits of optimization applied to this part (where the remaining bits aren’t predictably more or less important per bit than what you’ve understood).
Roughly speaking, if your understanding accounts for 90% of the bits of optimization for an AI, then you should be able to construct an AI which works as well as if the original AI had only been trained with 90% of the actual training compute.
In terms of loss explained, this is probably very high, like well above 99%.
The explanation of all the parts can probably be compressed by at most a factor of ~1000 relative to the size of the model in bits. So, for a 1 trillion parameter model, that’s at least 100 million words, or about 200,000 pages (assuming 10 bits per word; see the back-of-envelope sketch after this list). The compression comes from being able to use human concepts, but this will only get you so far.
Given your ability to explain any given part, build an overall understanding by piecing things together. This could be implicitly represented.
Be able to query this understanding to answer interesting questions.
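To make the length estimate above concrete, here is a minimal back-of-envelope sketch of the arithmetic behind the “100 million words / 200,000 pages” figure. The conversion of roughly one bit of content-to-explain per parameter and the 500-words-per-page figure are my assumptions for illustration; the ~1000x compression factor and 10 bits per word are the numbers given above.

```python
# Back-of-envelope for the explanation-length estimate above.
# Assumptions (mine, for illustration): ~1 bit of content to explain per
# parameter, at most ~1000x compression from using human concepts,
# ~10 bits per word of natural language, ~500 words per page.

params = 1e12                     # 1 trillion parameter model
bits_to_explain = params * 1.0    # ~1 bit per parameter (assumption)
compression = 1000                # explanation is at most ~1000x shorter
bits_per_word = 10
words_per_page = 500

explanation_bits = bits_to_explain / compression        # ~1e9 bits
explanation_words = explanation_bits / bits_per_word    # ~1e8 words (~100 million)
explanation_pages = explanation_words / words_per_page  # ~2e5 pages (~200,000)

print(f"{explanation_words:.2e} words, {explanation_pages:.2e} pages")
```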
I don’t think there is an obviously easier road for mech interp to answer questions like “is the model deceptively aligned?” if you want the approach to compete with much simpler high-level and top-down interpretability.
The main reason I think mechanistic interpretability is very far from ambitious success is that the current numbers are extremely bad and what people explain is extremely cherry-picked. For instance, people’s explanations typically result in performance worse than that of much, much tinier models, even though heavy cherry-picking is applied.
If people were getting okay performance on randomly selected “parts” of models (for any notion of decomposition), then we’d be much closer. I’d think we were much closer even if this were extremely labor-intensive.
(E.g., the curve detectors work explained ~50% of the loss, which probably corresponds to well less than 10% of the bits given the sharply diminishing returns to scale in typical scaling laws.)
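To illustrate how “fraction of loss explained” and “fraction of bits of optimization” come apart under diminishing returns (both the earlier claim that 90% of the bits should correspond to well above 99% of the loss, and the claim just above that ~50% of the loss can correspond to well under 10% of the bits), here is a minimal sketch. It assumes a simple power law, loss ∝ compute^(-0.05), an exponent roughly in line with published language-model scaling laws, and treats bits of optimization as proportional to training compute, per the correspondence suggested earlier. The specific numbers are illustrative only, not a claim about curve detectors in particular.

```python
# Illustrative only: how fraction-of-loss-explained relates to
# fraction-of-compute (~ bits of optimization) under a power law.
# Assumptions: loss(C) = C**(-alpha) in arbitrary units, alpha = 0.05,
# baseline "unexplained" loss at C = 1, full model at C = 1e6.

alpha = 0.05
C_full = 1e6
loss0 = 1.0                     # baseline loss (at C = 1)
loss_full = C_full ** (-alpha)  # ~0.50: full model roughly halves the loss


def frac_loss_explained(C):
    """Fraction of the baseline-to-full loss gap recovered at compute C."""
    return (loss0 - C ** (-alpha)) / (loss0 - loss_full)


def compute_for_loss_fraction(frac):
    """Compute needed to recover a given fraction of the loss gap."""
    target_loss = loss0 - frac * (loss0 - loss_full)
    return target_loss ** (-1 / alpha)


# 90% of the compute (~bits) recovers well above 99% of the loss gap:
print(frac_loss_explained(0.9 * C_full))        # ~0.995

# Conversely, recovering 50% of the loss gap needs a tiny share of compute:
print(compute_for_loss_fraction(0.5) / C_full)  # ~3e-4, far below 10%
```

The qualitative point doesn’t depend on the exact exponent or baseline: for any small power-law exponent, most of the loss reduction comes from a small fraction of the compute, so explaining half the loss corresponds to only a small share of the optimization.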