For mechanistic interpretability, very ambitious success looks something like:
Have some decomposition of the model or the behavior of the model into parts.
For any given randomly selected part, you should almost always be able to build up a very good understanding of this part in isolation.
By “very good” I mean that the understanding accounts for 90% of the bits of optimization applied to this part (where the remaining bits aren’t predictably more or less important per bit than what you’ve understood).
Roughly speaking, if your understanding accounts for 90% of the bits of optimization for an AI, then you should be able to construct an AI which works as well as if the original AI had been trained with only 90% of the actual training compute.
In terms of loss explained, this is probably very high, like well above 99% (the scaling-law sketch at the end illustrates why).
The total length of the explanation of all parts is probably at most around 1000 times shorter in bits than the model itself. So, for a 1 trillion parameter model, that’s at least 100 million words, or about 200,000 pages (assuming 10 bits per word); see the arithmetic sketch after this list. The compression comes from being able to use human concepts, but this only gets you so much.
Given your ability to explain any given part, build an overall understanding by piecing things together. This could be implicitly represented.
Be able to query understanding to answer interesting questions.
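To make the length arithmetic concrete, here is a back-of-the-envelope version of the 1000x point above. The 10 bits per word figure is from the text; counting the model’s size as roughly one bit per parameter and a page as about 500 words are my own illustrative assumptions, chosen so the output matches the 100 million words and 200,000 pages figures.

```python
# Back-of-the-envelope check of the explanation-length numbers above.
# Assumptions (illustrative): model "size" counted as ~1 bit per parameter,
# 10 bits per word (as stated above), and ~500 words per page.

n_params = 1e12            # 1 trillion parameter model
compression_factor = 1000  # explanation is at most ~1000x shorter than the model
bits_per_word = 10
words_per_page = 500       # assumed

explanation_bits = n_params / compression_factor        # ~1e9 bits
explanation_words = explanation_bits / bits_per_word     # ~1e8 words (100 million)
explanation_pages = explanation_words / words_per_page   # ~2e5 pages (200,000)

print(f"{explanation_words:.0e} words, {explanation_pages:.0e} pages")
```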
I don’t think there is an obviously easier road for mech interp to answer questions like “is the model deceptively aligned?” if you want the approach to compete with much simpler high-level and top-down interpretability.
The main reason I think mechanistic interpretability is very far from ambitious success is that the current numbers are extremely bad and what people explain is heavily cherry-picked. People’s explanations typically result in performance worse than that of much, much tinier models, even though heavy cherry-picking is applied.
If people were getting OK performance on randomly selected “parts” of models (for any notion of decomposition), then we’d be much closer. I’d think we were much closer even if this were extremely labor intensive.
(E.g., the curve detectors work explained ~50% of the loss, which probably corresponds to well less than 10% of the bits, given the sharply diminishing returns to scale on typical scaling laws.)
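Here is a minimal sketch of that loss-to-bits conversion, assuming reducible loss follows a power law in training compute, L(C) = A * C^(-alpha), and treating “fraction of bits of optimization” as roughly the fraction of effective training compute (per the 90%-of-bits ≈ 90%-of-compute correspondence above). The constants A, alpha, the compute range, and the tiny-compute baseline that “loss explained” is measured against are all made-up illustrative values, not numbers from the curve detectors work.

```python
# Illustrative conversion between "fraction of bits / compute explained" and
# "fraction of loss explained", assuming reducible loss follows a power law
# in training compute: L(C) = A * C**(-alpha). All constants are assumptions
# chosen for illustration.

A = 1.0        # reducible loss at the baseline compute (arbitrary units)
alpha = 0.05   # assumed scaling exponent (sharply diminishing returns)
C_full = 1e6   # full training compute, relative to a tiny baseline at C = 1

def loss(c):
    return A * c ** (-alpha)

baseline, full = loss(1.0), loss(C_full)

# Direction 1: explaining 90% of the bits ~ matching a model trained with 90%
# of the compute, which still accounts for almost all of the loss gap.
frac_loss_explained = (baseline - loss(0.9 * C_full)) / (baseline - full)
print(f"90% of compute -> {frac_loss_explained:.1%} of the loss gap")   # ~99.5%

# Direction 2: an explanation recovering only 50% of the loss gap (as in the
# curve detector example) corresponds to a tiny fraction of effective compute.
target_loss = baseline - 0.5 * (baseline - full)
C_equiv = (target_loss / A) ** (-1 / alpha)
print(f"50% of the loss gap -> {C_equiv / C_full:.2%} of the compute")  # ~0.03%
```

The exact percentages shift with the assumed exponent and compute range, but the qualitative picture is robust: with sharply diminishing returns, matching 90% of the compute-equivalent bits still explains the vast majority of the loss gap, while explaining only 50% of the loss corresponds to a small fraction of the bits.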