The main reason I think mechanistic interpretability is very far from ambitious success is that current numbers are extremely bad and what people explain is extremely cherry picked. People’s explanations typically result in performance worse than that of much, much tinier models, even though heavy cherry picking is applied.

If people were getting OK performance on randomly selected “parts” of models (for any notion of decomposition), then we’d be much closer. I’d think we were much closer even if this was extremely labor intensive.
(E.g., the curve detectors work explained ~50% of the loss, which is probably well less than 10% of the bits, given the sharply diminishing returns to scale seen in typical scaling laws.)
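To make the diminishing-returns point concrete, here is a minimal sketch (not the author’s actual calculation) assuming a Kaplan-style power law for reducible loss, L(N) = c · N^(-alpha), with an illustrative alpha ≈ 0.076. Under that assumption, an explanation that doubles the loss (i.e., recovers ~50% of it) only matches a model orders of magnitude smaller:

```python
# Hedged illustration: alpha and the power-law form are assumptions,
# loosely based on typical language-model scaling-law fits.
ALPHA = 0.076  # hypothetical scaling exponent

def equivalent_size_ratio(loss_ratio: float, alpha: float = ALPHA) -> float:
    """Fraction of the full model's size needed to reach
    `loss_ratio` times the full model's loss, under L(N) = c * N**-alpha.

    Solving c * N_small**-alpha = loss_ratio * c * N_full**-alpha
    gives N_small / N_full = loss_ratio**(-1/alpha).
    """
    return loss_ratio ** (-1.0 / alpha)

# An explanation whose loss is 2x the full model's (recovering ~half the loss)
# is only as good as a model roughly 10,000x smaller under these assumptions:
print(f"{equivalent_size_ratio(2.0):.1e}")
```

So under this toy power law, recovering half the loss corresponds to matching a model well under 0.1% of the original’s size, which is the sense in which 50% of the loss can still be a tiny fraction of the model’s capability.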