I think interp ‘works best’ within a capability range, with both a lower and an upper bound. (Note: this is a personal take that does not necessarily reflect consensus in the field.)
Below a certain capability threshold, it’s difficult to interpret models, because they are too primitive to think in anything like the way humans do. Your usual intuitions about how models work break down, and it’s not clear whether insights gained from interpreting them will generalise to larger models. Rough vibe: this means anything less capable than GPT-2.
At high capability levels, things also get more difficult, both for mundane reasons (it takes more time and compute to get results, you need better infra to run larger models, SAEs need to grow proportionately, etc.) and for fundamental ones (e.g. the number of almost-orthogonal directions in N-dimensional space grows exponentially in N, so wider models can learn exponentially more features, and those features may be increasingly complex / fine-grained).
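To make the almost-orthogonality point concrete, here is a minimal sketch (plain NumPy, my own illustration rather than anything from the original argument) showing that random unit vectors in a high-dimensional space end up nearly orthogonal to one another, which is part of why a wide residual stream can pack in far more features than it has dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_vectors: int, dim: int) -> float:
    """Sample random unit vectors and return the largest pairwise |cosine similarity|."""
    v = rng.standard_normal((n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # normalise to unit length
    sims = v @ v.T                                 # all pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)                    # ignore self-similarity
    return float(np.abs(sims).max())

# Same number of vectors, increasing dimension: the worst-case overlap
# between any two vectors shrinks sharply as the space gets wider.
for dim in (16, 256, 4096):
    print(dim, round(max_abs_cosine(1000, dim), 3))
```

Even with far more vectors than dimensions, the largest pairwise overlap drops off quickly as the dimension grows, and the interference between directions becomes small enough that the model can treat them as separate features.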