And yes, I do think that interp work today should mostly focus on image nets, for the same reasons we ourselves focus on image nets. The field’s current focus on LLMs is a mistake.
A note that the word on the street in mech-interp land is that you often get more signal, and a greater number of techniques work, on bigger & smarter language models than on smaller & dumber possibly-not-language-models. Presumably this is due to smarter & more complex models having more structured representations.
Fwiw, this is not at all obvious to me, and I would weakly bet that larger models are harder to interpret (even beyond there just being more capabilities to study).
Hmm. I think there’s something about this that rings true and yet...
Ok, so what if there were a set of cliff faces with the property that climbing the bigger ones was more important, and also that climbing tools worked better on them? Yet, despite the tools working better on the large cliffs, the smaller cliffs were easier to climb (both because the routes were shorter, and because the routes were less technical). Seems like if your goal is to design climbing equipment that will be helpful on large cliff faces, you should test the climbing equipment on large cliff faces, even if that means you won’t have the satisfaction of completing any of your testing climbs.
What if you tried to figure out a way to understand the “canonical cliffness” and design a new line of equipment that could be tailored to fit any “slope”… Which cliff would you test first? 🤔
So you would expect Claude Opus 3 to be harder to interpret than Claude Sonnet 3.5?
My intuition is that larger models of the same capability would exhibit less superposition and thus be easier to interpret?
Do you have a concrete example of a technique for which this applies?
Yeah, I think this is a relevant point. Maybe the upshot for John and David’s project would be to try their ideas on absurdly oversized image models. Sometimes scale just makes things less muddled.
Might run into funding limitations. I wish there were more sources of large-scale compute available for research like this.
I think interp ‘works best’ within a capability range, with both an upper and a lower bound. (Note: this is a personal take that does not necessarily reflect the consensus in the field.)
Below a certain capability threshold, it’s difficult to interpret models, because those models are so primitive as to not really be able to think like humans. Your usual intuitions about how models work therefore break down, and it’s also not clear whether the insight you get from interpreting the model will generalise to larger models. Rough vibe: this means anything less capable than GPT-2.
With high capabilities, things also get more difficult, both for mundane reasons (it takes more time and compute to get results, you need better infra to run larger models, SAEs need to get proportionately larger, etc.) and for fundamental ones (e.g. the number of almost-orthogonal directions in N-dimensional space is exponential in N, so wider models can learn exponentially more features, and these features may be increasingly complex / fine-grained).
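To give a concrete feel for that last parenthetical, here is a minimal numerical sketch (my illustration of the standard high-dimensional-geometry fact, not something from the discussion above): sample random unit vectors in R^d (the N above), pack in several times more of them than there are dimensions, and look at the pairwise interference. Even with 4x more directions than dimensions, the typical and worst-case |cosine similarity| stay modest and shrink as the width d grows.

```python
# Minimal sketch (illustrative, not from the discussion): random unit vectors
# in R^d are close to orthogonal, so a width-d model can host many more than d
# feature directions with only modest pairwise interference.
import numpy as np

rng = np.random.default_rng(0)

def interference_stats(d: int, n: int):
    """Mean and max |cosine similarity| over all distinct pairs of n random unit vectors in R^d."""
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # unit-norm rows
    cos = np.abs(v @ v.T)                           # pairwise |cosine similarity|
    pairs = cos[np.triu_indices(n, k=1)]            # each distinct pair once, no diagonal
    return float(pairs.mean()), float(pairs.max())

for d in (64, 256, 1024):
    n = 4 * d                                       # 4x more directions than dimensions
    mean_c, max_c = interference_stats(d, n)
    print(f"d={d:4d}  n={n:4d}  mean |cos|={mean_c:.3f}  max |cos|={max_c:.3f}")
```

At a fixed width d, adding more random directions pushes the worst-case interference up only roughly like sqrt(log n / d), which is the flip side of the claim above: the number of directions that fit below a fixed interference threshold grows exponentially with d.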