I too agreed w/ Chanind initially, but I think I see where Lucius is coming from.
If we forget about a basis & focus on minimal description length (MDL), it’d be nice to have a technique that finds the MDL [features/components] for each datapoint. e.g. in my comment, I have 4 animals (bunny, etc) & two properties (cute, furry). For MDL reasons, it’d be great to sometimes use cute/furry & sometimes use Bunny, whichever reflects the model’s computation more simply.
If you have both attributes & animals as fundamental units (and somehow have a method that tells you which minimal set of units form each datapoint) then a bunny will just use the bunny feature (since that’s simpler than cute + furry + etc), & a very cute bunny will use bunny + cute (instead of bunny + 0.1*dolphin + etc (or similar w/ properties)).
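To make the selection rule concrete, here’s a toy sketch (my own illustration, not anyone’s actual method): an overcomplete feature set containing both attributes & an animal, with “MDL” stood in for by the smallest subset of features that exactly reconstructs a datapoint. All names & vectors are made up for the example.

```python
from itertools import combinations
import numpy as np

# Toy features: bunny is exactly cute + furry, so the set is overcomplete
# (not a basis). Values are illustrative, not from any real model.
cute = np.array([1.0, 0.0])
furry = np.array([0.0, 1.0])
features = {"bunny": cute + furry, "cute": cute, "furry": furry}

def mdl_decomposition(x, features, tol=1e-6):
    """Brute-force stand-in for MDL: return the smallest subset of
    features whose least-squares combination reconstructs x exactly."""
    names = list(features)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            A = np.stack([features[n] for n in subset], axis=1)
            coef, *_ = np.linalg.lstsq(A, x, rcond=None)
            if np.allclose(A @ coef, x, atol=tol):
                return {n: float(c) for n, c in zip(subset, coef.round(3))}
    return None

# A plain bunny uses just the bunny feature (simpler than cute + furry):
print(mdl_decomposition(cute + furry, features))  # → {'bunny': 1.0}
# A very cute bunny uses bunny + cute, not three attribute features:
print(mdl_decomposition(cute + furry + 2 * cute, features))
```

The point of the sketch is just that once you allow both attributes & animals as units, the cheapest description of a datapoint can switch between them, which is exactly what a single fixed basis forbids.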
So if we look at Lucius’s initial statement:
> The features a model thinks in do not need to form a basis or dictionary for its activations. [emphasis mine]
They don’t need to, but they can form a basis. It very well could be simpler not to constrain our understanding/model of the NN’s features to forming a basis.
Ideally Lucius could just show us this magical method that gets you simple components that don’t form a basis; then we’d all have an easier time understanding his point. I believe this “magical method” is Attribution-based Parameter Decomposition (APD), which they (Lucius, Dan, Lee?) have been working on, & I’d be excited if more people tried creative methods to scale it up. I’m unsure whether this method will work, but it’s a different bet than e.g. SAEs & currently underrated imo.
I think you’ve renamed the post which changed the url:
https://max.v3rv.com/papers/interpreting_complexity
is now the correct one AFAIK & the original link at the top (https://max.v3rv.com/papers/circuits_and_memorization)
is broken.