Is there some formal-ish definition of “explanation of (network, dataset)” and “mathematical description length of an explanation” such that you think SAEs are especially short explanations? I still don’t think I have whatever intuition you’re describing, and I feel like the issue is that I don’t know how you’re measuring description length and what class of “explanations” you’re considering.
As naive examples that probably don’t work (similar to the ones from my original comment):
We could consider any Turing machine that approximately outputs (network, dataset) an “explanation”, but it seems very likely that SAEs aren’t competitive with short TMs of this form (obviously this isn’t a fair comparison)
We could consider fixed computational graphs made out of linear maps and count the number of parameters. I think your objection to this is that these don’t “explain the dataset”? (but then I’m not sure in what sense SAEs do)
We could consider arithmetic circuits that approximate the network on the dataset, and count the number of edges in the circuit to get “description length”. This might give some advantage to SAEs if you can get sparse weights in the sparse basis, and it seems like the best attempt out of these three. But it seems very unclear to me that SAEs are better in this sense than even the original network (let alone stuff like pruning).
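To make the edge-counting option concrete, here is a toy sketch, assuming we treat every nonzero weight as one edge of the circuit. The helper `count_edges` and both weight lists are made up for illustration; the two toy networks are not claimed to compute the same function, and nothing here shows that real SAEs come out ahead.

```python
# Hypothetical helper: treat each nonzero weight as one edge of an arithmetic circuit.
def count_edges(layers, tol=0.0):
    """Count weights with magnitude above tol across a list of weight matrices."""
    return sum(
        1
        for weights in layers
        for row in weights
        for w in row
        if abs(w) > tol
    )

# Toy dense network: two 3x3 layers, every weight nonzero (made-up numbers).
dense = [
    [[0.5, -1.0, 0.2], [0.3, 0.8, -0.4], [1.1, 0.1, 0.9]],
    [[0.7, 0.2, -0.6], [-0.3, 0.5, 0.4], [0.9, -0.8, 0.1]],
]

# Toy rewrite in a wider basis (3 -> 4 -> 3 units) with mostly zero weights.
sparse = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -0.5]],
]

dense_edges = count_edges(dense)
sparse_edges = count_edges(sparse)
print(dense_edges, sparse_edges)
```

Under this measure a wider basis can still count as "shorter" if enough of its weights are zero, which is the only sense in which sparsity would help here.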
Focusing instead on what an “explanation” is: would you say the network itself is an “explanation of (network, dataset)” and just has high description length? If not, then the thing I don’t understand is more about what an explanation is and why SAEs are one, rather than how you measure description length.
ETA: On re-reading, the following quote makes me think the issue is that I don’t understand what you mean by “the explanation” (is there a single objective explanation of any given network? If so, what is it?) But I’ll leave the rest in case it helps clarify where I’m confused.
Assuming the network is smaller yet as performant (therefore presumably doing more computation in superposition), then the explanation of the (network, dataset) is basically unchanged.
Is there some formal-ish definition of “explanation of (network, dataset)” and “mathematical description length of an explanation” such that you think SAEs are especially short explanations? I still don’t think I have whatever intuition you’re describing, and I feel like the issue is that I don’t know how you’re measuring description length and what class of “explanations” you’re considering.
I’ll register that I prefer using ‘description’ instead of ‘explanation’ in most places. The reason is that ‘explanation’ invokes a notion of understanding, which requires both a mathematical description and a semantic description. So I regret using the word ‘explanation’ in the comment above (it wasn’t completely wrong to use it, but it did risk confusion). I’ll edit to replace it with ‘description’ and strike through ‘explanation’.
“explanation of (network, dataset)”: I’m afraid I don’t have a great formalish definition beyond just pointing at the intuitive notion. But formalizing what an explanation is seems like a high bar. If it’s helpful, a mathematical description is just a statement of what the network is in terms of particular kinds of mathematical objects.
“mathematical description length of an explanation”: (Note: Mathematical descriptions are of networks, not of explanations.) It’s just the number of objects used to describe the network. It may be helpful to think in terms of maps between different descriptions: e.g. there is a many-to-one map between a description of a neural network in terms of polytopes and in terms of neurons. There are ~exponentially many more polytopes. Hence the mathematical description of the network in terms of individual polytopes is much larger.
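As a rough illustration of the polytope-versus-neuron count, the sketch below uses the standard bound for hyperplanes in general position: n hyperplanes cut R^d into at most sum_{i=0..d} C(n, i) regions. It assumes a single ReLU layer (one hyperplane per neuron); deeper networks behave differently, and the function name `max_regions` is made up for this example.

```python
from math import comb

def max_regions(n_neurons, input_dim):
    """Upper bound on linear regions (polytopes) of one ReLU layer.

    Each neuron contributes one hyperplane; n hyperplanes in general
    position split R^d into sum_{i=0..d} C(n, i) regions.
    """
    return sum(comb(n_neurons, i) for i in range(input_dim + 1))

# Compare the number of neurons with the number of polytopes they can induce.
for n in (4, 8, 16):
    print(n, max_regions(n, input_dim=n))
```

When the input dimension is at least the number of neurons, the bound equals 2^n, so a description that lists polytopes individually can be exponentially longer than one that lists neurons.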
Focusing instead on what an “explanation” is: would you say the network itself is an “explanation of (network, dataset)” and just has high description length?
I would not. So:
If not, then the thing I don’t understand is more about what an explanation is and why SAEs are one, rather than how you measure description length.
I think that the confusion might again be from using ‘explanation’ rather than description.
SAEs (or decompiled networks that use SAEs as the building block) are supposed to approximate the original network behaviour. So SAEs are mathematical descriptions of the network, but not of the (network, dataset). What’s a mathematical description of the (network, dataset), then? It’s just what you get when you pass the dataset through the network: this datum interacts with this weight to produce this activation, that datum interacts with this weight to produce that activation, and so on. A mathematical description of the (network, dataset) in terms of SAEs is: this datum activates dictionary features xyz (where xyz is just indices and has no semantic info), that datum activates dictionary features abc, and so on.
Lmk if that’s any clearer.
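The "this datum activates dictionary features xyz" picture can be sketched as follows. This is a toy under stated assumptions: the encoder is a plain ReLU(Wx + b), and `ENC_W`, `ENC_B`, and the activation vectors are invented numbers rather than a trained SAE.

```python
def relu_encode(x, enc_weights, enc_bias):
    """Toy SAE encoder: ReLU(W x + b), written with plain lists."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]

def active_features(x, enc_weights, enc_bias):
    """Indices of dictionary features with strictly positive activation."""
    codes = relu_encode(x, enc_weights, enc_bias)
    return [i for i, a in enumerate(codes) if a > 0.0]

# Hypothetical encoder: 4 dictionary features over 2-d network activations.
ENC_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
ENC_B = [-0.5, -0.5, -0.5, -0.5]

# Hypothetical network activations, one vector per datum.
activations = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]

# The (network, dataset) description: datum index -> active feature indices.
description = {
    i: active_features(x, ENC_W, ENC_B) for i, x in enumerate(activations)
}
print(description)  # {0: [0], 1: [1], 2: [2, 3]}
```

The indices carry no semantic information; the description is purely a record of which dictionary elements fire on which datum.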
Thanks for the detailed responses! I’m happy to talk about “descriptions” throughout.
Trying to summarize my current understanding of what you’re saying:
SAEs themselves aren’t meant to be descriptions of (network, dataset). (I’d just misinterpreted your earlier comment.)
As a description of just the network, SAEs have a higher description length than a naive neuron-based description of the network.
Given a description of the network in terms of “parts,” we can get a description of (network, dataset) by listing out which “parts” are “active” on each sample. I assume we then “compress” this description somehow (e.g. grouping similar samples), since otherwise the description would always have size linear in the dataset size?
You’re then claiming that SAEs are a particularly short description of (network, dataset) in this sense (since they’re optimized for not having many parts active).
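One naive way to put numbers on the summary above, as a sketch only: charge ceil(log2(number of parts)) bits for each active index on each sample. The function `index_list_bits` and the example codes are hypothetical, and the measure ignores activation magnitudes and any compression across samples.

```python
from math import ceil, log2

def index_list_bits(active_sets, n_parts):
    """Bits needed to list active part indices for every sample.

    Each index costs ceil(log2(n_parts)) bits; magnitudes are ignored.
    """
    bits_per_index = ceil(log2(n_parts))
    return sum(len(s) for s in active_sets) * bits_per_index

# Hypothetical sparse code: few of 64 parts are active per sample.
sparse_code = [[3, 17], [5], [3, 40, 41]]
# Hypothetical dense code: all 8 parts are active on every sample.
dense_code = [list(range(8))] * 3

sparse_bits = index_list_bits(sparse_code, n_parts=64)
dense_bits = index_list_bits(dense_code, n_parts=8)
doubled_bits = index_list_bits(sparse_code * 2, n_parts=64)
print(sparse_bits, dense_bits, doubled_bits)
```

Doubling the dataset doubles the cost, which is the "linear in the dataset size" worry: without some further notion of "compress", sparsity only lowers the constant.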
My confusion mainly comes down to defining the words in quotes above, i.e. “parts”, “active”, and “compress”. My sense is that they are playing a pretty crucial role and that there are important conceptual issues with formalizing them. (So it’s not just that we have a great intuition and it’s just annoying to spell it out mathematically; I’m not convinced we even have a good intuitive understanding of what these things should mean.)
That said, my sense is you’re not claiming any of this is easy to define. I’d guess you have intuitions that the “short description length” framing is philosophically the right one, and I probably don’t quite share those and feel more confused how to best think about “short descriptions” if we don’t just allow arbitrary Turing machines (basically because deciding what allowable “parts” or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I’m happy to keep trying a bit more in case you’re excited to explain).
Trying to summarize my current understanding of what you’re saying:
Yes, all four sound right to me. To avoid any confusion, I’d just add an emphasis that the descriptions are mathematical, as opposed to semantic.
I’d guess you have intuitions that the “short description length” framing is philosophically the right one, and I probably don’t quite share those and feel more confused how to best think about “short descriptions” if we don’t just allow arbitrary Turing machines (basically because deciding what allowable “parts” or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I’m happy to keep trying a bit more in case you’re excited to explain).
I too am keen to converge on a format in terms of Turing machines or Kolmogorov complexity or something else more formal. But I don’t feel very well placed to do that, unfortunately, since thinking in those terms isn’t very natural to me yet.
“explanation of (network, dataset)”: I’m afraid I don’t have a great formalish definition beyond just pointing at the intuitive notion.
What’s wrong with “proof” as a formal definition of explanation (of behavior of a network on a dataset)? I claim that description length works pretty well on “formal proof”; I’m in the process of producing a write-up on results exploring this.