Erik Jenner comments on Sparsify: A mechanistic interpretability research agenda

Erik Jenner 8 Apr 2024 19:41 UTC
LW: 4 AF: 2
Thanks for the detailed responses! I’m happy to talk about “descriptions” throughout.
Trying to summarize my current understanding of what you’re saying:
- SAEs themselves aren’t meant to be descriptions of (network, dataset). (I’d just misinterpreted your earlier comment.)
- As a description of just the network, SAEs have a higher description length than a naive neuron-based description of the network.
- Given a description of the network in terms of “parts,” we can get a description of (network, dataset) by listing out which “parts” are “active” on each sample. I assume we then “compress” this description somehow (e.g. grouping similar samples), since otherwise the description would always have size linear in the dataset size?
- You’re then claiming that SAEs are a particularly short description of (network, dataset) in this sense (since they’re optimized for not having many parts active).
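To make the third and fourth bullets concrete, here is a minimal sketch of one way the "list which parts are active on each sample" cost could be counted. Everything in it is an assumption for illustration rather than anything from the agenda itself: the encoding (a fixed-width count per sample plus a fixed-width index per active part), the function name `listing_cost_bits`, and the toy activation sets. It takes "parts" and "active" as given and does not attempt the "compress" step.

```python
import math


def listing_cost_bits(active_sets, num_parts):
    """Bits needed to list the active parts for every sample.

    Assumed (hypothetical) encoding: per sample, one fixed-width field
    stating how many parts are active, followed by one fixed-width index
    per active part. No grouping or compression across samples.
    """
    bits_per_index = math.ceil(math.log2(num_parts))
    bits_per_count = math.ceil(math.log2(num_parts + 1))
    return sum(bits_per_count + len(active) * bits_per_index
               for active in active_sets)


# Toy "neuron-like" description: few parts, most of them active per sample.
dense = [set(range(6)) for _ in range(4)]           # 6 of 8 parts active
dense_bits = listing_cost_bits(dense, num_parts=8)

# Toy "SAE-like" description: many more parts, few active per sample.
sparse = [{i, i + 32} for i in range(4)]            # 2 of 64 parts active
sparse_bits = listing_cost_bits(sparse, num_parts=64)

# Without a compression step, the cost is linear in the dataset size.
doubled_bits = listing_cost_bits(sparse + sparse, num_parts=64)

print(dense_bits, sparse_bits, doubled_bits)
```

Under this particular encoding the sparse listing comes out shorter even though each index costs more bits, and doubling the dataset exactly doubles the cost. Both conclusions depend on the assumed encoding, which is part of why the definitions matter.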
My confusion mainly comes down to defining the words in quotes above, i.e. “parts”, “active”, and “compress”. My sense is that they are playing a pretty crucial role and that there are important conceptual issues with formalizing them. (So it’s not just that we have a great intuition that is merely annoying to spell out mathematically; I’m not convinced we even have a good intuitive understanding of what these things should mean.)
That said, my sense is you’re not claiming any of this is easy to define. I’d guess you have intuitions that the “short description length” framing is philosophically the right one, and I probably don’t quite share those and feel more confused how to best think about “short descriptions” if we don’t just allow arbitrary Turing machines (basically because deciding what allowable “parts” or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I’m happy to keep trying a bit more in case you’re excited to explain).
- Lee Sharkey 10 Apr 2024 9:55 UTC
  LW: 2 AF: 1
  > Trying to summarize my current understanding of what you’re saying:

  Yes, all four sound right to me.
  To avoid any confusion, I’d just add an emphasis that the descriptions are mathematical, as opposed to semantic.
  > I’d guess you have intuitions that the “short description length” framing is philosophically the right one, and I probably don’t quite share those and feel more confused how to best think about “short descriptions” if we don’t just allow arbitrary Turing machines (basically because deciding what allowable “parts” or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I’m happy to keep trying a bit more in case you’re excited to explain).
  I too am keen to converge on a format in terms of Turing machines or Kolmogorov complexity or something else more formal. But I don’t feel very well placed to do that, unfortunately, since thinking in those terms isn’t very natural to me yet.