RogerDearnaley comments on Why I’m bearish on mechanistic interpretability: the shards are not in the network

RogerDearnaley 17 Sep 2024 23:43 UTC
2 points
0
There are probably a dozen or more articles on this bu now. Search for VAE or Variational Auto-Encoder in the context of mechanical interpretability. The seminal paper on this was from Anthropic.
- tailcalled 18 Sep 2024 7:31 UTC
  2 points
  0
  Parent
  I don’t immediately find it, do you have a link?
  - Bart Bussmann 18 Sep 2024 7:41 UTC
    1 point
    0
    Parent
    I think @RogerDearnaley means Sparse Autoencoders (SAEs), see for example these papers and the SAE tag on LessWrong.
    - tailcalled 18 Sep 2024 8:03 UTC
      2 points
      5
      Parent
      As I mentioned in my other comment, SAEs finds features that correspond to abstract features of words and text. That’s not the same as finding features that correspond to reality.