tailcalled comments on Why I’m bearish on mechanistic interpretability: the shards are not in the network

tailcalled 16 Sep 2024 6:01 UTC
2 points
0
What do you mean? Do you have a link?
- RogerDearnaley 17 Sep 2024 23:43 UTC
  2 points
  0
  Parent
  There are probably a dozen or more articles on this bu now. Search for VAE or Variational Auto-Encoder in the context of mechanical interpretability. The seminal paper on this was from Anthropic.
  - tailcalled 18 Sep 2024 7:31 UTC
    2 points
    0
    Parent
    I don’t immediately find it, do you have a link?
    - Bart Bussmann 18 Sep 2024 7:41 UTC
      1 point
      0
      Parent
      I think @RogerDearnaley means Sparse Autoencoders (SAEs), see for example these papers and the SAE tag on LessWrong.
      - tailcalled 18 Sep 2024 8:03 UTC
        2 points
        5
        Parent
        As I mentioned in my other comment, SAEs finds features that correspond to abstract features of words and text. That’s not the same as finding features that correspond to reality.