Bart Bussmann comments on Why I’m bearish on mechanistic interpretability: the shards are not in the network

Bart Bussmann 18 Sep 2024 7:41 UTC
1 point
0
I think @RogerDearnaley means Sparse Autoencoders (SAEs), see for example these papers and the SAE tag on LessWrong.
- tailcalled 18 Sep 2024 8:03 UTC
  2 points
  5
  Parent
  As I mentioned in my other comment, SAEs finds features that correspond to abstract features of words and text. That’s not the same as finding features that correspond to reality.