wassname comments on How useful is mechanistic interpretability?

wassname 19 Jan 2024 12:45 UTC
1 point
0
“other model internals techniques”
What are these? I’m confused about the boundary between mech interp and the others.
- ryan_greenblatt 19 Jan 2024 18:09 UTC
  2 points
  0
  
  By mech interp I mean “A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding.”
  
  For examples of non-mech interp model internals, see here, here, and here. (Though all of these methods are quite simple.)