ryan_greenblatt comments on How useful is mechanistic interpretability?

ryan_greenblatt 19 Jan 2024 18:09 UTC
2 points

By mech interp I mean “A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding.”

For examples of non-mech interp model internals, see here, here, and here. (Though all of these methods are quite simple.)