Rohin Shah comments on How useful is mechanistic interpretability?

Rohin Shah 2 Dec 2023 21:18 UTC
5 points
0
“One piece missing here, insofar as current methods don’t get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%.”
When you do that using existing methods, you lose the sparsity (e.g. for circuit finding you have to include a large fraction of the model to get to 99% loss recovered).
It’s of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn’t going to go away with better methods). I do expect we can improve; we’re very far from the 99% standard. But the way we improve won’t be by “drilling into the residual”; that has been tried and is insufficient. EDIT: Possibly by “drill into the residual” you mean “understand why the methods don’t work and then improve them”—if so I agree with that but also think this is what mech interp researchers want to do.
(Why am I still optimistic about interpretability? I’m not convinced that the 99% standard is required for downstream impact—though I am pretty pessimistic about the “enumerative safety” story of impact, basically for the same reasons as Buck and Ryan afaict.)
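For readers unfamiliar with the "99% of loss recovered" standard discussed above, here is a minimal sketch of one common way such a metric is computed. The exact definition varies between papers, so treat this as an assumption rather than the definition used in the post: the function name, argument names, and the choice of an ablated baseline are all hypothetical, and the numbers are made up for illustration.

```python
def fraction_loss_recovered(clean_loss: float,
                            ablated_loss: float,
                            explained_loss: float) -> float:
    """Fraction of the clean-vs-ablated loss gap closed by an explanation.

    clean_loss:     loss of the unmodified model
    ablated_loss:   loss when the component under study is ablated
                    (assumed baseline, e.g. zero or mean ablation)
    explained_loss: loss when the model is restricted to the proposed
                    explanation (e.g. a circuit or a reconstruction)
    """
    gap = ablated_loss - clean_loss
    if gap <= 0:
        raise ValueError("ablated_loss must exceed clean_loss")
    return (ablated_loss - explained_loss) / gap


# Hypothetical numbers, for illustration only (not results from any experiment).
clean, ablated, explained = 2.0, 6.0, 2.04
score = fraction_loss_recovered(clean, ablated, explained)
print(f"{score:.2%}")
```

Under this definition a score of 1.0 means the explanation reproduces the full model's loss and 0.0 means it does no better than the ablated baseline. The point in the comment is that pushing this number toward 0.99 with existing methods tends to require including a large fraction of the model.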
- johnswentworth 3 Dec 2023 7:07 UTC
  3 points
  2
  “It’s of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn’t going to go away with better methods).”
  I have the opposite expectation there; I think it’s just that current methods are pretty primitive.
  - wassname 19 Jan 2024 12:44 UTC
    7 points
    0
    I also think current methods are mostly focused on linear interp. But what if it just ain’t linear?