One piece missing here, insofar as current methods don’t get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%.
When you do that using existing methods, you lose the sparsity (e.g. for circuit finding you have to include a large fraction of the model to get to 99% loss recovered).
It’s of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn’t going to go away with better methods). I do expect we can improve; we’re very far from the 99% standard. But the way we improve won’t be by “drilling into the residual”; that has been tried and is insufficient. EDIT: Possibly by “drill into the residual” you mean “understand why the methods don’t work and then improve them”—if so I agree with that but also think this is what mech interp researchers want to do.
(Why am I still optimistic about interpretability? I’m not convinced that the 99% standard is required for downstream impact—though I am pretty pessimistic about the “enumerative safety” story of impact, basically for the same reasons as Buck and Ryan afaict.)
It’s of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn’t going to go away with better methods).
I have the opposite expectation there; I think it’s just that current methods are pretty primitive.
I also think current methods are mostly focused on linear interp. But what if it just ain’t linear?