Because I don’t think this is realistically useful, I don’t think it reduces my probability at all that your techniques are fake and your models of interpretability are wrong.
Maybe the groundedness you’re talking about comes from the fact that you’re doing interp on a domain of practical importance?
??? Come on, there’s clearly a difference between “we can find an Arabic feature when we go looking for anything interpretable” vs “we chose from the relatively small set of practically important things and succeeded in doing something interesting in that domain”. I definitely agree this isn’t yet close to “doing something useful, beyond what well-tuned baselines can do”. But this should presumably rule out some hypotheses that current interpretability results are due to an extreme streetlight effect?
(I suppose you could have already been 100% confident that results so far weren’t the result of an extreme streetlight effect and so you didn’t update, but imo that would just make you overconfident in how good current mech interp is.)
(I’m basically saying similar things to Lawrence.)
??? Come on, there’s clearly a difference between “we can find an Arabic feature when we go looking for anything interpretable” vs “we chose from the relatively small set of practically important things and succeeded in doing something interesting in that domain”.
Oh okay, you’re saying the core point is that this project was less streetlighty because the topic you investigated was determined by the field’s interest rather than by cherry-picking. I actually hadn’t understood that this is what you were saying. I agree that this makes the results slightly better.