Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.
Uhh… if we (GDM mech interp team) saw good results on any one of the eight things on your list, we’d probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn’t about downstream uses (e.g. I’m also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn’t one of your eight items, or a paper looking at downstream uses but not comparing against baselines. So just on this very basic outside view, I feel like the sum of your probabilities should be well under 100%, at least conditional on the next paper coming out of GDM. (I don’t feel like it would be that different if the next paper comes from OpenAI / Anthropic.)
The problem here is “next SAE paper to come out” is a really fragile resolution criterion that depends hugely on unimportant details like “what the team decided was a publishable unit of work”. I’d recommend you instead make time-based predictions (i.e. how likely are each of those to happen by some specific date).
Thanks for the comment. My probabilities sum to 165%, which translates to expecting the next paper to do 1.65 things from the list on average. That doesn’t seem too crazy to me, and it DOES match my expectations.
But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much.
It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won’t be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines where possible, which keeps 165% within the ballpark.
I still agree with your comment, especially the recommendation to make a time-based prediction (as I explained in my other comment here).
Thank you for your alignment work :)