Meta level: I wrote this post in 1-3 hours, and am very satisfied with the returns per unit time! I don’t think this is the best or most robust post I could have written, and I think some of these theories of impact are much more important than others. But I think that just collecting a ton of these in the same place was a valuable thing to do, and I have heard from multiple people who appreciated this post’s existence! More importantly, it was easy and fun, and I personally want to take this as inspiration to find more easy-to-write-yet-valuable things to do.
Object level: I think the key point I wanted to make with this post was “there’s a bunch of ways that interp can be helpful”, which I think basically stands. I go back and forth on how valuable it is to think about theories of impact day to day, vs just trying to do good science and pluck impactful low-hanging fruit, but I think that either way it’s valuable to have a bunch in mind rather than carefully back-chaining from a specific and fragile theory of change.
This post got some extensive criticism in Against Almost Every Theory of Impact of Interpretability, but I largely agree with Richard Ngo and Rohin Shah’s responses.