But there are like 10x more safety people looking into interpretability than into how models generalize from data, as far as I can tell.
I think interpretability is a really powerful lens for looking at how models generalize from data, partly just in terms of giving you a lot more stuff to look at than you would have purely by looking at model outputs.
If I want to understand the characteristics of how a car performs, I should of course spend some time driving the car around, measuring lots of things like acceleration curves and turning radius and power output and fuel consumption. But I should also pop open the hood, and try to figure out how the components interact, and how each component behaves in isolation in various situations, and, if possible, what that component’s environment looks like in various real-world conditions. (Also I should probably learn something about what roads are like, which I think would be analogous to “actually look at a representative sample of the training data”).