Quintin Pope comments on The Case for Radical Optimism about Interpretability

Quintin Pope 18 Dec 2021 21:26 UTC
3 points
I think you’re pointing to a special case of a more general pattern.
Just like there’s a general factor of “athletic ability” which can be subdivided into many correlated components, I think “interpretability” can be split into many correlated components. Some of those components correspond to greater reliability in the worst case scenarios. Others might correspond to, e.g., predicting which architectural modifications would improve performance on typical input.
Worst case reliability is probably the most important component to interpretability, but I’m not sure it makes sense to call it “real interpretability”. It’s probably better to just call it “worst case” interpretability.