idly comments on An Interpretability Illusion from Population Statistics in Causal Analysis

idly 2 Aug 2024 12:56 UTC
1 point
This weakness of SAEs is not surprising: it is a general weakness of any interpretation method that is computed from a model's behaviour on a selected dataset, because the resulting explanation reflects that dataset's distribution as much as it reflects the model. The same effect has been shown for permutation feature importances, partial dependence plots, Shapley values, integrated gradients and more. There is a reasonably large body of literature on the subject from the interpretable ML / explainable ML research communities over the last 5-10 years.
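To make the dataset dependence concrete, here is a minimal sketch using permutation feature importance. Everything in it is an assumption chosen for illustration and is not taken from the post: the toy `model`, the helper names `permutation_importance` and `make_data`, and the two synthetic datasets. Importance is measured as the mean squared change in the model's own output when one feature column is shuffled.

```python
import random

def model(row):
    # Hypothetical toy model: feature 1 acts as a gate that selects
    # feature 0 (gate positive) or feature 2 (gate non-positive).
    x0, x1, x2 = row
    return x0 if x1 > 0 else x2

def permutation_importance(predict, rows, rng):
    # For each feature, shuffle its column and measure the mean squared
    # change in the model's output relative to the unshuffled data.
    baseline = [predict(r) for r in rows]
    scores = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        mse = sum((predict(r) - b) ** 2 for r, b in zip(shuffled, baseline)) / len(rows)
        scores.append(mse)
    return scores

def make_data(rng, n, gate_low, gate_high):
    # Features 0 and 2 are standard normal; the gate (feature 1) is drawn
    # from a range that never crosses zero within a single dataset.
    return [(rng.gauss(0, 1), rng.uniform(gate_low, gate_high), rng.gauss(0, 1))
            for _ in range(n)]

rng = random.Random(0)
imp_a = permutation_importance(model, make_data(rng, 500, 0.5, 1.5), rng)    # gate always positive
imp_b = permutation_importance(model, make_data(rng, 500, -1.5, -0.5), rng)  # gate always negative

print("dataset A importances:", imp_a)
print("dataset B importances:", imp_b)
```

On dataset A only feature 0 gets a non-zero score, and on dataset B only feature 2 does. In both cases the gate itself scores exactly zero, because shuffling it never moves it across the threshold within that population. The model is identical in both runs; only the dataset used to compute the explanation differs.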