This weakness of SAEs is not surprising: it is shared by any interpretation method that is computed from a model's behaviour on a selected dataset. The same effect has been shown for permutation feature importance, partial dependence plots, Shapley values, integrated gradients and more. There is a reasonably large body of literature on the subject from the interpretable ML / explainable ML research communities over the last 5-10 years.
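To make the dataset dependence concrete, here is a minimal sketch using permutation feature importance rather than SAEs. Everything in it is an assumption chosen for illustration: the toy `model`, the two synthetic datasets, and the helper `permutation_importance` are hypothetical and do not come from any particular library or paper. The model is fixed, and only the dataset used to compute the importance changes.

```python
import numpy as np

def model(X):
    # Hypothetical fixed model: feature 1 only affects the output when it is positive.
    return X[:, 0] + np.maximum(X[:, 1], 0.0)

def permutation_importance(f, X, y, col, rng, n_repeats=20):
    # Mean increase in squared error after shuffling one column of X.
    base = np.mean((f(X) - y) ** 2)
    increases = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, col] = rng.permutation(Xp[:, col])
        increases.append(np.mean((f(Xp) - y) ** 2) - base)
    return float(np.mean(increases))

rng = np.random.default_rng(0)
n = 1000

# Dataset A: feature 1 is never positive, so the model never uses it here.
X_narrow = np.column_stack([rng.normal(size=n), -np.abs(rng.normal(size=n))])
# Dataset B: feature 1 takes both signs.
X_broad = np.column_stack([rng.normal(size=n), rng.normal(size=n)])

# Targets are the model's own outputs, so the baseline error is zero on both datasets.
imp_narrow = permutation_importance(model, X_narrow, model(X_narrow), 1, rng)
imp_broad = permutation_importance(model, X_broad, model(X_broad), 1, rng)

print("importance of feature 1 on dataset A:", imp_narrow)
print("importance of feature 1 on dataset B:", imp_broad)
```

On dataset A the method reports that feature 1 is irrelevant, while on dataset B the same model shows a clearly non-zero importance for it. The interpretation describes the model only on the data that was selected.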