I’m concerned that progress in interpretability research is ephemeral, being driven primarily by proxy metrics that may be disconnected from the end goal (understanding by humans). (Example: optimising for the L0 metric in SAE interpretability research may lead us to models that have more split features, even when this is unintuitive by human reckoning.)
It seems important for the field to agree on some common benchmark / proxy metric that is proven to be indicative of downstream human-rated interpretability, but I don’t know of anyone doing this. Similar to the role of BLEU in facilitating progress in NLP, I imagine having a standard metric would enable much more rapid and concrete progress in interpretability.
Interpretability needs a good proxy metric
I’m concerned that progress in interpretability research is ephemeral, being driven primarily by proxy metrics that may be disconnected from the end goal (understanding by humans). (Example: optimising for the L0 metric in SAE interpretability research may lead us to models that have more split features, even when this is unintuitive by human reckoning.)
It seems important for the field to agree on some common benchmark / proxy metric that is proven to be indicative of downstream human-rated interpretability, but I don’t know of anyone doing this. Similar to the role of BLEU in facilitating progress in NLP, I imagine having a standard metric would enable much more rapid and concrete progress in interpretability.