Mechanistic interpretability-based evals could search for inputs that trigger concerning combinations of features.
An early example of this on a vision model is Activation Atlas (https://distill.pub/2019/activation-atlas/).
Specifically, in the section "Focusing on a Single Classification", the authors use feature visualization to spot spurious correlations in activation space, then use those observations to construct new inputs on which the model fails.
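
As a rough illustration of what "finding inputs that lead to a concerning combination of features" could look like, here is a minimal toy sketch. It is not the Activation Atlas method: the feature directions, activation vectors, threshold, and function names are all hypothetical, and a real eval would take activations from an actual model and features from an interpretability tool.

```python
# Toy sketch: flag inputs whose activations fire on two features at once.
# Every name and number below is a hypothetical illustration.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def coactivation_score(activation, feature_a, feature_b):
    """Return the weaker of the two feature projections.

    Taking min() means the score is high only when BOTH features fire.
    """
    return min(dot(activation, feature_a), dot(activation, feature_b))

def flag_inputs(activations, feature_a, feature_b, threshold):
    """Return (name, score) pairs scoring above threshold, highest first."""
    scored = [
        (name, coactivation_score(act, feature_a, feature_b))
        for name, act in activations.items()
    ]
    flagged = [(name, score) for name, score in scored if score > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical feature directions in a 3-dimensional activation space.
feature_a = [1.0, 0.0, 0.0]
feature_b = [0.0, 1.0, 0.0]

# Hypothetical activations for four candidate inputs.
activations = {
    "input_0": [0.9, 0.1, 0.3],
    "input_1": [0.8, 0.7, 0.0],
    "input_2": [0.2, 0.9, 0.5],
    "input_3": [0.6, 0.9, 0.1],
}

flagged = flag_inputs(activations, feature_a, feature_b, threshold=0.5)
for name, score in flagged:
    print(name, score)
```

Here only the inputs that activate both features strongly are flagged. Using the minimum of the two projections is one simple design choice among several; a product or a learned probe would be reasonable alternatives.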