I was thinking something like formal conditional on mechanistic interpretation of neuron/feature/subnetwork. Which yeah, isn’t formal in strongest sense, but could give you some guarantees that don’t require full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map the feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I’m struggling to come up with more concrete examples)
I was thinking something like formal conditional on mechanistic interpretation of neuron/feature/subnetwork. Which yeah, isn’t formal in strongest sense, but could give you some guarantees that don’t require full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map the feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I’m struggling to come up with more concrete examples)