Curious on your thoughts on synergies between mechanistic interpretability and formal verification. One of the main problems in formal verification seems to be specification—how to define the safety properties in terms of input output bounds. But if we use the abstractions/features learned by networks discovered by mech-interp (rather than raw input and output space) specification may be more tractable.
But wouldn’t this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability.
I was thinking something like formal conditional on mechanistic interpretation of neuron/feature/subnetwork. Which yeah, isn’t formal in strongest sense, but could give you some guarantees that don’t require full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map the feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I’m struggling to come up with more concrete examples)
Curious on your thoughts on synergies between mechanistic interpretability and formal verification. One of the main problems in formal verification seems to be specification—how to define the safety properties in terms of input output bounds. But if we use the abstractions/features learned by networks discovered by mech-interp (rather than raw input and output space) specification may be more tractable.
I don’t work on this, so grain of salt.
But wouldn’t this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability.
I was thinking something like formal conditional on mechanistic interpretation of neuron/feature/subnetwork. Which yeah, isn’t formal in strongest sense, but could give you some guarantees that don’t require full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map the feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I’m struggling to come up with more concrete examples)