Oliver Daniels comments on Eight Strategies for Tackling the Hard Part of the Alignment Problem

Oliver Daniels 13 Jul 2023 16:19 UTC
1 point
0
Curious about your thoughts on synergies between mechanistic interpretability and formal verification. One of the main problems in formal verification seems to be specification: how to define the safety properties in terms of input-output bounds. But if we write the specification over the abstractions/features that networks learn, as discovered by mech-interp (rather than over raw input and output space), specification may be more tractable.
- scasper 13 Jul 2023 17:03 UTC
  1 point
  0
  I don’t work on this, so grain of salt.
  But wouldn’t this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability.
  - Oliver Daniels 13 Jul 2023 18:04 UTC
    3 points
    2
    I was thinking of something like verification that is formal conditional on a mechanistic interpretation of a neuron/feature/subnetwork. Which, yeah, isn’t formal in the strongest sense, but it could give you some guarantees that don’t require a full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map feature A and feature B, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I’m struggling to come up with more concrete examples.)
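
A minimal sketch of the kind of conditional guarantee described in the last comment, under loud assumptions: the two-layer ReLU network, its weights, and the labels "feature A" (first coordinate of an intermediate activation) and "feature B" (a linear readout) are all hypothetical and made up for illustration. Interval bound propagation is swapped in here as one simple, sound verification technique; the thread does not name a specific method. The point is only that the bound on feature B is proven over a range of feature A without explaining the mechanism in between.

```python
# Toy interval bound propagation (IBP). All weights and feature labels
# below are hypothetical; in practice mech-interp would have to justify
# which coordinate or direction counts as "feature A" or "feature B".

def affine_bounds(W, b, lo, hi):
    """Sound elementwise bounds for y = W x + b given lo <= x <= hi."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, x_lo, x_hi in zip(row, lo, hi):
            if w >= 0:
                l += w * x_lo
                h += w * x_hi
            else:
                l += w * x_hi
                h += w * x_lo
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

# Hypothetical sub-network between the two features.
W1 = [[2.0, 0.5], [1.0, -0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]  # linear readout assumed to be "feature B"
b2 = [0.0]

def feature_b_bounds(a_range, other_range):
    """Bound feature B given a range for feature A (coordinate 0)
    and a range for the remaining coordinate."""
    lo = [a_range[0], other_range[0]]
    hi = [a_range[1], other_range[1]]
    lo, hi = relu_bounds(*affine_bounds(W1, b1, lo, hi))
    return affine_bounds(W2, b2, lo, hi)

# Claim checked: if feature A is in [1, 2] and the other coordinate is
# in [-1, 1], then feature B is at least b_lo.
(b_lo,), (b_hi,) = feature_b_bounds((1.0, 2.0), (-1.0, 1.0))
print(b_lo, b_hi)
```

The guarantee is only as meaningful as the semantic labels attached to the two features, which is the sense in which it is formal conditional on the interpretation.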