I’m very interested in mechanistic interpretability as a testing ground for the following (see the sketch after this list):
* Selection Theorems
* Natural Abstractions
* Shard Theory
* Other theories about neural networks
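As a toy illustration of the kind of test I have in mind, here is a minimal sketch; the task, the architecture, and the hypothesised abstraction are all invented for illustration, not anyone’s actual experiment. The idea: a theory like Natural Abstractions might predict that a network trained to answer “does sum(x) exceed a threshold?” will linearly represent the sum itself, and a linear probe on the hidden layer can check that prediction.

```python
# Minimal sketch: test a theory's prediction about network internals.
# Everything here (task, architecture, the "sum" abstraction) is an
# illustrative assumption, not an established experiment.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic task: x ~ U[0,1]^10, label = 1 iff sum(x) > 5.
X = torch.rand(4096, 10)
y = (X.sum(dim=1, keepdim=True) > 5.0).float()

# Small MLP; we will probe the post-ReLU hidden activations.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(500):  # full-batch training on the toy task
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# Linear probe: can the hypothesised abstraction (the sum) be read off
# the hidden activations by ordinary least squares on held-out data?
with torch.no_grad():
    X_test = torch.rand(1024, 10)
    acts = model[:2](X_test)                           # post-ReLU activations
    H = torch.cat([acts, torch.ones(1024, 1)], dim=1)  # add a bias column
    s = X_test.sum(dim=1, keepdim=True)                # the predicted abstraction
    w = torch.linalg.lstsq(H, s).solution
    r2 = 1.0 - ((H @ w - s).var() / s.var())
    print(f"linear-probe R^2 for the sum: {r2.item():.3f}")
```

A high R² is (weak) evidence for the prediction and a low one is evidence against it; the point is just that the theory yields a concrete, checkable claim about network internals rather than only about behaviour.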
Oh, I agree 100% that “the empirical data should serve to let us refine and enhance our theories, not to displace them”. That is how science works in general. My beef is with focusing mostly on theory because “we only have one shot”. My point is “if you think you only have one shot, figure out how to get more shots”.
I don’t think we only have one shot in the mainline (I expect slow takeoff). I think theory is especially valuable if we only have one (or a few) shots. I should edit the OP to make that clear.
I prefer theory rooted in solid empirical data[1].
I’m sympathetic to iterative cycles, but the empirical data should serve to let us refine and enhance our theories, not to displace them.
Empirical data does not exist in the absence of theory; observations only convey information once they are interpreted through some particular theory.
Formal guarantees have the power to:
* Apply even as the system scales up
* Generalise far out of distribution
* Confer very high “all things considered” confidence
* Transfer to derivative systems
* Apply even under adversarial optimisation

These remain desiderata that arguments for the existential safety of powerful AI systems need to satisfy (see the sketch after this list).
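As a minimal sketch of why a proof can plausibly have these properties while a finite test suite cannot, consider the toy Lean example below (`hypotheticalClip` and the theorem are invented for illustration, assuming only core Lean 4): the theorem quantifies over every input, so the guarantee survives scaling, distribution shift, and adversarial choice of input.

```lean
-- Toy example only: `hypotheticalClip` is an invented stand-in for a
-- safety-relevant component, not any real system or proposal.
def hypotheticalClip (x cap : Nat) : Nat :=
  if x ≤ cap then x else cap

-- A formal guarantee: for *every* input `x`, including an adversarially
-- chosen one, the output never exceeds `cap`. A test suite can only ever
-- check finitely many inputs; the theorem covers all of them at once.
theorem clip_never_exceeds_cap (x cap : Nat) :
    hypotheticalClip x cap ≤ cap := by
  unfold hypotheticalClip
  split
  · assumption
  · exact Nat.le_refl cap
```

An empirical check of the same property could only sample finitely many `(x, cap)` pairs; the theorem rules out a counterexample anywhere in the input space.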