In the spirit of Evan’s original post, here’s a (half-baked) simple model:
Simplicity claims are claims about how many bits (in the human prior) it takes to explain[1] some amount of performance in the NN prior.
E.g., suppose we train a model which gets 2 nats of loss with 100 billion parameters, and we can explain this model achieving 2.5 nats of loss using a 300 KB human-understandable manual. (We might run into issues with irreducible complexity such that producing a useful manual is hard, but let’s set that aside for now.)
So, ‘simplicity’ of this sort is lower bounded by the relative parameter efficiency of neural networks in practice vs. the human prior.
In practice, you do worse than this insofar as NNs express things which are anti-natural in the human prior (in terms of parameter efficiency).
We can also reason about how ‘compressible’ the explanation is under a naive prior (e.g., a formal framework for expressing explanations which doesn’t utilize cleverer reasoning technology than NNs themselves). ‘Compressible’ isn’t quite the right word here: presumably maximal compression ends up getting you degenerate results, as compression usually does.
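To make the bit-counting concrete, here’s a rough back-of-the-envelope sketch in Python for the scenario above. The 16-bits-per-parameter storage assumption is mine (not from the text), and the numbers are purely illustrative:

```python
# Illustrative arithmetic for the 100B-parameter / 300 KB manual example.
# Assumption: parameters stored at 16 bits each (e.g., fp16); this is a
# guess for the sketch, not a claim from the post.
n_params = 100e9                  # 100 billion parameters
bits_per_param = 16
nn_bits = n_params * bits_per_param

manual_bytes = 300e3              # 300 KB human-understandable manual
manual_bits = manual_bytes * 8

ratio = nn_bits / manual_bits     # how many NN bits per manual bit
print(f"NN description size: {nn_bits:.3g} bits")
print(f"Manual size:         {manual_bits:.3g} bits")
print(f"Ratio:               {ratio:.3g}x")
```

On these (made-up) numbers the manual is about five to six orders of magnitude smaller than the raw parameter description, though it only explains the model down to 2.5 nats rather than the full 2 nats the model actually achieves.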
[1] By ‘explain’, I mean something like the idea of heuristic arguments from ARC.