Perhaps this problem can be overcome by including checks for generalization during training, i.e., testing how well the program generalizes to various test distributions.
I don’t think this gets at the core difficulty of speed priors not generalizing well. Say we generate a bunch of lookup-table-ish things according to the speed prior, and then reject all the ones that don’t generalize to our test set. The majority of the models that pass our check are going to be basically the same as the rejected ones, plus whatever modification is most probable under the speed prior among those that let them pass, i.e. the minimum number of additional lookup-table entries needed to pass the check.
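Here’s a toy version of what I mean; the “world”, the prior, and the check are all invented for illustration, so the specifics don’t matter:

```python
import random

random.seed(0)

# Toy "world" the lookup tables are memorizing.
def true_fn(x):
    return (3 * x + 7) % 10

train_xs = list(range(20))   # inputs the tables were fit on
test_xs = [23, 27]           # small held-out generalization check

def sample_table():
    # Crude stand-in for a speed prior: each extra entry beyond the
    # training set halves the probability, so bigger (longer-to-run)
    # tables are exponentially disfavored.
    table = {x: true_fn(x) for x in train_xs}
    while random.random() < 0.5:
        extra = random.randrange(20, 40)
        table[extra] = true_fn(extra)
    return table

def passes_check(table):
    # Reject any table that misses a held-out input.
    return all(table.get(x) == true_fn(x) for x in test_xs)

survivors = [t for t in (sample_table() for _ in range(100_000))
             if passes_check(t)]
# The single most probable surviving program is the training-set table
# plus exactly the two entries needed to cover test_xs: a minimal
# patch, not anything more general.
sizes = sorted(len(t) for t in survivors)
print(len(survivors), sizes[0], sizes[len(sizes) // 2])
```

If you run this, the surviving tables should cluster just above the training-set size: conditioning on the check only buys you the cheapest patch the prior can find.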
Here’s maybe an argument that the generalization/deception tradeoff is unavoidable: according to the Mingard et al. picture of why neural nets generalize, it’s basically just that they are cheap approximations of rejection sampling with a simplicity prior. This generalizes by relying on the fact that reality is, in fact, simplicity-biased: it’s just Ockham’s razor. On this picture, neural nets generalize exactly to the extent that they approximate a simplicity prior, and so any attempt to sample according to an alternative prior will lose generalization ability as it gets “further” from the True Distribution of the World.
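As a sketch of that picture, assuming a discrete toy setup (the hypothesis space, the zlib-based complexity proxy, and the target function below are all my inventions, not anything from Mingard et al.):

```python
import itertools
import zlib
import numpy as np

# Hypothesis space: all boolean functions on 4-bit inputs,
# i.e. all 2**16 truth tables.
tables = list(itertools.product([0, 1], repeat=16))

# Crude simplicity prior: weight 2**-K(h), with K(h) proxied by the
# zlib-compressed length of the truth table.
weights = np.array([2.0 ** -len(zlib.compress(bytes(t))) for t in tables])
weights /= weights.sum()

# The world is simple: a period-4 pattern over the 16 inputs.
target = tuple([0, 1, 1, 0] * 4)
train_idx = [0, 2, 3, 5, 8, 9, 12, 14]   # observed inputs
test_idx = [1, 4, 6, 7, 10, 11, 13, 15]  # held-out inputs

# Rejection sampling is just conditioning: zero out every hypothesis
# that disagrees with the training data and renormalize.
consistent = np.array([all(t[i] == target[i] for i in train_idx)
                       for t in tables], dtype=float)
posterior = weights * consistent
posterior /= posterior.sum()

# Posterior-expected accuracy on the held-out inputs.
test_acc = sum(p * np.mean([t[i] == target[i] for i in test_idx])
               for p, t in zip(posterior, tables) if p > 0)
print(f"expected test accuracy: {test_acc:.2f}")
```

To the extent the prior’s simplicity bias matches the (simple) world, the posterior mass should land on hypotheses that also get the held-out inputs right; swap in a uniform prior over truth tables and expected test accuracy drops to chance.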
I think this is a little bit off. The world doesn’t have a True Distribution; it’s just the world. A more careful treatment would involve talking about why we expect Solomonoff induction to work well, why the speed prior (as in the universal search prior) also works in theory, and what you think might be different in practice (e.g., if you’re actually constructing a program with gradient descent using something like “description length” or “runtime” as a loss).
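To make that last parenthetical concrete, here’s the kind of construction I have in mind; this is my construal, using an L1 penalty on the weights as a very crude differentiable proxy for “description length” (a “runtime” version would instead penalize per-example compute, which needs something like an adaptive-depth architecture):

```python
import torch

# Hypothetical setup: a small net trained on a simple target, with the
# loss augmented by an L1 weight penalty standing in for description
# length. The architecture, data, and 1e-4 coefficient are arbitrary.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(256, 10)
y = x[:, :2].sum(dim=1, keepdim=True)  # target depends on 2 of 10 inputs

for step in range(2000):
    task_loss = torch.nn.functional.mse_loss(model(x), y)
    # Differentiable proxy for description length; the true quantity
    # (shortest program producing the weights) is not differentiable.
    desc_len = sum(p.abs().sum() for p in model.parameters())
    loss = task_loss + 1e-4 * desc_len
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The gap between whatever this kind of regularized descent actually finds and what the idealized universal-search prior would sample is exactly the “in practice” question above.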
Nice post!