I certainly agree that the most extreme version of the speed prior is degenerate—but given that you need to trade off speed and complexity under a finite compute budget, the simplest-yet-slow algorithm will often perform much worse than a slightly-more-complicated-yet-faster algorithm.
For example, reasoning abstractly about how to play Go performs a lot worse than applying really good learned heuristics for playing Go. Those heuristics are indeed kind of a step towards a “lookup table”, but that’s ok sometimes, given that an approach which relies on thinking for a very long time (e.g. heuristic-free MCTS) will find much worse moves within a given time budget.
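For concreteness, this is roughly how a learned heuristic enters the search in AlphaGo-style MCTS (the standard PUCT rule, quoted here as background rather than anything from the original comment): the policy prior P(s,a) concentrates simulations on a handful of candidate moves instead of spreading them uniformly.

```latex
% PUCT selection used in AlphaGo-style MCTS: pick the action maximizing
%   Q(s,a) + U(s,a),
% where the exploration bonus is scaled by the learned policy prior P(s,a):
U(s,a) \;=\; c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}
```

With a uniform (heuristic-free) prior, the same simulation budget gets spread over all legal moves, which is why the un-guided search does so much worse per unit of compute.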
You might already know this, but asking for perfect generality on arbitrary tasks seems like basically asking for Solomonoff induction—in which case you might reformulate your point as saying “the longer you run Solomonoff induction, the more general the class of problems you can solve”. I think this is true in the limit, but for any finite-complexity distribution of training environments, you do eventually hit diminishing returns from running your Solomonoff inductor for longer.
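To make the trade-off explicit, here are the standard definitions I have in mind (textbook formulas, not something from your presentation): the Solomonoff prior charges hypotheses only for description length, whereas Levin-style measures also charge for runtime, which is the spirit of the speed prior.

```latex
% Solomonoff prior: weight every program p whose output starts with x
% purely by its length \ell(p).
M(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-\ell(p)}

% Levin's Kt complexity additionally charges log(runtime), so "simple but
% astronomically slow" programs stop dominating:
Kt(x) \;=\; \min_{p} \bigl\{\, \ell(p) + \log_2 t(p) \;:\; U(p) = x \,\bigr\}
```

The speed prior is in the same spirit as Kt: it discounts programs by their runtime as well as by their length.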
I think there’s a continuum of speed vs generality. You can add more complexity to improve speed, but it comes at the cost of generality (assuming all algorithms are implemented in an equally efficient manner).
Given a finite compute budget, it’s definitely possible to choose an algorithm that’s more general than your budget can afford. In that case, you run into underfitting; then you can switch to a more complex algorithm that’s more specialized to the specific distribution you’re trying to model. Such an algorithm is faster to run on your training distribution, but it’s less likely to generalize to other distributions.
The overall point of the presentation is that such a tendency isn’t just some empirical pattern that we can only count on for algorithms developed by humans. It derives from an information-theoretic perspective on the relation between algorithmic complexity, runtime, and input-space complexity.
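As a toy illustration of that relation (my own example, not from the presentation): take the task “is n prime?” over inputs below some bound N. Trial division is a tiny, fully general program whose per-query runtime grows with n; a precomputed table is essentially a lookup table whose description grows with the input space but answers in constant time.

```python
# Toy illustration (hypothetical example, not from the presentation): trading
# description length for runtime on the task "is n prime?" for inputs below N.
import sys
import time


def is_prime_general(n: int) -> bool:
    """Tiny, fully general program: constant description length, ~O(sqrt(n)) time."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def build_lookup_table(N: int) -> list:
    """Specialized 'program': its description grows with the input space it
    covers, but each query is a constant-time indexing operation."""
    sieve = [True] * N
    sieve[0:2] = [False, False]  # 0 and 1 are not prime
    for i in range(2, int(N ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return sieve


if __name__ == "__main__":
    N = 1_000_000
    table = build_lookup_table(N)
    n = 999_983  # a large input just below N

    print(f"table size: ~{sys.getsizeof(table)} bytes for inputs below {N:,}")

    t0 = time.perf_counter()
    fast = table[n]
    t1 = time.perf_counter()

    t2 = time.perf_counter()
    slow = is_prime_general(n)
    t3 = time.perf_counter()

    assert fast == slow
    print(f"lookup query:  {t1 - t0:.2e} s")
    print(f"general query: {t3 - t2:.2e} s")
```

Widening the input space (raising N) leaves the general program’s description length untouched but inflates the table, which is the complexity/runtime/input-space relation in miniature.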
Edit: to clarify an implicit assumption in my presentation, I’m assuming all algorithms are run for as long as needed to correctly decide all inputs. The analysis focuses on how runtime changes as the algorithm gets simpler and the input space widens.