I’ve been thinking about how speed, simplicity and generality trade off against one another, and I tentatively think there’s some sort of Pareto-optimal relationship between speed, simplicity and generality. The key is to view a given program as an implicit encoding of the distribution of inputs that it correctly decides. As the program generalizes, that distribution of correctly decided inputs expands; as the program becomes simpler, its capacity to specify an efficient encoding of that distribution decreases. Both factors point towards the program taking longer to execute on inputs randomly sampled from its input distribution.
You can see a presentation about my current intuitions which I prepared for my PhD advisor: https://docs.google.com/presentation/d/1JG9LYG6pDt7T6WcU9vpU6pqwSR5xjcaEJmKAudRtNPs/edit?usp=sharing
I hope to have a full write-up once I’m more certain about how to think about the arguments in the presentation. I’m pretty unsure about this line of thought, but I think it might be moving towards some pretty important insights regarding simplicity vs speed vs generality. I’d be happy for any feedback you have!
One thing I feel pretty confident about is that a pure speed prior does not generalize (like, at all). I think it just gives you a lookup table (e.g. a binary search tree) containing only your training data.
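To make that concrete, here’s a toy sketch (purely illustrative, not anything from the presentation): if you select programs purely for speed on a fixed training set, the winner is essentially the training set itself, memorized.

```python
# Toy illustration (for intuition only): the "fastest program" that fits a
# finite training set is basically that training set stored as a table.

training_data = {"abba": True, "abc": False, "racecar": True, "xy": False}

def lookup_table(s: str) -> bool:
    """Just replays memorized answers; undefined off the training set."""
    return training_data[s]          # KeyError on anything unseen

def general_rule(s: str) -> bool:
    """An actual rule (is s a palindrome?), defined for every string."""
    return s == s[::-1]

# Both agree on the training data...
assert all(lookup_table(s) == general_rule(s) for s in training_data)

# ...but only the rule has an opinion about a new input.
print(general_rule("level"))    # True
# print(lookup_table("level"))  # KeyError: no generalization at all
```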
I certainly agree that the most extreme version of the speed prior is degenerate—but given that you need to trade off speed and complexity under a finite compute budget, the simplest-yet-slow algorithm will often perform much worse than a slightly-more-complicated-yet-faster algorithm.
For example, reasoning abstractly about how to play Go is a lot worse than learning really good heuristics for playing Go. Those heuristics are indeed a step towards a “lookup table”, but that’s sometimes fine, given that an approach which only pays off with unbounded thinking (e.g. heuristic-free MCTS) will find much worse moves within a given time budget.
You might already know this, but asking for perfect generality on arbitrary tasks is basically asking for Solomonoff induction, in which case you might reformulate your point as "the longer you run Solomonoff induction, the more general the class of problems you can solve". I think this is true in the limit, but for any finite-complexity distribution of training environments, you eventually hit diminishing returns from running your Solomonoff inductor for longer.
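For what it’s worth, one standard object that already puts description length and runtime on a single scale is Levin’s Kt; this is my framing of the connection, not something taken from the presentation or the parent comment:

```latex
% Levin's Kt complexity: program length and log-runtime on one scale.
% U is a fixed universal machine, \ell(p) the length of program p, and
% t(p) the number of steps p runs before outputting x.
Kt(x) \;=\; \min_{p \,:\, U(p) = x} \bigl[\, \ell(p) + \log_2 t(p) \bigr]
```

Levin-style universal search correspondingly gives each program a slice of the time budget that shrinks exponentially with the program’s length, so doubling the budget only admits programs about one bit longer. That is one concrete sense in which running longer buys generality, and also why the returns taper off once the training environments’ complexity has been covered.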
I think there’s a continuum of speed vs generality. You can add more complexity to improve speed, but it comes at the cost of generality (assuming all algorithms are implemented in an equally efficient manner).
Given a finite compute budget, it’s definitely possible to choose an algorithm that’s more general than your budget can afford; in that case, you hit underfitting issues. You can then find a more complex algorithm that’s more specialized to the specific distribution you’re trying to model. Such an algorithm is faster to run on your training distribution, but is less likely to generalize to other distributions.
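A concrete (and entirely hypothetical) instance of that move, assuming the “training distribution” is just small integers: the specialized version below spends extra description on a precomputed table for the expected range, answers in constant time there, and simply refuses inputs outside it. The function names and LIMIT are mine, purely for illustration.

```python
# Hypothetical sketch: trading description length for speed on an assumed
# input distribution (integers up to LIMIT), at the cost of generality.

def is_prime_general(n: int) -> bool:
    """Short and fully general, but O(sqrt(n)) work per query."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

# The "more complex" alternative carries a sieve specialized to the expected range.
LIMIT = 1_000_000
_sieve = bytearray([1]) * (LIMIT + 1)
_sieve[0:2] = b"\x00\x00"
for i in range(2, int(LIMIT ** 0.5) + 1):
    if _sieve[i]:
        _sieve[i * i :: i] = bytearray(len(_sieve[i * i :: i]))

def is_prime_specialized(n: int) -> bool:
    """O(1) per query on the expected range; no opinion beyond it."""
    if not 0 <= n <= LIMIT:
        raise ValueError("outside the distribution this was specialized for")
    return bool(_sieve[n])

# Agrees with the general rule wherever both apply, and is much faster there,
# but the speedup was bought by narrowing the set of inputs it handles at all.
assert all(is_prime_general(n) == is_prime_specialized(n) for n in range(10_000))
```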
The overall point of the presentation is that such a tendency isn’t just some empirical pattern that we can only count on for algorithms developed by humans. It derives from an information-theoretic perspective on the relation between algorithmic complexity, runtime and input space complexity.
Edit: to clarify an implicit assumption in my presentation, I’m assuming all algorithms are run for as long as needed to correctly decide all inputs. The analysis focuses on how the runtime changes as the simplicity increases and the input space widens.
Pareto-optimal relationship between speed, simplicity and generality

This is an interesting subject. I think that the average slope of the speed-simplicity frontier might give a good measure of the complexity of an object, specifically of the number of layers of emergent behavior that the object exhibits.
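One way one might try to make that precise (this is my sketch of the idea; T*, ℓ and the k’s are assumed notation, not the commenter’s): for a fixed task, plot the best runtime achievable at each description-length budget and look at how steeply it falls.

```latex
% Possible formalization sketch (my notation). For a fixed task, say deciding
% a set S, and a description-length budget k, let T*(k) be the best runtime
% (e.g. averaged over the input distribution) achievable within that budget:
T^{*}(k) \;=\; \min\bigl\{\, t(p) \;:\; \ell(p) \le k,\ p \text{ decides } S \,\bigr\}

% T*(k) is non-increasing in k; the speed-simplicity frontier is its graph, and
% its average slope over the useful range of description lengths would be
\frac{T^{*}(k_{\min}) - T^{*}(k_{\max})}{k_{\max} - k_{\min}}

% i.e. how much runtime each additional bit of description buys back, on average.
```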