In short, the probability distribution you choose contains lots of interesting assumptions about what states are more likely that you didn’t necessarily intend. As a result most of the possible hypotheses have vanishingly small prior probability and you can never reach them. Even though with a frequentist approach
For example, let us consider trying to learn a function with 1-dim numerical input and output (e.g. R→R). Correspondingly, your hypothesis space is the set of all such functions. There are very many functions (infinitely many if RR, otherwise a crazy number).
You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It’s uncomputable, and most of the functions that fit the data may contain agents that try to get you to do their bidding, all sorts of problems.
What other prior probability distribution can we place on the hypothesis space? The obvious choice in 2023 is a neural network with random weights. OK, let’s think about that. What architecture? The most sensible thing is to randomize over architectures somehow. Let’s hope the distribution on architectures is as simple as possible.
Okay, let’s give up and place some arbitrary distribution (e.g. geometric distribution) on the width.
What about the prior on weights? uh idk, zero-mean identity covariance Gaussian? Our best evidence says that this sucks.
At this point you’ve made so many choices, which have to be informed by what empirically works well, that it’s a strange Bayesian reasoner you end up with. And you haven’t even specified your prior distribution yet.
You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It’s uncomputable, and most of the functions that fit the data may contain agents that try to get you to do their bidding, all sorts of problems.
In short, the probability distribution you choose contains lots of interesting assumptions about what states are more likely that you didn’t necessarily intend. As a result most of the possible hypotheses have vanishingly small prior probability and you can never reach them. Even though with a frequentist approach
For example, let us consider trying to learn a function with 1-dim numerical input and output (e.g. R→R). Correspondingly, your hypothesis space is the set of all such functions. There are very many functions (infinitely many if RR, otherwise a crazy number).
You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It’s uncomputable, and most of the functions that fit the data may contain agents that try to get you to do their bidding, all sorts of problems.
What other prior probability distribution can we place on the hypothesis space? The obvious choice in 2023 is a neural network with random weights. OK, let’s think about that. What architecture? The most sensible thing is to randomize over architectures somehow. Let’s hope the distribution on architectures is as simple as possible.
How wide, how deep? You don’t want to choose an arbitrary distribution or (god forbid) arbitrary number, so let’s make it infinitely wide and deep! It turns out that an infinitely wide network just collapses to a random process without any internal features. It turns out an infinitely deep network, but that collapses to a stationary distribution which doesn’t depend on the input. Oops.
Okay, let’s give up and place some arbitrary distribution (e.g. geometric distribution) on the width.
What about the prior on weights? uh idk, zero-mean identity covariance Gaussian? Our best evidence says that this sucks.
At this point you’ve made so many choices, which have to be informed by what empirically works well, that it’s a strange Bayesian reasoner you end up with. And you haven’t even specified your prior distribution yet.
This seems false if you’re interacting with a computable universe, and don’t need to model yourself or copies of yourself. Computability of the prior also seems irrelevant if I have infinite compute. Therefore in this prediction task, I don’t see the problem in just using the first thing you mentioned.
Reasonable people disagree. Why should I care about the “limit of large data” instead of finite-data performance?