Even under these assumptions, you still have the problem of handling belief states which cannot be described as a probability distribution. For small state spaces, being fast and loose with that (e.g. just assuming the uniform distribution over everything) is fine, but larger state spaces run into problems, even if you have infinite compute, can prove everything, and don’t need to have self-knowledge.
What sort of problems?
In short, the probability distribution you choose contains lots of interesting assumptions about which states are more likely, assumptions you didn’t necessarily intend to make. As a result, most of the possible hypotheses get vanishingly small prior probability and you can never reach them, even though a frequentist approach would not have ruled them out from the start.
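To make “can never reach them” concrete (the specific numbers below are mine, purely for illustration), write Bayes’ theorem in odds form:

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} \;=\; \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}.$$

If the prior you happened to pick gives some hypothesis $2^{-100}$ times the mass of a rival, you need a likelihood ratio of roughly $2^{100}$ in its favour before it even draws level, however natural that hypothesis would have looked to a method that doesn’t bake in a prior.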
For example, let us consider trying to learn a function with 1-dimensional numerical input and output (e.g. R→R). Correspondingly, your hypothesis space is the set of all such functions. There are very many functions (infinitely many if we really mean R→R, otherwise a crazy number).
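To put a rough number on “a crazy number” (the 16-bit discretization below is an arbitrary choice of mine, just to make the count finite):

```python
import math

# Toy count (the 16-bit discretization is my own arbitrary choice): a "function"
# is a choice of one of 2**16 outputs for each of the 2**16 possible inputs.
n_inputs = 2 ** 16
n_outputs = 2 ** 16

# |hypothesis space| = n_outputs ** n_inputs = 2 ** (16 * 2 ** 16)
log2_size = 16 * n_inputs
digits = math.floor(log2_size * math.log10(2)) + 1
print(f"|hypothesis space| = 2^{log2_size}  (about {digits} decimal digits)")
```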
You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It’s uncomputable, and most of the programs that fit the data may contain agents that try to get you to do their bidding, among all sorts of other problems.
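For reference, the textbook form of the prior being waved away here (standard material, nothing specific to this exchange):

$$M(x) \;=\; \sum_{p \,:\, U(p)\ \text{begins with}\ x} 2^{-|p|},$$

where $U$ is a universal prefix Turing machine and $|p|$ is the length of program $p$ in bits. Even deciding which programs belong in that sum runs into the halting problem, which is where the uncomputability comes from.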
What other prior probability distribution can we place on the hypothesis space? The obvious choice in 2023 is a neural network with random weights. OK, let’s think about that. What architecture? The most sensible thing is to randomize over architectures somehow; let’s at least make the distribution over architectures as simple as possible.
How wide, how deep? You don’t want to choose an arbitrary distribution or (god forbid) an arbitrary number, so let’s make it infinitely wide and deep! It turns out that an infinitely wide network just collapses to a Gaussian process with no internal feature learning, and an infinitely deep network collapses to a stationary distribution which doesn’t depend on the input. Oops.
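If you want to poke at the depth half of that claim empirically, here is a minimal numpy sketch (the width, depth, tanh nonlinearity, and weight/bias scales are all illustrative choices of mine; the width half is the standard network-to-Gaussian-process limit and isn’t demoed here):

```python
import numpy as np

# Minimal sketch: two unrelated inputs go through the same deep random tanh
# network, and we track how similar their hidden activations become.  The
# iteration h -> tanh(W h + b) forgets its starting point, so the deep
# representation barely depends on which input was fed in.
rng = np.random.default_rng(0)
width, depth = 512, 60

h1 = rng.normal(size=width)   # first input
h2 = rng.normal(size=width)   # second, unrelated input

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"layer  0: cosine(h1, h2) = {cosine(h1, h2):+.3f}")   # ~0 for unrelated inputs
for layer in range(1, depth + 1):
    W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
    b = rng.normal(scale=0.5, size=width)
    h1, h2 = np.tanh(W @ h1 + b), np.tanh(W @ h2 + b)
    if layer % 15 == 0:
        print(f"layer {layer:2d}: cosine(h1, h2) = {cosine(h1, h2):+.3f}")
```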
Okay, let’s give up and place some arbitrary distribution (e.g. geometric distribution) on the width.
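For concreteness, here is what one version of that prior looks like as a sampler; every number in it (the geometric parameters, the tanh, the Gaussian weight scale) is an arbitrary choice of mine, which is rather the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_function_from_prior(p_depth=0.3, p_width=0.05):
    """Draw one random function R -> R: geometric depth, geometric widths,
    Gaussian weights and biases (all arbitrary choices)."""
    depth = rng.geometric(p_depth)               # number of hidden layers (>= 1)
    widths = rng.geometric(p_width, size=depth)  # width of each hidden layer (>= 1)
    dims = [1, *widths, 1]                       # 1-d input ... 1-d output
    layers = [(rng.normal(size=(m, n)) / np.sqrt(n), rng.normal(size=m))
              for n, m in zip(dims[:-1], dims[1:])]

    def f(x):
        h = np.asarray(x, dtype=float).reshape(-1, 1)
        for i, (W, b) in enumerate(layers):
            h = h @ W.T + b
            if i < len(layers) - 1:              # keep the output layer linear
                h = np.tanh(h)
        return h.ravel()

    return f

f = sample_function_from_prior()
print(np.round(f(np.linspace(-3, 3, 7)), 3))     # one draw from this prior
```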
What about the prior on weights? Uh, I don’t know, a zero-mean, identity-covariance Gaussian? Our best evidence says that this sucks.
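For what it’s worth, here is one very basic way a literal identity-covariance Gaussian misbehaves (a toy demonstration of mine; the empirical evidence being alluded to covers subtler failures than this): without a 1/fan-in rescaling, the typical pre-activation grows with width, so the prior concentrates on ever more extreme functions as the network gets wider.

```python
import numpy as np

# Toy demonstration: with literal N(0, I) weights the size of a pre-activation
# w . x grows like sqrt(fan_in), whereas rescaling the variance by 1/fan_in
# keeps it O(1) regardless of width.
rng = np.random.default_rng(0)

for fan_in in (16, 256, 4096):
    x = rng.normal(size=fan_in)            # one input
    w_naive = rng.normal(size=fan_in)      # N(0, I) weights
    w_scaled = w_naive / np.sqrt(fan_in)   # N(0, I / fan_in) weights
    print(f"fan_in {fan_in:5d}: |w.x| naive = {abs(w_naive @ x):7.1f}, "
          f"scaled = {abs(w_scaled @ x):5.2f}")
```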
At this point you’ve made so many choices, all of which have to be informed by what empirically works well, that you end up with a rather strange Bayesian reasoner. And you still haven’t fully specified your prior distribution.
This seems false if you’re interacting with a computable universe and don’t need to model yourself or copies of yourself. Computability of the prior also seems irrelevant if I have infinite compute. Therefore, in this prediction task, I don’t see the problem with just using the first thing you mentioned (the Solomonoff prior).
Reasonable people disagree. Why should I care about the “limit of large data” instead of finite-data performance?
1. Logical/mathematical beliefs — e.g. “Is Fermat’s Last Theorem true?”
2. Meta-beliefs — e.g. “Do I believe that I will die one day?”
3. Beliefs about the outcome space itself — e.g. “Am I conflating these two outcomes?”
4. Indexical beliefs — e.g. “Am I the left clone or the right clone?”
5. Irrational beliefs — e.g. the conjunction fallacy (spelled out just after this list).
6. Etc.
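To spell out why the last concrete item resists the standard treatment (my unpacking, not the original wording): for any genuine probability distribution,

$$P(A \wedge B) \;=\; P(A)\,P(B \mid A) \;\le\; P(A),$$

so an agent who rates “Linda is a bank teller and a feminist” as more probable than “Linda is a bank teller” (as people reliably do in the classic conjunction-fallacy experiments) is in a belief state that no single probability distribution can represent as-is.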
Of course, you can describe anything with some probability distribution, but these are cases where the standard Bayesian approach to modelling belief-states needs to be amended somewhat.
1-4 seem to go away if I don’t care about self-knowledge and have infinite compute. 5 doesn’t seem like a problem to me: if there is a best reasoning system, it should not make mistakes. Showing that a system can’t make mistakes may show that it’s not what humans use, but it shouldn’t be classified as a problem.