Thanks for the TLDR. I’m having trouble understanding a couple of things:
if you have a large model and/or lots of data and/or lots of variables, then everything in Jaynes should still basically work.
So we can only rely on the plausibility interpretation asymptotically, i.e. as the number of data points → ∞ or as the number of variables → ∞?
(And in practice, everything in Jaynes still works very well even when things are small, the theoretical guarantees just aren’t as strong.)
What do you mean by “works very well”? Are the plausibility assignments made using Jaynes-Cox probability theory at least approximately correct when “things are small”?
What do you think about the alternative axiomatisations that have been proposed for patching up Jaynes-Cox probability theory?
I currently consider the coherence theorems a weaker foundation for probability than information theory or Cox’s theorem; they have much more serious loopholes (see the comments on Yudkowsky’s piece for more on that).
I just want to do some machine learning! Is there an interpretation of probability that doesn’t have inconsistencies, loopholes, etc. and is always reliable—not just under certain conditions like “when there are enough bits of randomness”? Is there an interpretation with these properties:
The interpretation should be justified. I was an advocate of the Jaynes-Cox interpretation for so long because I thought Cox’s theorem proved that probabilities can be interpreted as plausibilities, but evidently that isn’t the case (at least not in general).
The interpretation should be useful. Probability theory as extended logic can be used very naturally to solve problems.
I just want to do some machine learning! Is there an interpretation of probability that doesn’t have inconsistencies, loopholes, etc. and is always reliable—not just under certain conditions like “when there are enough bits of randomness”
I sympathize.
Most important thing: when I say it “works very well even when things are small”, I mean that I personally have used Jaynes-style probability on prediction problems with very small data and results have generally made sense, and lots of other people have too. There’s empirical evidence that it works.
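For a sense of what a very-small-data prediction looks like in this style, here is a minimal sketch using Laplace's rule of succession. The example is an illustration chosen here, not one taken from the reply above; it assumes a uniform prior over an unknown success rate, and `rule_of_succession` is a hypothetical helper name.

```python
from fractions import Fraction


def rule_of_succession(successes, trials):
    """Probability that the next trial succeeds, given the observed counts.

    Assumption (not from the discussion): a uniform Beta(1, 1) prior over
    the unknown success rate. Under that prior the posterior predictive
    probability is (successes + 1) / (trials + 2).
    """
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# With no data at all the prediction is the prior mean; with a handful of
# observations it moves toward the observed frequency without hitting 0 or 1.
for s, n in [(0, 0), (2, 3), (3, 3)]:
    print(f"{s} successes in {n} trials -> {rule_of_succession(s, n)}")
```

Even with three observations the output is a usable prediction rather than an undefined or extreme one, which is roughly what "results have generally made sense" looks like in practice.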
So we can only rely on the plausibility interpretation asymptotically, i.e. as the number of data points → ∞ or as the number of variables → ∞?
No; smoothness assumptions should be able to substitute for infinite limits.
Here’s a conceptual analogy: suppose we’re trying to estimate some continuous function on the interval [0, 1), and we have (noiseless) measurements of the function at grid-points [0, 1/2n, 2/2n, 3/2n, ..., (2n−1)/2n]. Problem is, we have no idea what the function does between those points; it could go all over the place! Two possible ways to get around this:
Take a limit as n→∞, so that our “grid” covers the whole interval.
Use a smoothness assumption: if we assume some bound on how much the function varies within the little window between grid-points, then we can get reasonably-tight estimates of the function-values between grid points.
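A minimal sketch of the second option, assuming the smoothness bound takes the form of a known Lipschitz constant (|f(a) − f(b)| ≤ L·|a − b|) and, for simplicity, a uniform grid of n points k/n on [0, 1). Both are illustrative assumptions, and `uniform_grid`, `lipschitz_bounds` and the test function f(t) = t² are hypothetical choices, not part of the analogy as stated.

```python
def uniform_grid(n):
    """Grid points 0, 1/n, ..., (n-1)/n on [0, 1) (simplifying assumption)."""
    return [k / n for k in range(n)]


def lipschitz_bounds(xs, ys, lipschitz, x):
    """Bracket f(x) using noiseless samples ys[i] = f(xs[i]).

    Assumes |f(a) - f(b)| <= lipschitz * |a - b| for all a, b.
    Each sample confines f(x) to a band around its own value; the
    tightest bracket is the intersection of all those bands.
    """
    lower = max(y - lipschitz * abs(x - xk) for xk, y in zip(xs, ys))
    upper = min(y + lipschitz * abs(x - xk) for xk, y in zip(xs, ys))
    return lower, upper


def f(t):
    # Hypothetical "unknown" function; its slope is at most 2 on [0, 1).
    return t * t


# The bracket around f(0.3) tightens as the grid gets finer, but it is
# already finite and usable with only a few grid points.
for n in (4, 16, 64):
    xs = uniform_grid(n)
    ys = [f(t) for t in xs]
    lower, upper = lipschitz_bounds(xs, ys, 2.0, 0.3)
    print(n, lower, upper, upper - lower)
```

The point of the sketch is that no limit is taken anywhere: a finite grid plus a finite-scale bound on variation is enough to say something definite about the values between grid points.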
I haven’t looked into the details, but the Cox loophole should be basically similar to this. The whole issue is that we have “too few grid-points” in e.g. Halpern’s example, so smoothness assumptions should be able to patch over that problem, at least approximately (which is all we need in practice).
A different way to frame it: we should be able to weaken the need for “lots of grid-points” by instead strengthening the continuity assumption to impose smoothness bounds even at non-infinitesimal distances.
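One way to write down the difference between the two assumptions, for concreteness. The Lipschitz form and the symbols $K$ (the bound) and $h$ (the grid spacing) are chosen here for illustration; other smoothness bounds would serve the same purpose.

$$
\text{continuity at } x:\quad \forall \varepsilon > 0\ \ \exists \delta > 0:\quad |x - y| < \delta \implies |f(x) - f(y)| < \varepsilon
$$

$$
\text{Lipschitz bound:}\quad |f(x) - f(y)| \le K\,|x - y| \quad \text{for all } x, y \in [0, 1)
$$

Continuity alone only constrains $f$ at distances below some unknown $\delta$, which may be far smaller than the grid spacing. The Lipschitz bound applies at every distance, so if every $x$ lies within $h$ of some grid point $x_k$, then

$$
|f(x) - f(x_k)| \le K\,|x - x_k| \le K h
$$

holds for finite $h$, with no limit $h \to 0$ required.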
Note that this is the move we almost always make when dealing with “continuous functions” in the real world. We almost never have an actually-infinitely-fine “grid”; we only have some finite precision, and the function doesn’t “move around too much” at finer scales. In general, when using math about “continuous functions” in the real world, we’re almost always substituting a smoothness assumption for the infinite limit.
What do you think about the alternative axiomatisations that have been proposed for patching up Jaynes-Cox probability theory?
I haven’t looked into these, but I expect they’re generally pretty similar for most practical purposes.