Hmm, thanks. Seems similar to my description above, though as far as I can tell it doesn’t deal with my criticisms. It is rather evasive when it comes to the question of what status models have in Bayesian calculations.
I am curious; what is the general LessWrong philosophy about what truth “is”? Personally I so far lean towards accepting an operational subjective Bayesian definition, i.e. the truth of a statement is defined only insofar as we agree on some (in principle) operational procedure for determining its truth; that is, we have to agree on what observations make it true or false.
For example “it will rain in Melbourne tomorrow” is true if we see it raining in Melbourne tomorrow (trivial, but also means that the truth of the statement doesn’t depend on rain being “real”, or just a construction of Descartes’ evil demon or the matrix, or a dream, or even a hallucination). It is also a bit disturbing because the truth of “the local speed of light is a constant in all reference frames” can never be determined in such a way. We could go to something like Popper’s truthlikeness, but then standard Bayesianism gets very confusing, since we then have to worry about the probability that a statement has a certain level of “truthlikeness”, which is a little mysterious. Truthlikeness is nice in how it relates to the map-territory analogy though.
I am inclined to think that standard Bayesian-style statements about operationally defined things based on our “maps” make sense, i.e. “If I go and measure how long it takes light to travel from the Earth to Mars, the result will be proportional to c” (with this being influenced by the abstraction that is general relativity), but it still remains unclear to me precisely what this means in terms of Bayes’ theorem: i.e. the probability P(“measure c” | “general relativity”) implies that P(“general relativity”) makes sense somehow, though the operational criteria cannot be where its meaning comes from. In addition we must somehow account for the fact that “general relativity” is strictly false, in the “all models are wrong” sense, so we need to somehow rejig that proposition into something that might actually be true, since it makes no sense to condition our beliefs on things we know to be false.
I suppose we might be able to imagine some kind of super-representation theorem, in the style of de Finetti, in which we show that degrees of belief in operational statements can be represented as the model average of the predictions from all computable theories, hoping to provide an operational basis for Solomonoff induction, but actually I am still not 100% sure what de Finetti’s usual representation theorem really means. We can behave “as if” we had degrees of belief in these models weighted by some prior? Huh? Does this mean we don’t really have such degrees of belief in models but they are a convenient fiction? I am very unclear on the interpretation here.
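For reference, the version of the theorem I have in mind is the standard one for an infinite exchangeable sequence of 0/1 observables (quoted from memory, so treat it as a sketch rather than the precise statement):

```latex
% de Finetti's representation theorem for an infinite exchangeable sequence
% of 0/1 observations X_1, X_2, ... (standard textbook form, from memory):
\[
  P(X_1 = x_1, \ldots, X_n = x_n)
    = \int_0^1 \theta^{k} (1-\theta)^{\,n-k} \, d\mu(\theta),
  \qquad k = \sum_{i=1}^n x_i ,
\]
% for some probability measure \mu on [0,1].
```

As far as I understand it, the “as if” reading is just that coherent exchangeable beliefs about the observables are mathematically indistinguishable from beliefs built by putting a prior μ on a chance parameter θ; the theorem itself stays silent on whether θ corresponds to anything real, which is exactly the interpretational question I am confused about.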
The map-territory analogy does seem correct to me, but I find it hard to reconstruct ordinary Bayesian-style statements via this kind of thinking...
Lol that is a nice story in that link, but it isn’t a Dutch book. The bet in it isn’t set up to measure subjective probability either, so I don’t really see what the lesson in it is for logical probability.
Say that instead of the digits of pi, we were betting on the contents of some boxes. For concreteness let there be three boxes, one of which contains a prize. Say also that you have looked inside the boxes and know exactly where the prize is. For me, I have some subjective probability P( X_i | I_mine ) that the prize is inside box i. For you, all your subjective probabilities are either zero or one, since you know perfectly well where the prize is. However, if my beliefs about where the prize is follow the probability calculus correctly, you still cannot Dutch book me, even though you know where the prize is and I don’t.
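For concreteness, here is a quick brute-force check of that claim (the particular probabilities and stakes are just numbers I made up for illustration):

```python
import itertools

# My subjective P(prize in box i); the informed opponent's probabilities are all 0 or 1.
p = [0.5, 0.3, 0.2]
# Stakes I might take on each box: positive = I buy the bet, negative = I sell it.
stakes = [-2, -1, 0, 1, 2]

def my_net_payoff(stake_vec, prize_box):
    """My net payoff if the prize is in prize_box, with each bet priced at p[i] per unit stake."""
    total = 0.0
    for i, s in enumerate(stake_vec):
        total -= p[i] * s            # price I pay (or receive, if I sold the bet)
        if i == prize_box:
            total += s               # payout on box i if it holds the prize
    return total

# Look for a combination of bets, all priced fairly by my lights, that loses
# for me no matter which box holds the prize (i.e. a Dutch book against me).
sure_losses = [sv for sv in itertools.product(stakes, repeat=3)
               if all(my_net_payoff(sv, k) < -1e-9 for k in range(3))]
print(sure_losses)   # [] : no such combination exists if my probabilities are coherent
```

Every individual bet has zero expected value under my probabilities, so no package of them can have a strictly negative payoff in every world; the opponent who knows where the prize is can of course profit from me on any particular occasion, but there is no set of bets I would accept that loses for me in every possible world, which is what a Dutch book requires.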
So, how is the scenario about the digits of pi different to this? Do you have some example of an actual Dutch book that I would accept if I were to allow logical uncertainty?
edit:
Ok well I thought of what seems to be a typical Dutch book scenario, but it has made me yet more confused about what is special about the logical uncertainty case. So, let me present two scenarios, and I wonder if you can tell me what the difference is:
Consider two propositions, A and B. Let it be the case that A->B. However, say that we do not realise this, and say we assign the following probabilities to A and B:
P(A) = 0.5
P(B) = 0.5
P(B|A) = P(B)
P(A & B) = 0.25
indicating that we think A and B are independent. Based on these probabilities, we should accept the following arrangement of bets:
Sell bet for $0.50 that A is false, payoff $1 if correct
Sell bet for $0.25 that A & B are both true, payoff $1 if correct
The expected amount we must pay out is 0.5 × $1 + 0.25 × $1 = $0.75, which is how much we are selling the bets for, so everything seems fair to us.
Someone who understands that A->B will happily buy these bets from us, since he knows that “not A” and “A & B” are actually equivalent to “not A” and “A”, i.e. he knows P(not A) + P(A & B) = 1, so he wins $1 from us no matter what is the case, making a profit of $0.25. So that seems to show that we are being incoherent if we don’t know that A->B.
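Just to spell the arithmetic out (nothing here beyond the numbers already given above), here is the seller’s position enumerated over the worlds consistent with A->B:

```python
# We are the seller of the two bets described above.
def seller_net(a, b):
    income = 0.50 + 0.25                      # prices of the two bets we sold
    payout = (1.0 if not a else 0.0)          # bet paying $1 if A is false
    payout += (1.0 if (a and b) else 0.0)     # bet paying $1 if A & B are both true
    return income - payout

# Worlds consistent with A->B, i.e. excluding "A and not B":
for a, b in [(True, True), (False, True), (False, False)]:
    print(a, b, seller_net(a, b))             # -0.25 in every case: a sure loss for us
```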
But now consider the following scenario: instead of the logical relation A->B, say that our opponent just has some extra empirical information D that we lack, such that for him P(B|A,D) = 1. He would then still say that
P(not A | D) + P(A & B | D) = P(not A | D) + P(B|A,D)*P(A|D) = P(not A|D) + P(A|D) = 1
so that we, who do not know D, could still be screwed by the same kind of trade as in the first example. But then, this is sort of obviously possible, since having more information than your opponent should give you a betting advantage. But both situations seem equivalently bad for us, so why are we being incoherent in the first example, but not in the second? Or am I still missing something?
That sounds to me more like an argument for needing lower p-values, not higher ones. If there are many confounding factors, you need a higher threshold of evidence for claiming that you are seeing a real effect.
Physicists need low p-values for a different reason, namely that they do very large numbers of statistical tests. If you choose p=0.05 as your threshold then it means that you are going to be claiming a false detection at least one time in twenty (roughly speaking), so if physicists did this they would be claiming false detections every other day and their credibility would plummet like a rock.
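A rough way to see why (my own toy numbers, assuming independent tests in which every null hypothesis happens to be true):

```python
# Probability of at least one false detection across n independent tests,
# each using a p = 0.05 threshold, when all the null hypotheses are true.
threshold = 0.05
for n_tests in [1, 20, 100, 1000]:
    p_false_detection = 1 - (1 - threshold) ** n_tests
    print(n_tests, round(p_false_detection, 3))
# 1 -> 0.05, 20 -> 0.642, 100 -> 0.994, 1000 -> 1.0
```

With hundreds of tests, a 0.05 threshold makes spurious “detections” all but certain, which is why much stricter thresholds get used.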
Is there any more straightforward way to see the problem? I argued with you about this for a while and I think you convinced me, but it is still a little foggy. If there is a consistency problem, surely this means that we must be vulnerable to Dutch books, doesn’t it? I.e. they would not seem to be Dutch books to us, with our limited resources, but a superior intelligence would know that they were and would use them to con us out of utility. Do you know of some argument like this?
Very well, then I will wait for the next entry. But I thought the fact that we were explicitly discussing things the robot could not compute made it clear that resources were limited. There is clearly no such thing as logical uncertainty for the magic logic god of the idealised case.
No we aren’t, we’re discussing a robot with finite resources. I obviously agree that an omnipotent god of logic can skip these problems.
It was your example, not mine. But you made the contradictory postulate that P(“wet outside”|“rain”)=1 follows from the robot’s prior knowledge and the probability axioms, and simultaneously that the robot was unable to compute this. To correct this I alter the robot’s probabilities such that P(“wet outside”|“rain”)=0.5 until such time as it has obtained a proof that “rain” correlates 100% with “wet outside”. Of course the axioms don’t determine this; it is part of the robot’s prior, which is not determined by any axioms.
You have neither convinced me nor shown me that this violates Cox’s theorem. I admit I have not tried to follow the proof of this theorem myself, but my understanding was that the requirement you speak of is that the probabilistic logic reproduces classical logic in the limit of certainty. Here, the robot is not in the limit of certainty because it cannot compute the required proof. So we should not expect to get the classical logic until updating on the proof and achieving said certainty.
You haven’t been very specific about what you think I’m doing incorrectly, so it is kind of hard to figure out what you are objecting to. I corrected your example to what I think it should be so that it satisfies the product rule; where’s the problem? How do you propose that the robot can possibly set P(“wet outside”|“rain”)=1 when it can’t do the calculation?
Ok sure, so you can go through my reasoning leaving out the implication symbol, but retaining the dependence on the proof “p”, and it all works out the same. The point is only that the robot doesn’t know that A->B, therefore it doesn’t set P(B|A)=1 either.
You had “Suppose our robot knows that P(wet outside | raining) = 1. And it observes that it’s raining, so P(rain)=1. But it’s having trouble figuring out whether it’s wet outside within its time limit, so it just gives up and says P(wet outside)=0.5. Has it violated the product rule? Yes. P(wet outside) >= P(wet outside and raining) = P(wet outside | rain) * P(rain) = 1.”
But you say it is doing P(wet outside)=0.5 as an approximation. This isn’t true though, because it knows that it is raining, so it is setting P(wet outside|rain) = 0.5, which was the crux of my calculation anyway. Therefore when it calculates P(wet outside and raining) = P(wet outside | rain) * P(rain) it gets the answer 0.5, not 1, so it is still being consistent.
Hmm this does not feel the same as what I am suggesting.
Let me map my scenario onto yours:
A = “raining”
B = “wet outside”
A->B = “It will be wet outside if it is raining”
The robot does not know P(“wet outside” | “raining”) = 1. It only knows P(“wet outside” | “raining”, “raining->wet outside”) = 1. It observes that it is raining, so we’ll condition everything on “raining”, taking it as true.
We need some priors. Let P(“wet outside”) = 0.5. We also need a prior for “raining->wet outside”, let that be 0.5 as well. From this it follows that
P(“wet outside” | “raining”)
= P(“wet outside” | “raining”, “raining->wet outside”) P(“raining->wet outside” | “raining”)
+ P(“wet outside” | “raining”, not “raining->wet outside”) P(not “raining->wet outside” | “raining”)
= P(“raining->wet outside” | “raining”)
= P(“raining->wet outside”)
= 0.5
according to our priors [the first and second equalities are the same as in my first post; the third equality follows since whether or not it is “raining” is not relevant for figuring out whether “raining->wet outside”].
So the product rule is not violated.
P(“wet outside”) >= P(“wet outside” and “raining”) = P(“wet outside” | “raining”) P(“raining”) = 0.5
Where the inequality is actually an equality because our prior was P(“wet outside”) = 0.5. Once the proof p that “raining->wet outside” is obtained, we can update this to
P(“wet outside” | p) >= P(“wet outside” and “raining” | p) = P(“wet outside” | “raining”, p) P(“raining” | p) = 1
But there is still no product rule violation because
P(“wet outside” | p)
= P(“wet outside” | “raining”, p) P(“raining” | p) + P(“wet outside” | not “raining”, p) P(not “raining” | p)
= P(“wet outside” | “raining”, p) P(“raining” | p)
= 1.
In a nutshell: you need three pieces of information to apply this classical chain of reasoning: A, B, and A->B. All three of these propositions should have priors. Then everything seems fine to me. It seems to me you are neglecting the proposition “A->B”, or rather assuming its truth value to be known, when we are explicitly saying that the robot does not know this.
edit: I just realised that I was lucky that my first inequality worked out; I assumed I was free to choose any prior for P(“wet outside”), but it turns out I am not. My priors for “raining” and “raining->wet outside” determine the corresponding prior for “wet outside”, in order to be compatible with the product rule. I just happened to choose the correct one by accident.
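As a sanity check on all of the above, here is a small enumeration of the robot’s joint distribution (my own sketch; the only assumptions are the ones already stated: “raining” has been observed, the prior on “raining->wet outside” is 0.5, and the implication is read as the material conditional, so its negation means “raining and not wet outside”):

```python
from itertools import product

def weight(rain, implication, wet):
    """Unnormalised probability of one world (raining, raining->wet outside, wet outside)."""
    if not rain:
        return 0.0                           # "raining" has been observed
    if implication != ((not rain) or wet):   # implication read as the material conditional
        return 0.0
    return 1.0                               # uniform weight over the remaining worlds,
                                             # which gives the implication probability 0.5

worlds = list(product([True, False], repeat=3))
Z = sum(weight(*w) for w in worlds)

def prob(pred):
    return sum(weight(*w) for w in worlds if pred(*w)) / Z

p_rain           = prob(lambda r, i, w: r)        # 1.0
p_wet            = prob(lambda r, i, w: w)        # 0.5, i.e. the prior for "wet outside" is forced
p_wet_and_rain   = prob(lambda r, i, w: w and r)  # 0.5
p_wet_given_rain = p_wet_and_rain / p_rain        # 0.5

print(p_wet, p_wet_given_rain)
print(abs(p_wet_and_rain - p_wet_given_rain * p_rain) < 1e-12)   # product rule holds
```

The uniform weight over the two surviving worlds is just one concrete prior consistent with the numbers above; the point is only that such a prior exists and satisfies the product rule without the robot ever needing the proof.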
But it turns out that there is one true probability distribution over mathematical statements, given the axioms. The right distribution is obtained by straightforward application of the product rule—never mind that it takes 4^^^3 steps—and if you deviate from the right distribution that means you violate the product rule at some point.
This does not seem right to me. I feel like you are sneakily trying to condition all of the robot’s probabilities on mathematical proofs that it does not have a priori. E.g. consider A, A->B, therefore B. To learn that P(A->B)=1, the robot has to do a big calculation to obtain the proof. After this, it can conclude that P(B|A,A->B)=1. But before it has the proof, it should still have some P(B|A)!=1.
Sure, it seems tempting to call the probabilities you would have after obtaining all the proofs of everything the “true” probabilities, but to me it doesn’t actually seem different to the claim that “after I roll my dice an infinity of times, I will know the ‘true’ probability of rolling a 1”. I should still have some beliefs about a one being rolled before I have observed vast numbers of rolls.
In other words I suggest that proof of mathematical relationships should be treated exactly the same as any other data/evidence.
edit: in fact surely one has to consider this so that the robot can incorporate the cost of computing the proof into its loss function, in order to decide if it should bother doing it or not. Knowing the answer for certain may still not be worth the time it takes (not to mention that even after computing the proof the robot may still not have total confidence in it; if it is a really long proof, the probability that cosmic rays have caused lots of bit-flips to mess up the logic may become significant). If the robot knows it cannot ever get the answer with sufficient confidence within the given time constraints, it must choose an action which accounts for this. And the logic it uses should be just the same as how it knows when to stop rolling the dice.
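To make the shape of that trade-off concrete, here is an entirely made-up toy calculation (all the numbers are hypothetical, and I assume for simplicity that the proof would settle the question completely):

```python
# Should the robot spend resources computing the proof before acting?
p_correct_without_proof = 0.5    # current confidence that acting now picks the right action
u_right, u_wrong = 10.0, -10.0   # utilities of acting on a correct / incorrect conclusion
proof_cost = 8.0                 # utility equivalent of the time spent computing the proof

eu_act_now     = p_correct_without_proof * u_right + (1 - p_correct_without_proof) * u_wrong
eu_prove_first = u_right - proof_cost    # assumed: the proof guarantees the right action

print(eu_act_now, eu_prove_first)        # 0.0 vs 2.0: here the proof is worth computing
# With proof_cost = 12.0 the ordering flips and the robot should act on its current beliefs.
```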
edit2: I realised I was a little sloppy above; let me make it clearer here:
The robot knows P(B|A,A->B)=1 a priori. But it does not know “A->B” is true a priori. It therefore calculates
P(B|A) = P(B|A,A->B) P(A->B|A) + P(B|A,not A->B) P(not A->B|A) = P(A->B|A)
After it obtains proof that “A->B”, call this p, we have P(A->B|A,p) = 1, so
P(B|A,p) = P(B|A,A->B,p) P(A->B|A,p) + P(B|A,not A->B,p) P(not A->B|A,p)
collapses to
P(B|A,p) = P(B|A,A->B,p) = P(B|A,A->B) = 1
But I don’t think it is reasonable to skip straight to this final statement, unless the cost of obtaining p is negligible.
edit3: If this somehow violates Savage’s or Cox’s theorems I’d like to know why :).
Perhaps, though, you could argue it differently. I have been trying to understand so-called “operational” subjective statistical methods recently (as advocated by Frank Lad and his friends), and he insists on only calling a thing a [meaningful, I guess] “quantity” when there is some well-defined operational procedure for measuring what it is. For him, “measuring” does not rely on a model; he is referring to reading numbers off some device or other, I think. I don’t quite understand him yet, since it seems to me that the numbers reported by devices all rely on some model or other to define them, but maybe one can argue their way out of this...
Thanks, this seems interesting. It is pretty radical; he is very insistent on the idea that for all ‘quantities’ about which we want to reason there must be some operational procedure we can follow in order to find out what they are. I don’t know what this means for the ontological status of physical principles, models, etc., but I can at least see the naive appeal… it makes it hard to understand why a model could ever have the power to predict new things we have never seen before though, like Higgs bosons...
An example of a “true number” is mass. We can measure the mass of a person or a car, and we use these values in engineering all the time. An example of a “fake number” is utility. I’ve never seen a concrete utility value used anywhere, though I always hear about nice mathematical laws that it must obey.
It is interesting that you choose mass as your prototypical “true” number. You say we can “measure” the mass of a person or car. This is true in the sense that we have a complex physical model of reality, and in one of the most superficial levels of this model (Newtonian mechanics) there exist some abstract numbers which characterise the motions of “objects” in response to “forces”. So “measuring” mass seems to only mean that we collect some data, fit this Newtonian model to that data, and extract relatively precise values for this parameter we call “mass”.
Most of your examples of “fake” numbers seem to me to be definable in exactly analogous terms. Your main gripe seems to be that different people try to use the same word to describe parameters in different models, or perhaps that there do not even exist mathematical models for some of them; do you agree? To use a fun phrase I saw recently, the problem is that we are wasting time with “linguistic warfare” when we should be busy building better models?
Sure, I don’t want to suggest we only use the word ‘probability’ for epistemic probabilities (although the world might be a better place if we did...), only that if we use the word to mean different sorts of probabilities in the same sentence, or even whole body of text, without explicit clarification, then it is just asking for confusion.
Hmm, do you know of any good material to learn more about this? I am actually extremely sympathetic to any attempt to rid model parameters of physical meaning; I mean, in an abstract sense I am happy to have degrees of belief about them, but in a prior-elucidation sense I find it extremely difficult to argue about what it is sensible to believe a priori about parameters, particularly given parameterisation-dependence problems.
I am a particle physicist, and a particular problem I have is that parameters in particle physics are not constant; they vary with renormalisation scale (roughly, the energy of the scattering process), so that if I want to argue about what it is a priori reasonable to believe about (say) the mass of the Higgs boson, it matters a very great deal what energy scale I choose to define my prior for the parameters at. If I choose (naively) a flat prior over low-energy values for the Higgs mass, it implies I believe some really special and weird things about the high-scale Higgs mass parameter values (they have to be fine-tuned to the bejesus); while if I believe something more “flat” about the high-scale parameters, it in turn implies something extremely informative about the low-scale values, namely that the Higgs mass should be really heavy (in the Standard Model; this is essentially the hierarchy problem, translated into Bayesian words).
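Setting the physics aside, here is a generic illustration of the parameterisation-dependence problem (the mapping below is hypothetical and has nothing to do with the actual renormalisation-group running; it is just the usual change-of-variables effect):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, size=200_000)   # "uninformative" flat prior on theta
phi = np.sqrt(theta)                          # hypothetical reparameterisation phi = sqrt(theta)

# The implied density on phi is |d theta / d phi| = 2 * phi: strongly weighted
# towards large phi, even though we claimed to be saying nothing informative.
hist, _ = np.histogram(phi, bins=5, range=(0.0, 1.0), density=True)
print(np.round(hist, 2))   # roughly [0.2, 0.6, 1.0, 1.4, 1.8]
```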
Anyway, if I can more directly reason about the physically observable things and detach from the abstract parameters, it might help clarify how one should think about this mess...
Hmm, interesting. I will go and learn more deeply what de Finetti was getting at. It is a little confusing… in this simple case, ok, fine, p can be defined in a straightforward way in terms of the predictive distribution, but in more complicated cases this quickly becomes extremely difficult or impossible. For one thing, a single model with a single set of parameters may describe outcomes of vastly different experiments. E.g. consider Newtonian gravity. Ok, fine, strictly speaking the Newtonian gravity part of the model has to be coupled to various other models to describe specific details of the setup, but in all cases there is a parameter G for the universal gravitation constant. G affects the predictive distributions for all such experiments, so it is pretty hard to see how it could be defined in terms of them, at least in a concrete sense.
Are you referring to de Finetti’s theorem? I can’t say I understand your point. Does it relate to the edit I made shortly before your post, i.e. that given a stochastic model with some parameters, you then have degrees of belief about certain outcomes, some of which may seem almost the same thing as the parameters themselves? I still maintain that the two are quite different: parameters characterise probability distributions, and just in certain cases happen to coincide with conditional degrees of belief. In this ‘beliefs about beliefs’ context, though, it is the parameters we have degrees of belief about; we do not have degrees of belief about the conditional degrees of belief with which said parameters may happen to coincide.
Keynes in his “Treatise on probability” talks a lot about analogies in the sense you use it here, particularly in “part 3: induction and analogy”. You might find it interesting.