Whatever the actual knowledge representation inside our brains looks like, it doesn’t seem like it can be easily translated into the structure of “hypothesis space, logical relations, degrees of belief.”
That strikes me as the main issue when trying to apply Bayesian logic to real-world problems.
How so? The knowledge representation inside our brains probably doesn’t look like real numbers either, but real numbers are still useful for real-world problems. Same with Bayesianism. It’s normative: if Bayes says you should switch doors in the Monty Hall problem, then that’s what you should do, even if your brain says no.
If I understand your objection correctly, it’s one I tried to answer already in my post.
In short: Bayesianism is normative for problems you can actually state in its formalism. This can be used as an argument for at least trying to state problems in its formalism, and I do think that is often a good idea; many of the examples in Jaynes’ book show the value of doing so. But when the information you actually have does not fit the requirements of the formalism, you can only use the formalism if you get more information (costly, sometimes impossible) or forget some of what you know to make the rest fit. I don’t think Bayes normatively tells you to do those kinds of things, or at least arguing that it does would require a type of argument different from the usual Dutch Books etc.
Using the word “brain” there was probably a mistake. This is only about brains insofar as it’s about the knowledge actually available to you in some situation, and the same idea applies to the knowledge available to some robot you are building, or some agent in a hypothetical decision problem (so long as it is a problem with the same property, of not fitting well into the formalism without extra work or forgetting).
Yeah, I’m not saying Bayesianism solves all problems. But point 7b in your post still sounds weird to me. You assume a creature that can’t see all logical consequences of hypotheses—that’s fine. Then you make it realize new facts about logical consequences of hypotheses—that’s also fine. But you also insist that the creature’s updated probabilities must exactly match the original ones, or fit neatly between them. Why is that required? It seems to me that whatever the faults of “strong Bayesianism”, this argument against it doesn’t work.
You assume a creature that can’t see all logical consequences of hypotheses [...] Then you make it realize new facts about logical consequences of hypotheses
This is not quite what is going on in section 7b. The agent isn’t learning any new logical information. For instance, in jadagul’s “US in 2100” example, all of the logical facts involved are things the agent already knows. “ ‘California is a US state in 2100’ implies ‘The US exists in 2100’ ” is not a new fact; it’s something we already knew before running through the exercise.
My argument in 7b is not really about updating—it’s about whether probabilities can adequately capture the agent’s knowledge, even at a single time.
This is in a context (typical of real decisions) where:
the agent knows a huge number of logical facts, because it can correctly interpret hypotheses written in a logically transparent way, like “A and B,” and because it knows lots of things about subsets in the world (like US / California)
but, the agent doesn’t have the time/memory to write down a “map” of every hypothesis connected by these facts (like a sigma-algebra). For example, you can read an arbitrary string of hypotheses “A and B and C and …” and know that this implies “A”, “A and C”, etc., but you don’t have in your mind a giant table containing every such construction.
So the agent can’t assign credences/probabilities simultaneously to every hypothesis on that map. Instead, they have some sort of “credence generator” that can take in a hypothesis and output how plausible it seems, using heuristics. In their raw form, these outputs may not be real numbers (they will have an order, but may not have e.g. a metric).
If we want to use Bayes here, we need to turn these raw credences into probabilities. But remember, the agent knows a lot of logical facts, and via the probability axioms, these all translate to facts relating probabilities to one another. There may not be any mapping from raw credence-generator-output to probabilities that preserves all of these facts, and so the agent’s probabilities will not be consistent.
To be more concrete about the “credence generator”: I find that when I am asked to produce subjective probabilities, I am translating them from internal representations like
Event A feels “very likely”
Event B, which is not logically entailed by A or vice versa, feels “pretty likely”
Event (A and B) feels “pretty likely”
If we demand that these map one-to-one to probabilities in any natural way, this is inconsistent. But I don’t think it’s inconsistent in itself; it just reflects that my heuristics have limited resolution. There isn’t a conjunction fallacy here because I’m not treating these representations as probabilities—but if I decide to do so, then I will have a conjunction fallacy! If I notice this happening, I can “plug the leak” by changing the probabilities, but I will expect to keep seeing new leaks, since I know so many logical facts, and thus there are so many consequences of the probability axioms that can fail to hold. And because I expect this to happen going forward, I am skeptical now that my reported probabilities reflect my actual beliefs—not even approximately, since I expect to keep deriving very wrong things like an event being impossible instead of likely.
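To make the “leak” concrete, here is a toy version of the forced mapping. The numerical scale is something I am inventing purely for illustration; the point is only what the axioms then require.

```python
# Toy illustration (the label -> number scale is invented for illustration).
scale = {"very likely": 0.9, "pretty likely": 0.7}

p_A  = scale["very likely"]    # Event A feels "very likely"
p_B  = scale["pretty likely"]  # Event B feels "pretty likely"
p_AB = scale["pretty likely"]  # Event (A and B) also feels "pretty likely"

# Axiom consequence: P(A and B) <= min(P(A), P(B)).
print(p_AB <= min(p_A, p_B))   # True, but only just: equality holds
# Equality forces P(A | B) = P(A and B) / P(B) = 1, i.e. the mapping claims
# that learning B would make A certain, even though A and B were stipulated
# to be logically unrelated and A merely "very likely".
print(p_AB / p_B)              # 1.0
```

Every further felt judgment adds more constraints like this, and with a coarse scale some of them eventually fail outright; those failures are the leaks.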
None of this is meant to discourage using probability estimates to, say, make more grounded estimates of cost/benefit in real-world decisions. I do find that useful, but I think it is useful for a non-Bayesian reason: even if you don’t demand a universal mapping from raw credences to probabilities, you can get a lot of value out of saying things like “this decision isn’t worth it unless you think P(A) > 97%” and then doing a one-time mapping of that threshold back onto a raw credence. This has a lot of pragmatic value even if you know the mappings will break down if you push them too hard.
If I notice this happening, I can “plug the leak” by changing the probabilities, but I will expect to keep seeing new leaks, since I know so many logical facts, and thus there are so many consequences of the probability axioms that can fail to hold. And because I expect this to happen going forward, I am skeptical now that my reported probabilities reflect my actual beliefs
Hmm, I think your “actual beliefs” are your betting odds at each moment, messy as they are. And what you call “plugging the leaks” seems to be Bayesian updating of your actual beliefs, which should converge for the usual reasons.
For example, if you haven’t thought much about the connection between A, B and (A and B), you could say your feelings about these sentences haven’t yet updated on the connection between them. (Think of it as a prior over sentences in sealed envelopes, ignorant of the sigma algebra structure.) Then you update and get closer to the truth. Does that make sense?
Two comments:
1. You seem to be suggesting that the standard Bayesian framework handles logical uncertainty as a special case. (Here we are not exactly “uncertain” about sentences, but we have to update on their truth from a prior that did not account for it, which amounts to the same thing.) If this were true, the research on handling logical uncertainty through new criteria and constructions would be superfluous. I haven’t actually seen such a proposal laid out in detail, but I believe proposals along these lines have been made and found wanting, so I’ll be skeptical at least until I’m shown the details of such a proposal.
(In particular, this would need to involve some notion of conditional probabilities like P(A | A ⇒ B), and perhaps priors like P(A ⇒ B), which are not a part of any treatment of Bayes I’ve seen.)
2. Even if this sort of thing does work in principle, it doesn’t seem to help in the practical case at hand. We’re now told to update on “noticing” A ⇒ B by using objects like P(A | A ⇒ B), but these too have to be guessed using heuristics (we don’t have a map of them either), so it inherits the same problem it was introduced to solve.
I’m a bit confused by your mention of logical uncertainty. Isn’t plain old probability sufficient for this problem? If A and B are statements about the world, and you have a prior over possible worlds (combinations of truth values for A and B), then probabilities like P(A ⇒ B) or P(A | A ⇒ B) seem well-defined to me. For example, P(A ⇒ B) = P(A and B) + P(not A).
Let’s try to walk through the US and California example. At the start, you feel that A = “California will be a US state in 2100” and B = “US will exist in 2100” both have probability 98% and are independent, because you haven’t thought much about the connection between them. Then you notice that A ⇒ B, so you remove the option “A and not B” from your prior and renormalize, leading to 97.96% for A and 99.96% for B. The probabilities are nudged apart, just like you wanted!
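For concreteness, here is that arithmetic as a small sketch (the prior and the conditioning step are exactly the ones described above):

```python
# Start with independent P(A) = P(B) = 0.98, then condition on A => B by
# deleting the "A and not B" cell of the joint and renormalizing.
p_a, p_b = 0.98, 0.98

joint = {
    (True, True):   p_a * p_b,
    (True, False):  p_a * (1 - p_b),    # ruled out by A => B
    (False, True):  (1 - p_a) * p_b,
    (False, False): (1 - p_a) * (1 - p_b),
}

joint[(True, False)] = 0.0              # condition on the implication
z = sum(joint.values())
posterior = {cell: w / z for cell, w in joint.items()}

p_a_post = posterior[(True, True)] + posterior[(True, False)]
p_b_post = posterior[(True, True)] + posterior[(False, True)]
print(round(p_a_post, 4), round(p_b_post, 4))   # 0.9796 0.9996
```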
Of course you could say these numbers still look fake. The update for B was much stronger than the update for A, what’s up with that? But that’s because our prior was very ignorant to begin with. As we get more data, we’ll converge on the truth. Bayes comes out of this exercise looking pretty good, if you ask me.
Ah, yeah, you’re right that it’s possible to do this. I’m used to thinking in the Kolmogorov picture, and keep forgetting that in the Jaynesian propositional logic picture you can treat material conditionals as contingent facts. In fact, I went through the process of realizing this in a similar argument about the same post a while ago, and then forgot about it in the meantime!
That said, I am not sure what this procedure has to recommend it, besides that it is possible and (technically) Bayesian. The starting prior, with independence, does not really reflect our state of knowledge at any time, even at the time before we have “noticed” the implication(s). For, if we actually write down that prior, we have an entry in every cell of the truth table, and if we inspect each of those cells and think “do I really believe this?”, we cannot answer the question without asking whether we know facts such as A ⇒ B—at which point we notice the implication!
It seems more accurate to say that, before we consider the connection of A to B, those cells are “not even filled in.” The independence prior is not somehow logically agnostic; it assigns a specific probability to the conditional, just as our posterior does, except that in the prior that probability is, wrongly, not one.
Okay, one might say, but can’t this still be a good enough place to start, allowing us to converge eventually? I’m actually unsure about this, because (see below) the logical updates tend to push the probabilities of the “ends” of a logical chain further towards 0 and 1; at any finite time the distribution obeys Cromwell’s Rule, but whether it converges to the truth might depend on the way in which we take the limit over logical and empirical updates (supposing we do arbitrarily many of each type as time goes on).
I got curious about this and wrote some code to do these updates with arbitrary numbers of variables and arbitrary conditionals. What I found is that as we consider longer chains A ⇒ B ⇒ C ⇒ …, the propositions at one end get pushed to 1 or 0, and we don’t need very long chains for this to get extreme. With all starting probabilities set to 0.7 and three variables 0 ⇒ 1 ⇒ 2, the probability of variable 2 is 0.95; with five variables the probability of the last one is 0.99 (see the plot below). With ten variables, the last one reaches 0.99988. We can easily come up with long chains in the California example or similar, and following this procedure would lead us to absurdly extreme confidence in such examples.
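I can’t paste the whole script here, but the core of it is a brute-force enumeration along the following lines (this sketch reproduces the numbers quoted above):

```python
from itertools import product

def chain_update(n, p0):
    """Independent prior P(X_i) = p0 over n propositions, conditioned on the
    chain of material conditionals X_0 => X_1 => ... => X_{n-1} by deleting
    violating worlds and renormalizing.  Returns the posterior marginal of
    the last proposition.  (Brute force, so only suitable for small n.)"""
    total = last_true = 0.0
    for world in product([False, True], repeat=n):
        # A world violates the chain if some X_i is true while X_{i+1} is false.
        if any(world[i] and not world[i + 1] for i in range(n - 1)):
            continue
        weight = 1.0
        for x in world:
            weight *= p0 if x else (1 - p0)
        total += weight
        if world[-1]:
            last_true += weight
    return last_true / total

print(round(chain_update(3, 0.7), 2))    # 0.95
print(round(chain_update(5, 0.7), 2))    # 0.99
print(round(chain_update(10, 0.7), 5))   # 0.99988
print(round(chain_update(10, 0.5), 3))   # 0.909 (starting from 0.5, as in the second plot below)
```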
I’ve also given a second plot below, where all the starting probabilities are 0.5. This shows that the growing confidence does not rely on an initial hunch one way or the other; simply updating on the logical relationships from initial neutrality (plus independences) pushes us to high confidence about the ends of the chain.
Yeah, if the evidence you see (including logical evidence) is filtered by your adversary, but you treat it as coming from an impartial process, you can be made to believe extreme stuff. That problem doesn’t seem to be specific to Bayes, or at least I can’t imagine any other method that would be immune to it.
Here’s a simple model: the adversary flips a coin ten times and reveals some of the results to you, which happen to be all heads. You believe that the choice of which results to reveal is independent of the results themselves, but in fact the adversary only reveals heads. So your beliefs about the coin’s bias are predictably pushed toward heads.
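A quick simulation of that model, assuming for concreteness a fair coin and a uniform Beta(1, 1) prior over the heads-bias:

```python
import random

# The adversary reveals only the heads among each batch of 10 flips; the
# observer wrongly treats the revealed flips as an unfiltered random sample
# and updates a Beta(1, 1) prior on them alone.
random.seed(0)

revealed_heads = 0
revealed_tails = 0                      # stays 0: tails are never shown
for _ in range(1000):                   # repeat the 10-flip game many times
    flips = [random.random() < 0.5 for _ in range(10)]
    revealed_heads += sum(flips)        # only the heads are revealed

# Observer's posterior mean for the bias, Beta(1 + heads, 1 + tails):
posterior_mean = (1 + revealed_heads) / (2 + revealed_heads + revealed_tails)
print(round(posterior_mean, 4))         # very close to 1.0, though the coin is fair
```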
The usual Bayesian answer is that you should have nonzero probability that evidence is revealed adversarially. Then over time that probability will dominate. Similarly in our problem, you should have nonzero probability that someone is coming up with intermediate statements between A and Z and showing you only those, instead of other statements that would appear elsewhere in the graph and temper your beliefs a bit. That makes the model complicated enough that I can’t work it out on a napkin anymore, but I’m pretty sure it’s the only way.
To quote Abram Demski in “All Mathematicians are Trollable”:
The main concern is not so much whether GLS-coherent mathematicians are trollable as whether they are trolling themselves. Vulnerability to an external agent is somewhat concerning, but the existence of misleading proof-orderings brings up the question: are there principles we need to follow when deciding what proofs to look at next, to avoid misleading ourselves?
My concern is not with the dangers of an actual adversary; it’s with the wild oscillations and extreme confidences that can arise even when logical facts arrive in a “fair” way, so long as it is still possible to get unlucky and experience a “clump” of successive observations that push P(A) way up or down.
We should expect such clumps sometimes unless the observation order is somehow specially chosen to discourage them, say via the kind of “principles” Demski wonders about.
One can also prevent observation order from mattering by doing what the Eisenstat prior does: adopt an observation model that does not treat logical observations as reports about some fixed underlying reality (on which learning “B or ~A” would rule out some of the ways A could have been true), but instead treats them as consistency-constrained samples from a fixed distribution. This works as far as it goes, but it is hard to reconcile with common intuitions about how, e.g., P=NP is unlikely because so many “ways it could have been true” have failed (Scott Aaronson has a post about this somewhere, arguing against Lubos Motl, who seems to reason the way the Eisenstat prior does), and more generally with any kind of mathematical intuition — or with the simple fact that the implications of axioms are fixed in advance, not determined dynamically as we observe them. Moreover, I don’t know of any way to (approximately) apply this model in real-world decisions, although maybe someone will come up with one.
This is all to say that I don’t think there is (yet) any standard Bayesian answer to the problem of self-trollability. It’s a serious problem and one at the very edge of current understanding, with only some partial stabs at solutions available.