Bayesian probability as an approximate theory of uncertainty?
Many people believe that Bayesian probability is an exact theory of uncertainty, and other theories are imperfect approximations. In this post I’d like to tentatively argue the opposite: that Bayesian probability is an imperfect approximation of what we want from a theory of uncertainty. This post won’t contain any new results, and is probably very confused anyway.
I agree that Bayesian probability is provably the only correct theory for dealing with a certain idealized kind of uncertainty. But what kinds of uncertainty actually exist in our world, and how closely do they agree with what’s needed for Bayesianism to work?
In a Tegmark Level IV world (thanks to pragmatist for pointing out this assumption), uncertainty seems to be either indexical or logical. When I flip a coin, the information in my mind is either enough or not enough to determine the outcome in advance. If I have enough information—if it’s mathematically possible to determine which way the coin will fall, given the bits of information that I have received—then I have logical uncertainty, which is no different in principle from being uncertain about the trillionth digit of pi. On the other hand, if I don’t have enough information even given infinite mathematical power, it implies that the world must contain copies of me that will see different coinflip outcomes (if there was just one copy, mathematics would be able to pin it down), so I have indexical uncertainty.
The trouble is that both indexical and logical uncertainty are puzzling in their own ways.
With indexical uncertainty, the usual example that breaks probabilistic reasoning is the Absent-Minded Driver problem. When the probability of you being this or that copy depends on the decision you’re about to make, these probabilities are unusable for decision-making. Since Bayesian probability is in large part justified by decision-making, we’re in trouble. And the AMD is not an isolated problem. In many imaginable situations faced by humans or idealized agents, there’s a nonzero chance of returning to the same state of mind in the future, and that chance slightly depends on the current action. To the extent that’s true, Bayesian probability is an imperfect approximation.
With logical uncertainty, the situation is even worse. We don’t have a good theory of how logical uncertainty should work. (Though there have been several attempts, like Benja and Paul’s prior, Manfred’s prior, or my own recent attempt.) Since Bayesian probability is in large part justified by having perfect agreement with logic, it seems likely that the correct theory of logical uncertainty won’t look very Bayesian, because the whole point is to have limited computational resources and only approximate agreement with logic.
Another troubling point is that if Bayesian probability is suspect, the idea of “priors” becomes suspect by association. Our best ideas for decision-making under indexical uncertainty (UDT) and logical uncertainty (priors over theories) involve some kind of priors, or more generally probability distributions, so we might want to reexamine those as well. Though if we interpret a UDT-ish prior as a measure of care rather than belief, maybe the problem goes away...
As an aside, here’s a funny variation of the Absent-Minded Driver problem that I just came up with. Nothing really new, but maybe it will make some features sharper.
You’re invited to take part in an experiment, which will last for 10 days. Every day you’re offered two envelopes, a red one and a blue one. One of the envelopes contains a thousand dollars, the other is empty. You pick one of the envelopes and receive the money. Also, at the beginning of each day you are given an amnesia pill that makes you forget which day it is, what happened before, and how much money you have so far. At the end of the experiment, you go home with the total money received.
The money is distributed between envelopes in this way: on the first day, the money has 60% chance of being in the red envelope. On each subsequent day, the money is in the envelope that you didn’t pick on the first day.
Fun features of this problem:
1) It’s a one-player game where you strictly prefer a randomized strategy to any deterministic one. This is similar to the AMD problem, and impossible if you’re making decisions using Bayesian probability.
2) Your decision affects the contents of the sealed envelopes in front of you. For example, if you pick the red envelope deterministically, it will be empty >90% of the time. Same for the blue one.
3) Even after you decide to pick a certain envelope, the chance of finding money inside depends not just on your decision, but on the decision process that you used. If you used a coinflip, the chance of finding money is higher than if you chose deterministically.
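Here’s a quick Monte Carlo sketch of features 1 and 2 (my own illustration in Python; the function name and sample sizes are arbitrary, not part of the problem statement): a fair coin handily beats either deterministic strategy, and under “always pick red” the red envelope is indeed empty well over 90% of the time.

```python
import random

def avg_winnings(p_red, days=10, trials=200_000):
    """Average total winnings when you pick red with probability p_red each day.
    Because of the amnesia, the strategy can't depend on the day number or on
    anything observed so far."""
    total = 0
    for _ in range(trials):
        picks_red = [random.random() < p_red for _ in range(days)]
        # Day 1: money in red with probability 0.6; on every later day it sits
        # in whichever envelope you didn't pick on day 1.
        money_in_red = [random.random() < 0.6] + [not picks_red[0]] * (days - 1)
        total += 1000 * sum(p == m for p, m in zip(picks_red, money_in_red))
    return total / trials

print(avg_winnings(1.0))  # always red:  about $600 (and red is empty ~94% of the time)
print(avg_winnings(0.0))  # always blue: about $400
print(avg_winnings(0.5))  # fair coin:   about $5,000
```

The optimal mixed strategy does only slightly better than the fair coin; the exact value is worked out further down the thread.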
Having a random-number generator is equivalent to having a certain very restricted kind of memory.
For example, if you have a pseudo-random number generator in a computer, then the generator requires a seed, and this seed cannot be the same every day. The change of the seed from day to day constitutes a trace in the computer of the days’ passing. Therefore, you and the computer, taken together, “remember”, in a certain very restricted sense, the passing of the days. Fortunately, this restricted kind of memory turns out to be just enough to let you do far better than you could have done with no memory at all. (I gave this argument in slightly more detail in this old comment thread.)
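To make that concrete, here’s a toy Python sketch (mine, not from the comment; the seed and the state fingerprint are just for illustration): the generator’s internal state advances every time it is used, so the combined system of amnesiac agent plus generator does carry a trace of how many days have passed.

```python
import random

rng = random.Random(42)                    # seeded once, before the experiment starts
for day in range(1, 4):
    picks_red = rng.random() < 0.5         # the agent's "memoryless" coin flip
    state_tag = hash(rng.getstate()) & 0xFFFF   # short fingerprint of the PRNG state
    print(day, picks_red, hex(state_tag))

# The agent remembers nothing from day to day, but the state fingerprint differs
# on every line: the agent-plus-generator system "remembers", in a very
# restricted sense, how many flips have happened.
```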
So, the presence of a random-number generator is just a weakening of the requirement of complete amnesia. However, given this restricted kind of memory, you are making your decisions in accordance with Bayesian probability theory. [ETA: I misunderstood cousin_it’s point when I wrote that last sentence.]
It seems to me that if you have a coin, your probability distribution on envelopes should still depend on the strategy you adopt, not just on the coin. Are you sure you’re not sneaking in “planning-optimality” somehow? Can you explain in more detail why the decision on each day is separately “action-optimal”?
I think I misunderstood what you meant by “impossible if you’re making decisions using Bayesian probability.” I wasn’t trying to avoid being “planning-optimal”. It is not as though the agent is thinking, “The PRNG just output 0.31. Therefore, this envelope is more likely to contain the money today.”, which I guess is what “action-optimal” reasoning would look like in this case.
When I said that “you are making your decisions in accordance with Bayesian probability theory”, I meant that your choice of plan is based on your beliefs about the distribution of outputs generated by the PRNG. These beliefs, in turn, could be the result of applying Bayesian epistemology to your prior empirical experience with PRNGs.
Yeah. It looks like there’s a discontinuity between using a RNG and having perfect memory. Perfect memory lets us get away with “action-optimal” reasoning, but if it’s even a little imperfect, we need to go “planning-optimal”.
You are mistaken in your conception of memory as it relates to the field of statistics.
“Episodic and semantic memory give rise to two different states of consciousness, autonoetic and noetic, which influence two kinds of subjective experience, remembering and knowing, respectively.[2] Autonoetic consciousness refers to the ability of recovering the episode in which an item originally occurred. In noetic consciousness, an item is familiar but the episode in which it was first encountered is absent and cannot be recollected. Remembering involves retrieval from episodic memory and knowing involves retrieval from semantic memory.[2]
In his SPI model, Tulving stated that encoding into episodic and semantic memory is serial, storage is parallel, and retrieval is independent.[2] By this model, events are first encoded in semantic memory before being encoded in episodic memory; thus, both systems may have an influence on the recognition of the event.[2]
High Threshold Model
The original high-threshold model held that recognition is a probabilistic process. [5] It is assumed that there is some probability that previously studied items will exceed a memory threshold. If an item exceeds the threshold then it is in a discrete memory state. If an item does not exceed the threshold then it is not remembered, but it may still be endorsed as old on the basis of a random guess.[6] According to this model, a test item is either recognized (i.e., it falls above a threshold) or it is not (i.e., it falls below a threshold), with no degrees of recognition occurring between these extremes.[5] Only target items can generate an above-threshold recognition response because only they appeared on the list.[5] The lures, along with any targets that are forgotten, fall below threshold, which means that they generate no memory signal whatsoever. For these items, the participant has the option of declaring them to be new (as a conservative participant might do) or guessing that some of them are old (as a more liberal participant might do).[5] False alarms in this model reflect memory-free guesses that are made to some of the lures.[5] This simple and intuitively appealing model yields the once widely used correction for guessing formula, and it predicts a linear receiver operating characteristic (ROC). An ROC is simply a plot of the hit rate versus the false alarm rate for different levels of bias. [5] A typical ROC is obtained by asking participants to supply confidence ratings for their recognition memory decisions.[5] Several pairs of hit and false alarm rates can then be computed by accumulating ratings from different points on the confidence scale (beginning with the most confident responses). The high-threshold model of recognition memory predicts that a plot of the hit rate versus the false alarm rate (i.e., the ROC) will be linear; it also predicts that the z-ROC will be curvilinear.[5]
Dual-process accounts
The dual-process account states that recognition decisions are based on the processes of recollection and familiarity.[5] Recollection is a conscious, effortful process in which specific details of the context in which an item was encountered are retrieved.[5] Familiarity is a relatively fast, automatic process in which one gets the feeling the item has been encountered before, but the context in which it was encountered is not retrieved.[5] According to this view, remember responses reflect recollections of past experiences and know responses are associated with recognition on the basis of familiarity.[7]
Signal-detection theory
According to this theory, recognition decisions are based on the strength of a memory trace in reference to a certain decision threshold. A memory that exceeds this threshold is perceived as old, and a trace that does not exceed the threshold is perceived as new. According to this theory, remember and know responses are products of different degrees of memory strength. There are two criteria on a decision axis; a point low on the axis is associated with a know decision, and a point high on the axis is associated with a remember decision.[5] If memory strength is high, individuals make a “remember” response, and if memory strength is low, individuals make a “know” response.[5]
Probably the strongest support for the use of signal detection theory in recognition memory came from the analysis of ROCs. An ROC is the function that relates the proportion of correct recognitions (hit rate) to the proportion of incorrect recognitions (false-alarm rate).[8]
Signal-detection theory assumed a preeminent position in the field of recognition memory in large part because its predictions about the shape of the ROC were almost always shown to be more accurate than the predictions of the intuitively plausible high-threshold model. [5] More specifically, the signal-detection model, which assumes that memory strength is a graded phenomenon (not a discrete, probabilistic phenomenon) predicts that the ROC will be curvilinear, and because every recognition memory ROC analyzed between 1958 and 1997 was curvilinear, the high-threshold model was abandoned in favor of signal-detection theory.[5] Although signal-detection theory predicts a curvilinear ROC when the hit rate is plotted against the false alarm rate, it predicts a linear ROC when the hit and false alarm rates are converted to z scores (yielding a z-ROC).[5]
The predictive power of the signal detection model seems to rely on know responses being related to transient feelings of familiarity without conscious recollection, rather than Tulving’s (1985) original definition of know awareness. [9]
Dual-process signal-detection/high-threshold theory
The dual-process signal-detection/high-threshold theory tries to reconcile dual-process theory and signal-detection theory into one main theory. This theory states that recollection is governed by a threshold process, while familiarity is not.[5] Recollection is a high-threshold process (i.e., recollection either occurs or does not occur), whereas familiarity is a continuous variable that is governed by an equal-variance detection model.[5] On a recognition test, item recognition is based on recollection if the target item has exceeded threshold, producing an “old” response.[5] If the target item does not reach threshold, the individual must make an item recognition decision based on familiarity.[5] According to this theory, an individual makes a “remember” response when recollection has occurred. A know response is made when recollection has not occurred, and the individual must decide whether they recognize the target item solely on familiarity.[5] Thus, in this model, the participant is thought to resort to familiarity as a backup process whenever recollection fails to occur.[5]
Distinctiveness/fluency model
In the past, it was suggested that remembering is associated with conceptual processing and knowing is associated with perceptual processing. However, recent studies have reported that there are some conceptual factors that influence knowing and some perceptual factors that influence remembering.[2] Findings suggest that regardless of perceptual or conceptual factors, distinctiveness of processing at encoding is what affects remembering, and fluency of processing is what affects knowing.[2] Remembering is associated with distinctiveness because it is seen as an effortful, consciously controlled process.[2] Knowing, on the other hand, depends on fluency as it is more automatic and reflexive and requires much less effort.[2]”
Solution: I choose red with probability (written out in words and ROT13’d) avargl bar bire bar uhaqerq naq rvtugl.
EDIT: V’z fhecevfrq ubj pybfr guvf vf gb n unys.
I get that too. More generally, if there are n+1 rounds and on the first round the difference in probability between red and blue is z, then the optimal probability for choosing red is 1⁄2 + z/2n. It has to be close to 1⁄2 for large n, because 1⁄2 is optimal for the game where z=0, and over ten rounds the loss from deviating from 1⁄2 after the first round dominates the gain from knowing red is initially favoured.
Sure that’s not 1⁄2 + z/4n?
I think he meant “the difference between the probability of red and 1/2” when he said “the difference in probability between red and blue”.
Er, right, something like that.
That works too.
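For completeness, here’s a quick numerical check (my own sketch, with z taken as the red-minus-blue gap of 0.2 and n = 9, i.e. ten rounds): maximizing the expected winnings over the mixing probability p recovers 1⁄2 + z/4n, which is the same thing as 1⁄2 + z/2n when z is measured as the gap between P(red) and 1⁄2.

```python
# Expected winnings (in units of $1000) when you pick red with probability p,
# with n + 1 rounds and a first-day red-minus-blue gap of z:
#   E(p) = (1/2 + z/2) * p + (1/2 - z/2) * (1 - p) + 2 * n * p * (1 - p)
# Setting dE/dp = z + 2n - 4np to zero gives p* = 1/2 + z/(4n).

n, z = 9, 0.2

def expected(p):
    return (0.5 + z / 2) * p + (0.5 - z / 2) * (1 - p) + 2 * n * p * (1 - p)

p_star = 0.5 + z / (4 * n)
grid_best = max((i / 10_000 for i in range(10_001)), key=expected)
print(p_star, grid_best)                # both about 0.5056
print(expected(p_star), expected(0.5))  # about 5.0006 versus exactly 5.0
```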
I don’t think that UDT is about decision-making under indexical uncertainty. I think that UDT is a clever way to reason without indexical uncertainty.
Suppose that several copies of an agent might exist. “Decision-making under indexical uncertainty” would mean, choosing an action while uncertain about which of these copies “you” might be. Thus, the problem presupposes that “you” are a particular physical instance of the agent.
The UDT approach, in contrast, is to identify “you” with the abstract algorithm responsible for “your” decisions. Since there is only one such abstract algorithm, there are no copies of you, and thus none of the attendant problems of indexical uncertainty. The only uncertainty is the logical uncertainty about how the abstract algorithm’s outputs will control the histories in the multiverse.
That’s a fair point, but I’m not sure it convinces me completely.
Decision-making under Bayesian probability looks like maximizing a certain weighted sum. The weights are probabilities, and you’re supposed to come up with them before making a decision. The AMD problem points out that some of the weights might depend on your decision, so you can’t use them for decision-making.
Decision-making in UDT also looks like maximizing a weighted sum. The weights are “degrees of caring” about different mathematical structures, and you’re supposed to come up with them before making a decision. Are we sure that similar problems can’t arise there?
I may be missing your point. As you’ve written about before, things go haywire when the agent knows too much about its own decisions in advance. Hence hacks like “playing chicken with the universe”.
So, the agent can’t know too much about its own decisions in advance. But is this an example of indexical uncertainty? Or is it (as it seems to me) an example of a kind of logical uncertainty that an agent apparently needs to have? That is, an agent needs to be sufficiently uncertain, or to have uncertainty of some particular kind, about the output of the algorithm that the agent is. But uncertainty about the output of an algorithm requires only logical uncertainty.
I have also been leaning towards the existence of a theory more general than probability theory, based on a few threads of thinking.
One thread is anthropic reasoning, where it is sometimes clear how to make decisions, yet probabilities don’t make sense, and it feels to me that the information available in some anthropic situations just “doesn’t decompose” into probabilities. Stuart Armstrong’s paper on the Sleeping Beauty problem is, I think, valuable and greatly overlooked here.
Another thread is the limited-computation issue. We would all like to have a theory that pins down ideal reasoning, and then work out how to efficiently approximate that theory on a Turing machine as a completely separate problem. My intuition is that things just don’t decompose this way. I think that a complete theory of reasoning will make direct reference to models of computation.
This site has collected quite a repertoire of decision problems that challenge causal decision theory. They all share the following property (including your example in the comment above): in a causal graph containing your decision as a node, there are links from your decision to your payoff that do not go via your actions (for Newcomb-like problems) or that do not go via your observations (anthropic problems). Or in other words, your decisions are not independent of your beliefs about the world. The UDT solution says: “instead of drawing a graph containing you, the physical agent, draw one that contains your abstract decision algorithm, and you will see that the independence between beliefs and decisions is restored!”. This feels to me like a patch rather than a full solution, similar to saying “if your variables are correlated and you don’t know how to deal with correlated distributions, try a linear change of variables—maybe you’ll find one that de-correlates them!”. This only works if you’re lucky enough to find a de-correlating change of variables. An alternate approach would be to work out how to deal with non-independent beliefs/decisions directly.
One thought experiment I like to do is to ask probability theory to justify itself in a non-circular way. For example, let’s say I propose the following Completely Stupid Theory Of Reasoning. In CSTOR, belief states are represented by a large sheet of paper where I write down everything that I have ever observed. What is my belief state at time t, you ask? Why, it is simply the contents of the entire sheet of paper. But what is my belief state about a specific event? Again, the contents of the entire sheet of paper. How does CSTOR update on new evidence? Easy! I simply add a line of writing to the bottom of the sheet. How does CSTOR marginalize? It doesn’t! Marginalization is just for dummies who use probability theory, and, as you can see, CSTOR can do all the things that a theory of reasoning should do without need for silly marginalization.
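If it helps to make the parody concrete, here is CSTOR as literal code (my own sketch; the class and method names are invented): the “belief state” is just the transcript, updating is appending, and nothing in the theory maps the transcript to a bet or a decision.

```python
class CSTOR:
    """Completely Stupid Theory Of Reasoning, taken literally."""

    def __init__(self):
        self.sheet_of_paper = []                 # everything ever observed, verbatim

    def update(self, observation: str):
        self.sheet_of_paper.append(observation)  # "updating on new evidence"

    def belief_about(self, event: str):
        return list(self.sheet_of_paper)         # the belief about *any* event is
                                                 # the contents of the entire sheet

cstor = CSTOR()
cstor.update("the envelope was red")
cstor.update("the envelope was empty")
print(cstor.belief_about("tomorrow's envelope"))  # the whole sheet, unhelpfully
```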
So what really distinguishes CSTOR from probability theory? I think the best non-circular answer is that probability theory gives rise to a specific algorithm for making decisions, where CSTOR doesn’t. So I think we should look at decision making as primary and then figure out how to decompose decision making into some abstract belief representation plus abstract notion of utility, plus some abstract algorithm for making decisions.
Can you try to come up with a situation where that independence is not restored? If we follow the analogy with correlations, it’s always possible to find a linear map that decorrelates variables...
Ha, indeed. I should have made the analogy with finding a linear change of variables such that the result is decomposable into a product of independent distributions—ie if (x,y) is distributed on a narrow band about the unit circle in R^2 then there is no linear change of variables that renders this distribution independent, yet a (nonlinear) change to polar coordinates does give independence.
Perhaps the way to construct a counterexample to UDT is to try to create causal links between your abstract decision algorithm and your payoff of the same nature as the links between your decision and the box contents in e.g. Newcomb’s problem. I haven’t thought this through any further.
L. J. Savage does this in his book “The Foundations of Statistics.” This was mentioned by pragmatist upthread, and is summarised here. This is written in 1954, and so it doesn’t deal with weird LW-style situations, but it does found probability in decision theory.
Just for reference, Wei has pointed out that VNM doesn’t work for indexical uncertainty because the axiom of independence is violated. I guess Savage’s theory fails for the same reason. Maybe it’s worthwhile to figure out what mathematical structures would appear if we dropped the axiom of independence, and if there’s any other axiom that can pin down a unique such structure for LW-style problems. I’m trying to think in that direction now, but it’s difficult.
I don’t get this implication. Are you suppressing some premises here? I am sympathetic to the idea that all non-logical uncertainty is best thought of as indexical, but I usually think about this as indexical uncertainty about which of various logically possible worlds I live in, not indexical uncertainty about who I am in this world.
As an exception to your dichotomy, consider uncertainty about the laws of nature. I very much doubt we actually possess enough information about the world so that if we had infinite mathematical power we could logically deduce the correct laws of nature from our current evidence (although maybe you believe this, in which case, would you also believe it for people in the Stone Age?), but it also doesn’t seem right that there are copies of me in this world living in environments with different laws of nature, at least not if we mean something sufficiently fundamental by “law of nature”.
Maybe by “the world” you mean a Tegmark Level IV multiverse? If that’s the case, it’s probably worth making clear, since that’s definitely not the usual sense of the word.
That’s a good point, thanks! I guess the post assumes Tegmark Level IV, but since I’m uncertain whether Tegmark Level IV is true, that’s definitely a third kind of uncertainty :-) Edited the post.
Would you mind adding links for “Benja and Paul’s prior, Manfred’s prior, or my own recent attempt”?
The first one reproduces the construction in section 5 of Hutter’s paper, I think. The others are described here and here.
I feel like you’re begging the question somewhere in here. You can write a function for the Absent-Minded Driver problem that takes indexical probabilities as an input and outputs correct decisions. That function just doesn’t look like CDT or EDT—so are you writing EDT into your definition of “what decision-making looks like”?
I’m not sure I understand. What function do you have in mind?
Let’s figure it out! Suppose the payoffs are as used here.
Now, the correct answer is to go straight with p=2/3. How is that figured out? Because it maximizes the expected value given by p · (1-p) · 4 + p · p.
Since the probabilities depend on your strategy, we will leave them as formulas: if you go straight with probability p, then P(first crossing) is 1/(1+p), and P(second crossing) is p/(1+p). So the question is, what is the correct mixed strategy to take, in terms of these probabilities and the utilities?
Well, there’s a rather dumb way: you just extract p from the probabilities and plug it into the expected value.
That is, you look at the ratio P(second crossing)/P(first crossing), and then choose a strategy such that that ratio maximizes the expected utility equation. That’s the function.
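Spelled out as code, that function might look like the following sketch (mine; the payoffs 0/4/1 for exiting at the first intersection, exiting at the second, and continuing past both are the ones that give the expected value formula above, and the grid search just stands in for the algebra):

```python
def expected_utility(p):
    # Exit at the first intersection: 0, exit at the second: 4, continue past
    # both: 1, where p is the probability of going straight.
    return 4 * p * (1 - p) + p * p

def decide(prob_first, prob_second):
    """Take the indexical probabilities of being at the first/second crossing
    and return a mixed strategy (probability of going straight)."""
    implied_p = prob_second / prob_first     # recovers p from the ratio (and then goes unused)
    grid = [i / 10_000 for i in range(10_001)]
    return max(grid, key=expected_utility)   # about 2/3, whatever implied_p was

p = 0.4                                  # whatever strategy you happened to walk in with
print(decide(1 / (1 + p), p / (1 + p)))  # prints roughly 0.6667 regardless of p
```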
I see. For any math problem where I can figure out the right answer, I can write a function that receives the phase of the moon as argument and returns the right answer. Okay, I guess you’re right, I do want the function to look like expected utility maximization based on the supplied probabilities, rather than some random formula.
Hardly random—it makes perfect sense that the ratio of probabilities is equal to the parameter of the strategy, and so any maximization of the parameter can be rewritten as a maximization of the ratio of probabilities.
I have also been having suspicions that I might have some issues with standard Bayesian probability. Specifically, I have been trying to see if I can do decision theory without defining probability theory, then define probabilities from decision theory. I will likely share my results in the near future.
Are you familiar with Leonard Savage’s representation theorem? It sounds like what you’re trying to do is pretty similar, so if you’re unaware of Savage’s work you might want to look into it, just to make sure you don’t waste time retreading territory that has already been explored.
Also relevant: David Wallace’s work on recovering the quantum mechanical Born probabilities from decision theory.
Thank you. I have not seen that theorem, and this is very helpful and interesting. It is incredibly similar to what I was doing. I strongly encourage anyone reading this to vote up pragmatist’s comment.
I think most LWers working on these topics are already aware of Savage’s approach. It doesn’t work on AMD-like problems.
Are there any posts describing what goes wrong?
Piccione’s paper, mentioned in Wei’s post on AMD, says:
What about Dutch book arguments, though? Don’t they show that any rule for accepting/rejecting bets that isn’t probability theory will lead you to accept certain losses?
I think you can look up how to do it, actually. I’ve heard of this kind of derivation in other LW comments. Looking it up might be quicker than figuring it out. Either way, I’d like to hear what you find.