I still don’t understand the apparently substantial difference between Frequentist and Bayesian reasoning. The subject was brought up again in a class I just attended—and I was still left with a distinct ”… those… those aren’t different things” feeling.
I am beginning to come to the conclusion that the whole “debate” is a case of Red vs. Blue nonsense. So far, whenever one tries to elaborate on a difference, it is done via some hypothetical anecdote, and said anecdote rarely amounts to anything outside of “Different people sometimes treat uncertainty differently in different situations, depending on the situation.” (Usually by having one’s preferred side make a very reasonable conclusion, and the other side make some absurd leap of pseudo-logic.)
Furthermore, these two things hardly ever seem to have anything to do with the fundamental definition of probability, and have everything to do with the assumed simplicity of a given system.
The whole thing is made more complicated by the debate between frequentist and Bayesian methods in statistics. (It obviously matters which you use even if you don’t care what to believe about “what probability is”, or don’t see a difference.)
This debate is boring and old, people getting work done in ML/stats have long ago moved past it. My suggestion is to find something better to talk about: it’s mostly wankery if people other than ML/stats people are talking.
What is it when it is ML/stats people who are talking? For example, it’s a frequent theme at the blogs of Andrew Gelman and Deborah Mayo, and anyone teaching statistics has to deal with the issues.
I teach statistics and I don’t deal with the debate very much. Have you read the exchange started by Robins/Wasserman’s missing data example here:
https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/
What do you make of it? It is an argument against certain kinds of “Bayesian universality” people talk about (but it’s not really the type of argument folks here have). Here they have a specific technical point to make.
It will take a while to understand it, but by the end of section 3 I was wondering when the assumption that X is a binary string was going to be used. Not at all, so far. The space might as well have been defined as just a set of 2^d arbitrary things. So I anticipate that introducing a smoothness assumption on theta, foreshadowed at this point, won’t help—there is no structure for theta to be smooth with respect to. Surely this is why the only information about X that can be used to estimate Y is π(X)? That is the only information about X that is available, the way the problem is set up.
More when I’ve studied the rest.
The binary thing isn’t important; what’s important is that there are real situations where likelihood-based methods (including Bayes) don’t work well (because, by assumption, the only strong information is about the part of the likelihood we aren’t using in our functional, and the part of the likelihood we are using in our functional is very complicated).
I think my point wasn’t so much the technical specifics of that example, but rather that these are the types of B vs F arguments that actually have something to say, rather than going around and around in circles. I had a rephrase of this example using causal language somewhere on LW (if that will help, not sure if it will).
Robins and Ritov have something of paper length, rather than blog-post length, if you are interested.
Wait, IlyaShipitser—I think you overestimate my knowledge of the field of statistics. From what it sounds like, there’s an actual, quantitative difference between Bayesian and Frequentist methods. That is, in a given situation, the two will come to totally different results. Is this true?
I should have made it more clear that I don’t care about some abstract philosophical difference if said difference doesn’t mean there are different results (because those differences usually come down to a nonsensical distinction [à la free will]). I was under the impression that there is a claim that some interpretation of the philosophy will yield different results—but I was missing it, because everything I’ve been introduced to seems to give the same answer.
Is it true that they’re different methods that actually give different answers?
I think it’s more that there are times when frequentists claim there isn’t an answer. It’s very common for statistical tests to talk about likelihood. The likelihood of a hypothesis given an experimental result is defined as the probability of the result given the hypothesis. If you want to know the probability of the hypothesis, you take the likelihood, multiply it by the prior probability, and normalize. Frequentists deny that there always is a prior probability. As a result, they tend to just use the likelihood as if it were the probability of the hypothesis. Conflating the two is equivalent to the base rate fallacy.
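To make the distinction concrete, here is a toy calculation (the numbers are my own, not from the thread): the likelihood of a hypothesis can be high while its posterior probability is low, once the prior is accounted for.

```python
# A test detects a condition with likelihood P(positive | present) = 0.9,
# false-positive rate P(positive | absent) = 0.1, and a base rate of 1%.
prior = 0.01
p_pos_given_present = 0.9   # the likelihood of "present" given a positive result
p_pos_given_absent = 0.1

# Total probability of observing a positive result (the normalizer).
evidence = p_pos_given_present * prior + p_pos_given_absent * (1 - prior)

# Bayes' rule: posterior = likelihood * prior / evidence.
posterior = p_pos_given_present * prior / evidence
print(round(posterior, 3))  # ~0.083 — far below the likelihood of 0.9
```

Treating the 0.9 likelihood as if it were the probability that the condition is present is exactly the base rate fallacy the comment describes.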
EY believes so.
I think I’m beginning to see the problem for the Bayesian, although I’m not yet sure what the correct response to it is. I have some more or less rambling thoughts about it.
It appears that the Bayesian is being supposed to start from a flat prior over the space of all possible thetas. This is a very large space (all possible assignments of a probability to each of the 2^100000 points), almost all of which consists of thetas which are independent of pi. (ETA: Here I mistakenly took X to be a product of two-point sets {0,1}, when in fact it is a product of unit intervals [0,1]. I don’t think this makes much difference to the argument, though; if it does, it would be best addressed by letting this one stand as is and discussing that case separately.) When theta is independent of pi, it seems to me that the Bayesian would simply take the average of sampled values of Y as an estimate of P(Y=1), and be very likely to get almost the same value as the frequentist. Indirectly observing a few values of theta (through the observed values of Y) gives no information about any other values of theta, because the prior was flat. This is why the likelihood calculated in the blog post contains almost no information about theta.
Here is what seems to me to be a related problem. You will be presented with a series of some number of booleans, say 100. After each one, you are to guess the next. If your prior is a flat distribution over {0,1}^100, your prediction will be 50% each way at every stage, regardless of what the sequence so far has been, because all continuations are equally likely. It is impossible to learn from such a prior, which has built into it the belief that the past cannot predict the future.
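This no-learning property is easy to check by brute force at a smaller size (a sketch of my own construction, using length-10 sequences rather than 100):

```python
from itertools import product

# Flat prior: every length-n binary sequence gets equal weight.
n = 10
sequences = list(product([0, 1], repeat=n))

# Condition on an arbitrary observed prefix, say five ones in a row.
prefix = (1, 1, 1, 1, 1)
consistent = [s for s in sequences if s[:len(prefix)] == prefix]

# Posterior predictive probability that the next boolean is 1.
p_next_is_1 = sum(s[len(prefix)] for s in consistent) / len(consistent)
print(p_next_is_1)  # 0.5 — and it stays 0.5 for any prefix whatsoever
```

Exactly half the consistent continuations put a 1 next, no matter what was observed, so the prior never updates toward a pattern.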
As noted in the blog post, smoothness of theta with respect to e.g. the metric structure of {0,1}^100000 doesn’t help, because a sample of only 1000 from this space is overwhelmingly likely to consist of points that are all at a Manhattan distance of about 50000 from each other. No substantial extrapolation of theta is possible from such a sample unless it is smooth at the scale of the whole space.
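The distance claim can be checked by simulation (my own quick check, not from the post): two uniformly random points of {0,1}^d disagree in roughly d/2 coordinates, with fluctuations on the order of sqrt(d).

```python
import random

random.seed(0)
d = 100_000
x = [random.randint(0, 1) for _ in range(d)]
y = [random.randint(0, 1) for _ in range(d)]

# Manhattan (Hamming) distance between the two random points.
dist = sum(a != b for a, b in zip(x, y))
print(dist)  # close to 50_000, within a few hundred
```

With a sample of only 1000 such points, every pair sits at roughly this distance, so no point is “near” any other in the sense smoothness would need.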
The flat prior over theta seems to be of a similar nature to the flat prior over sequences. If in this sample of 1000 you noticed that when pi was high, the corresponding value of Y, when sampled, was very likely to be 1, and similarly that when pi was low, Y was usually 0 among those rare times it was sampled, you might find it reasonable to conclude that pi and theta were related and use something like the Horvitz-Thompson estimator. But the flat prior over theta does not allow this inference. However many values of theta you have gained some partial information about by sampling Y, they tell you nothing about any other values.
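For reference, here is a hedged sketch (my own toy setup, not the Robins–Ritov construction) of what the Horvitz-Thompson estimator does: it reweights each observed Y by the inverse of its known sampling probability, which keeps the estimate of the mean unbiased even when theta tracks pi.

```python
import random

random.seed(1)
N = 100_000
pi = [random.uniform(0.01, 0.99) for _ in range(N)]    # known sampling probabilities
theta = pi                                             # adversarial case: theta depends on pi
Y = [1 if random.random() < t else 0 for t in theta]
R = [1 if random.random() < p else 0 for p in pi]      # R=1 means Y was observed

# Naive average over observed Y is biased: high-pi units are oversampled,
# and here they also have high theta. Horvitz-Thompson corrects for this.
naive = sum(y for y, r in zip(Y, R) if r) / sum(R)
ht = sum(y * r / p for y, r, p in zip(Y, R, pi)) / N
true_mean = sum(theta) / N
print(round(true_mean, 3), round(naive, 3), round(ht, 3))
```

Note the estimator uses pi directly, not the likelihood of theta, which is the commenters’ point about it not being a likelihood-based procedure.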
My guess so far is that that is a problem with the flat prior over theta. The problem for the Bayesian is to come up with a better one that is capable of seeing a dependency between pi and theta.
Is the Robins and Ritov paper the one cited in the blog post, “Toward a Curse of Dimensionality Appropriate (CODA) Asymptotic Theory for Semi-parametric Models”? I looked at that briefly, only enough to see that their example, though somewhat similar, deals with a relatively low dimensional case (5), which in practical terms counts as high dimensional, and what they describe as a “moderate” sample size of 10000. So that’s rather different from the present example, and I don’t know if anything I just said will be relevant to it.
On reading further in the blog post, I see that a lot of what I said is said more briefly in the comments there, especially comment (4) by Chris Sims:

If theta and pi were independent, we could just throw out the observations where we don’t see Y and use the remaining sample as if there were no “R” variable. So specifying that theta and pi are independent is not a reasonable way to say we have little knowledge. It amounts to saying we are sure the main potential complication in the model is not present, and therefore opens us up to making seriously incorrect inference.
And a flat prior on theta is an assumption that theta and pi are almost certainly independent.
Yes the CODA paper is what I meant.
The right way out is to have a “weird” prior that mirrors frequentist behavior. Which, as the authors point out, is perfectly fine, but why bother? By the way Bayes can’t use Horvitz-Thompson directly because it’s not a likelihood based estimator, I think you have to somehow bake the entire thing into the prior.
The insight that lets you structure your B setup properly here is sort of coming from “outside the problem,” too.
A note on notation - [0,1] with square brackets generally refers to the closed interval between 0 and 1. X is a continuous variable, not a boolean one.
Actually, I should have been using curly brackets, as when I wrote (0,1) I meant the set with two elements, 0 and 1, which is what I had taken X to be a product of copies of, hence my obtaining 50000 as the expected Manhattan distance between any two members. I’ll correct the post to make that clear. I think everything I said would still apply to the continuous case. If it doesn’t, that would be better addressed with a separate comment.
Yeah, I don’t think it makes much difference in high-dimensions. It’s just more natural to talk about smoothness in the continuous case.
What “fundamental definition of probability” are you using?
A quantitative thing that indicates how likely it is for an event to happen.
Let’s say Alice and Bob are in two different rooms and can’t see each other. Alice rolls a 6-sided die and looks at the outcome. Bob doesn’t know the outcome, but knows that the die has been rolled. In your interpretation of the word “probability”, can Bob talk about the probabilities of the different roll outcomes after Alice rolled?
I’m having a hard time answering this question with “yes” or “no”:
The event in question is “Alice rolling a particular number on a 6-sided die.” Bob, not knowing what Alice rolled, can talk about the probabilities associated with rolling a fair die many times, and base whatever decision he has to make on this probability (assuming that she is, in fact, using a fair die). Depending on the assumed complexity of the system (does he know that this is a loaded die?), he could combine a bunch of other probabilities together to increase the chances that his decision is accurate.
Yes… I guess?
(Or, are you referring to something like: If Alice rolled a 5, then there is a 100% chance she rolled a 5?)
Well, the key point here is whether the word “probability” can be applied to things which already happened but you don’t know what exactly happened. You said

A quantitative thing that indicates how likely it is for an event to happen.
which implies that probabilities apply only to the future. The question is whether you can speak of probabilities as lack of knowledge about something which is already “fixed”.
Another issue is that in your definition you just shifted the burden of work to the word “likely”. What does it mean that an event is “likely” or “not likely” to happen?
EDIT: The neighboring comment here raises the same point (using the same type of example!). I wouldn’t have posted this duplicate comment if I had caught this in time.
I’m also confused about the debate.
Isn’t the “thing that hasn’t happened yet” always an anticipated experience? (Even if we use a linguistic shorthand like “the dice roll is 6 with probability .5”.)
Suppose Alice tells Bob she has rolled the dice, but in reality she waits until after Bob has already done his calculations and secretly rolls the dice right before Bob walks in the room. Could Bob have any valid complaint about this?
Once you translate into anticipated experiences of some observer in some situation, it seems like the difference between the two camps is about the general leniency with which we grant that the observer can make additional assumptions about their situation. But I don’t see how you can opt out of assuming something: Any framing of the P(“sun will rise tomorrow”) problem has to implicitly specify a model, even if it’s the infinite-coin-flip model.
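One concrete way to make that implicit model explicit is Laplace’s rule of succession (my example, not the commenter’s): a uniform prior over the sun’s unknown “rise probability,” updated on n straight sunrises, predicts the next one with probability (n + 1)/(n + 2).

```python
def rule_of_succession(successes: int, trials: int) -> float:
    # Posterior predictive P(next success) under a uniform Beta(1, 1) prior
    # on the unknown per-trial success probability.
    return (successes + 1) / (trials + 2)

print(rule_of_succession(100, 100))  # ~0.9902 after 100 sunrises in a row
```

The point stands either way: the number you get is a consequence of the model you chose, and some model must be chosen.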
Sorry, I didn’t mean to imply that probabilities only apply to the future. Probabilities apply only to uncertainty.
That is, given the same set of data, there should be no difference between event A happening, and you having to guess whether or not it happened, and event A not having happened yet—and you having to guess whether or not it will happen.
When you say “apply a probability to something,” I think:

“If one were to have to make a decision based on whether or not event A will happen, how would one consider the available data in making this decision?”
The only time event A happening matters is if it happening generated new data. In the Bob-Alice situation, Alice rolling a die in a separate room gives zero information to Bob—so whether or not she already rolled it doesn’t matter. Here are a couple of different situations to illustrate:
A) Bob and Alice are in different rooms. Alice rolls the die and Bob has to guess the number she rolled.

B) Bob has to guess the number that Alice’s die will roll. Alice then rolls the die.

C) Bob watches Alice roll the die, but did not see the outcome. Bob must guess the number rolled.

D) Bob is a supercomputer which can factor in every infinitesimal fact about how Alice rolls the die, and the die itself, upon seeing the roll. Bob-the-supercomputer watches Alice roll the die, but did not see the outcome.
In situations A, B, and C—whether or not Alice rolls the die before or after Bob’s guess is irrelevant. It doesn’t change anything about Bob’s decision. For all intents and purposes, the questions “What did Alice roll?” and “What will Alice roll?” are exactly the same question. That is: We assume the system is simple enough that rolling a fair die is always the same. In situation D, the questions are different because there’s different information available depending on whether or not Alice rolled already. That is, the assumption of a simple system isn’t there, because Bob is able to see the complexity of the situation and make the exact same kind of decision. Alice having actually rolled the die does matter.
I don’t quite understand your “likely or not likely” question. To try to answer: If an event is likely to happen, then your uncertainty that it will happen is low. If it is not likely, then your uncertainty that it will happen is high.
(Sorry, I totally did not expect this reply to be so long.)
So, you are interpreting probabilities as subjective beliefs, then? That is the Bayesian, but not the frequentist, approach.
Having said that, it’s useful to realize that the concept of probability has many different… aspects and in some situations it’s better to concentrate on some particular aspects. For example if you’re dealing with quality control and acceptable tolerances in an industrial mass production environment, I would guess that the frequentist aspect would be much more convenient to you than a Bayesian one :-)
If an event is likely to happen, then your uncertainty that it will happen is low.

You may want to reformulate this, as otherwise there’s a lack of clarity with respect to the uncertainty about the event vs. the uncertainty about your probability for the event. But otherwise you’re still saying that probabilities are subjective beliefs, right?
My best try: Frequentist statistics are built upon deductive logic; essentially a single hypothesis. They can be used for inductive logic (multiple hypotheses), but only at the more advanced levels which most people never learn. With Bayesian reasoning inductive logic is incorporated into the framework from the very beginning. This makes it harder to learn at first, but introduces fewer complications later on. Now math majors feel free to rip this explanation to shreds.
They are the same thing. Gertrude Stein had it right: probability is probability is probability. It doesn’t matter whether your interpretation is Bayesian or frequentist. The distinction between the two is simply how one chooses to apply probability: as a property of the world (frequentist) or as a description of our mental world-models (Bayesian). In either case the rules of probability are the same.
This phrasing suggests that Bayesians can’t accept quantum mechanics except via hidden variables. This is not the case.
Taboo the word Bayesian.
I was talking about the Bayesian interpretation of probability. An interpretation, not a category of person. Quantum mechanics without hidden variables uses the frequentist interpretation of probability.
Sometimes in life we use probability in ways that are frequentist. Other times we use probability in ways that are Bayesian. This should not be alarming.
Fair enough. The idea of calling QM ‘frequentist’ really stretches the reason for using that term under anything but an explicit collapse interpretation. Maybe it would be more of a third way -
Frequentism would be that the world is itself stochastic.
Fractionism would be that the world takes both paths and we will find ourselves in one.
Bayes gets to keep its definition.