Ignoring, temporarily, everything but the first paragraph, there are two ways I might proceed.
Acting as a frequentist, I would suppose that die rolls could be modeled as independent, identically distributed draws from a multinomial distribution with fixed but unknown parameters. (The independence assumption, and to a lesser degree the identically-distributed assumption, could also be verified, although this gets a bit tricky.) I would roll the die some fixed number of times (possibly determined by an a priori calculation of statistical power) and take the MLE as a point estimate of the unknown parameters. I would report this estimate as the probabilities of the die landing on its various sides. I might also report a 95% confidence region for the estimate, which is not to be interpreted as containing the true probabilities 95% of the time (it either does or does not, with certainty).
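A minimal sketch of this recipe, assuming made-up side counts; the per-side bootstrap intervals below are only a stand-in for a proper joint 95% confidence region, which is fiddlier to construct:

```python
import numpy as np

rng = np.random.default_rng(0)

counts = np.array([2, 0, 1, 3, 2, 2])   # hypothetical observed side counts
n = counts.sum()

# The multinomial MLE is just the vector of observed relative frequencies.
p_hat = counts / n

# Bootstrap: resample n rolls from the empirical distribution many times
# and look at the spread of the re-estimated probabilities.
boot = rng.multinomial(n, p_hat, size=10_000) / n
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)

print("MLE:", p_hat.round(3))
print("approximate 95% intervals:", list(zip(lo.round(3), hi.round(3))))
```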
Acting as a Bayesian, I would assume the same data model, but I would also place a prior distribution on the unknown parameter. A natural prior in this case is the Dirichlet distribution, which is conjugate to the multinomial. I would use the same data-collection approach, although the Bayesian formulation makes it easy to work with the special case of observing a single roll. Given the model likelihood and the prior distribution, Bayes' law tells me the posterior distribution to which I should update to represent my uncertainty about the unknown parameter. I would continue to roll the die and update until the posterior is sufficiently concentrated according to some reasonable stopping criterion. I would then report the posterior mean (or maybe the MAP estimate) as the probabilities of the die landing on its various sides. I would also report a 95% credible region for the estimate, to which I would assign 95% credence of containing the truth (although under questioning, I would probably be evasive or unclear about exactly what that means). I would also need to communicate a justification for my prior distribution and, ideally, evidence that the inference is not overly sensitive to it. I ought to just report the posterior distribution itself, but people tend to find it easier to base decisions on point estimates.
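And a matching sketch of the conjugate update; the uniform Dirichlet(1, …, 1) prior here is just one illustrative choice, exactly the kind of choice that would need justifying:

```python
import numpy as np

rng = np.random.default_rng(0)

counts = np.array([2, 0, 1, 3, 2, 2])   # same hypothetical rolls as above
alpha_prior = np.ones(6)                # uniform Dirichlet prior (one choice among many)

# Conjugacy: Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts) posterior.
alpha_post = alpha_prior + counts

post_mean = alpha_post / alpha_post.sum()
draws = rng.dirichlet(alpha_post, size=10_000)
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)

print("posterior mean:", post_mean.round(3))
print("approximate 95% credible intervals:", list(zip(lo.round(3), hi.round(3))))
```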
There are obvious similarities between these two inferential approaches, but they answer slightly different questions using vastly different methods.
Suppose you are denied experimentation and denied an extremely powerful computer (e.g., you can only do <100 simulated trials but want reasonable accuracy), or you need high accuracy in limited time. I was more interested in what you do when you have to solve something like this analytically, finding the probabilities for the 3 distinct sides.
The point here is that you want to go for physically justified stuff, and anything not physically justified that you are doing anywhere is the same in principle as wilfully putting cognitive bias into your calculations, and is just plain wrong; no philosophical stuff here, you’ll end up losing games vs someone who solves it better. Maybe you guys need an “Overcoming Bayes” blog.
The point here is that you want to go for physically justified stuff, and anything not physically justified that you are doing anywhere is the same in principle as wilfully putting cognitive bias into your calculations, and is just plain wrong; no philosophical stuff here, you’ll end up losing games vs someone who solves it better.
Statisticians, by and large, don’t lose sleep over this problem. Even in your not-quite-fair die problem, the calculations involved are really hard. It wasn’t made explicit in my comment, but I wasn’t even assuming that opposite sides have equal probability, because some subtle error in the setup could break the symmetry. In the Bayesian case, I considered mentioning a mixture model that would take advantage of the symmetry if the data supported it. In KDD Cup types of problems, nobody is worried that a domain expert will show up with a winning solution that doesn’t even need to see the training data (why would it, if it were maximally physically justified?).
putting cognitive bias into your calculations, and is just plain wrong; no philosophical stuff here, you’ll end up losing games vs someone who solves it better. Maybe you guys need an “Overcoming Bayes” blog.
Bayesians have made peace with bias. In fact, decision rules that are both Bayes and unbiased have zero risk, which is a nice way of saying that they don’t exist in non-trivial situations. Noorbaloochi and Meeden (1983) have to go through definitional contortions to establish a positive connection between being Bayes and unbiased.
Bias is what lets you get good inferential performance in small-sample regimes. If I observe side counts (2, 0, 1, 3, 2, 2), I’d be okay with my estimator inferring equal side probabilities, because that will be closer to the truth than the unbiased estimator, which guesses (0.2, 0.0, 0.1, 0.3, 0.2, 0.2); ten rolls is not enough data to tell me that I should never see a “2”. On the other hand, with side counts (200, 0, 100, 300, 200, 200), something closer to the unbiased estimator seems like a good idea. As long as the estimator is asymptotically unbiased, you can still have consistency.
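One way to make that tradeoff concrete, using the counts above; the prior strength (alpha = 5) is an arbitrary illustration, not a recommendation:

```python
import numpy as np

def mle(counts):
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()          # unbiased, but noisy at small n

def shrunken(counts, alpha=5.0):
    # Posterior mean under a symmetric Dirichlet(alpha) prior: shrinks toward uniform.
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

for counts in ([2, 0, 1, 3, 2, 2], [200, 0, 100, 300, 200, 200]):
    print(counts)
    print("  unbiased MLE      :", mle(counts).round(3))
    print("  shrunken estimate :", shrunken(counts).round(3))
```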
Unlike cognitive bias, we have control over our statistical bias, and we should not be squeamish about using it to learn about the parts of the world that are too hard to model with the kind of complete accuracy that would make statistics unnecessary.
The point of the not-quite-fair die example was to demonstrate where ‘probabilities’ come from. The fair die, after several bounces, maps the initial state space onto the final side-up states in a particular way, so that 1/6 of even a very tiny part (hypervolume) of the initial state space maps to each side-up final state. The not-totally-fair die is somewhat biased away from that. Any problem involving the die can be solved from first principles, all the way from this, through selection of the parts of the initial state space that are compatible with observation, to the answer.
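The actual die dynamics are far too messy to write down here, but a toy stand-in can illustrate the claim about tiny hypervolumes: a map that stretches initial conditions enough sends even a very narrow slice of the initial state space roughly uniformly onto the six outcomes. (The map below is purely illustrative; it is not a die model.)

```python
import numpy as np

rng = np.random.default_rng(0)

STRETCH = 1e6   # stand-in for the sensitivity accumulated over many bounces

def side_up(x):
    # Toy "die": a strongly stretching map from an initial state x in [0, 1) to a side 1..6.
    return int(6 * ((STRETCH * x) % 1.0)) + 1

# Sample initial states from a *tiny* interval of the initial state space.
xs = rng.uniform(0.123456, 0.123457, size=100_000)
sides, freq = np.unique([side_up(x) for x in xs], return_counts=True)
print(dict(zip(sides.tolist(), (freq / freq.sum()).round(3))))   # each side gets roughly 1/6
```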
With regard to statisticians not losing sleep over that, there are a zillion examples in practice where you have to deal with, e.g., electric current, or temperature, or illumination, or any other fundamentally statistical property, and you have limited computational power. A lot of my work is doing this for illumination; I have to compute illumination at a huge number of points on the screen (and no, you can’t brute-force it even if you had 1000x the computing power, not to mention that when there is 1000x the power you’ll have tighter constraints on error and time). I don’t really care if some people don’t find anything wrong with doing a wrong thing “because we won’t be beaten in practice”, when I am earning some of my money by beating those folks in practice. So much the better for me that some folks just don’t understand that you shouldn’t get to choose some arbitrary numbers. Yes, in various really fuzzy problems, you can do whatever you subjectively please. But to see this as fundamental is quite seriously silly.
There are many methods for finding the resulting distribution; one particular method involves sampling the initial state more regularly than at random (e.g., a grid with jittering), so that you get error that improves much faster than 1/sqrt(N). It can in principle be used for die simulation, and it is used in practice for similar problems that are less messy (molecular dynamics comes to mind). I generally find that nowadays a lot of very important insights are within the more applied fields; the knowledge has not yet propagated into this meta-ish land of arguing mostly over terminology and not having to be maximally correct against the golden standard of reality.
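A minimal illustration of the jittered-grid idea on a one-dimensional toy integral (nothing die- or illumination-specific); for a smooth integrand the stratified estimator’s error shrinks much faster than the plain Monte Carlo 1/sqrt(N) rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                         # smooth toy integrand on [0, 1]; exact integral = 0.5
    return np.sin(np.pi * x) ** 2

def plain_mc(n):
    return f(rng.uniform(0.0, 1.0, n)).mean()

def jittered_grid(n):
    # One uniformly jittered sample per grid cell (stratified sampling).
    return f((np.arange(n) + rng.uniform(0.0, 1.0, n)) / n).mean()

for n in (16, 256, 4096):
    mc_err  = np.mean([abs(plain_mc(n) - 0.5) for _ in range(200)])
    jit_err = np.mean([abs(jittered_grid(n) - 0.5) for _ in range(200)])
    print(f"N={n:5d}   plain MC error ~ {mc_err:.1e}   jittered grid error ~ {jit_err:.1e}")
```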
Any problem involving the die can be solved from first principles, all the way from this, through selection of the parts of the initial state space that are compatible with observation, to the answer.
You’re sketching out a methodology for solving forward problems (given model, determine observations), which is fine but it’s not what motivates statisticians. Statisticians are generally concerned with the backward/inverse problem (given observations, determine model).
In reality, we’re not presented with complete and accurate technical specifications for the die/table/thrower system we encounter. All we get to see is the sequence of sides that landed on top. If we’re playing a game that uses the die, it’s of interest to know how this sequence will continue into the future.
One general approach to figuring this out might involve inferring the technical specifications. Maybe if we’re really clever, we can figure out what grade of steel the die is made of just from the observed side counts. Less ambitiously, we might try to recover the relative side lengths and the rounding radius. With all this information, we can then simulate forward to estimate the sequence of future throws. The parameters involved here may number in the tens or hundreds, or in the millions if we want to capture all the physiological details of a human thrower. It’s also not quite clear whether a system like this would even converge to any stationary long-term behavior from which limiting relative frequencies could be calculated.
Another approach is to ignore all that detail, assume independent, identically distributed tosses, and just try to learn the five parameters P(side 1), …, P(side 5), with P(side 6) = 1 - P(side 1) - … - P(side 5). Forward simulation in this case is just repeated sampling from the learned distribution.
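Forward simulation under this model is a one-liner; the parameter values below are of course hypothetical, not estimates from real rolls:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned parameters P(side 1)..P(side 5); P(side 6) is determined by the rest.
p = np.array([0.18, 0.15, 0.17, 0.16, 0.17])
p = np.append(p, 1.0 - p.sum())

future_rolls = rng.choice(np.arange(1, 7), size=50, p=p)   # 50 simulated future tosses
print(future_rolls)
```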
Moreover, let’s suppose that (effective) independence emerges from the technical specification model. Then we have a huge identifiability problem; all those hundreds of parameters are just providing a redundant parameterization of the iid model. We can’t hope to learn all of the parameters from the data we get to observe.
I guess as long as you want to stick to forward problems, you can invoke Occam and deny that probability even exists. But don’t assume that your understanding carries over to inverse problems. Probability is a useful technical tool there, and applying it to real problems requires translation/operationalization. Two different frameworks for this are frequentism and Bayesianism.
I don’t really care if some people don’t find anything wrong with doing a wrong thing “because we won’t be beaten in practice”, when I am earning some of my money by beating those folks in practice.
If you want to put your money where your mouth is, I have a proposal. Take a die of your choosing, or manufacture one according to your own specifications; it doesn’t have to be remotely fair. Also supply a plate onto which it can be tossed if you desire. Do whatever measurements you want on them. Then convey them to a mutually-accepted third party. The third party rolls the die 200 times, according to instructions you publicly post, and then publicly posts the first half of the sequence of rolls and a hash of the second half of the sequence. We both predict the side counts in the second half of the sequence and post the predictions publicly. The third party reveals the second half of the sequence (which can be checked against the hash) and whoever was closer to the true side counts (in squared error distance) wins. The loser pays the winner some mutually-accepted amount, plus or minus half the die/plate shipping expenses as appropriate to split that cost.
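The hash step is a standard commit-and-reveal device; a minimal sketch, assuming SHA-256 and a random salt (neither of which the proposal pins down), might look like this:

```python
import hashlib
import secrets

def commit(second_half_rolls):
    """Third party publishes the digest now, keeping the salt and rolls secret until the reveal."""
    salt = secrets.token_hex(16)
    message = salt + ":" + ",".join(str(r) for r in second_half_rolls)
    return salt, hashlib.sha256(message.encode()).hexdigest()

def verify(second_half_rolls, salt, published_digest):
    message = salt + ":" + ",".join(str(r) for r in second_half_rolls)
    return hashlib.sha256(message.encode()).hexdigest() == published_digest

salt, digest = commit([3, 6, 1, 2, 5])        # commit to a (hidden) second half of rolls
print(verify([3, 6, 1, 2, 5], salt, digest))  # after the reveal, anyone can check -> True
```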
I am an applied mathematician who actually works on finding the values of probabilistic quantities in less computing time than straightforward numerical experimentation would take. Probability is not just statistics.
Insofar as what you think Bayesians do deviates from what I know has to be done, you have a wrong idea of what Bayesians do (or, giving you the benefit of the doubt at the expense of others, you are referring to some “Bayesians” who are plain wrong), or something like that, but the discussion is too fuzzy for me to tell which. (Ditto for frequentists.)
The point of frequentism is seeing the probability as the frequency in an infinite number of trials. The point of my die example is to demonstrate that, physically, the probability plainly comes in as a frequency, via a function from initial phase space to final phase space that maps, for a fair die, 1/6 of the initial phase space to each final side-up state, this being an objective property of the system that has to be adequately captured by whatever methods you are using. And I do not give the slightest damn if you don’t know that in practice (not for dice but for many other systems) you have to find probabilities bottom-up from, e.g., the laws of physics. If you are given a steel die to physically experiment with, there again are a lot better (faster) ways to find out the probabilities than just tossing (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!). Of course I won’t bother making for you some example with an actual die; the point is the principle, and I’ve done such solutions before with things that unfortunately don’t make great examples.
edit: also, on science, the reason we do ‘probability of data given model’ is that science follows a strategy of committing to only rarely (with a certain probability) throwing out a valid model. ‘Probability of model given the data’ is not well defined, unless you count stuff like ‘Solomonoff induction as a prior’, where it is defined but not computable (and is mathematically homologous to assigning probability 1 to the ‘we live inside a Turing machine’ model). The experimental physicists publish the probability of the data given the model; people can then combine that with their priors if they want.
If you are given a steel die to physically experiment with, there again are a lot better (faster) ways to find out the probabilities than just tossing (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!).
The world often isn’t nice enough to give us the steel die. Figuratively, the steel die may be inside someone’s skull, thousands of years in the past, millions of light-years away, or you may have five slightly different dice and really want to learn about the properties of all dice.
I do understand the O(N^(-1/2)) convergence of errors. I spend a lot of time working on problems where even consistency isn’t guaranteed (i.e., nonparametric problems where the “number of parameters” grows in some sense with the amount of data) and finding estimators with such convergence properties would be great there.
‘Probability of model given the data’ is not well defined, unless you count stuff like ‘Solomonoff induction as a prior’, where it is defined but not computable (and is mathematically homologous to assigning probability 1 to the ‘we live inside a Turing machine’ model).
It’s perfectly well-defined. It’s just subjective in a way that apparently makes you (and a great number of informed, capable, and thoughtful statisticians) very uneasy. There’s some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer, regardless of the choice of prior, given enough data. You probably wouldn’t be happy with the rates of convergence for these methods, because they tend to be slower and harder to obtain than those for, e.g., maximum likelihood estimation with iid normally distributed data.
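For concreteness, the disputed quantity is just the posterior from Bayes' law, written here for a model with parameter θ, data x, and prior π; the subjectivity lives entirely in the choice of π:

$$
P(\theta \mid x) \;=\; \frac{P(x \mid \theta)\,\pi(\theta)}{\int P(x \mid \theta')\,\pi(\theta')\,d\theta'}
$$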
The experimental physicists publish the probability of the data given the model; people can then combine that with their priors if they want.
They might well do this. For a frequentist, this is a natural step in establishing confidence intervals and such, after the quantity of interest has been estimated by choosing the model that maximizes the probability of the data. This choice may not look like “Standard Model versus something else”, but it probably looks like “semi-empirical model of the system with parameter 1 = X”, where X can range over some reasonable interval.
unless you count stuff like ‘Solomonoff induction as a prior’
I don’t see what role Solomonoff induction plays in a discussion of frequentism versus Bayesianism. I never mentioned it, I don’t know enough about it to use it, and I agree with you that it shows up on LW more as a mantra than as an actual tool.
The world often isn’t nice enough to give us the steel die.
The point is that probability with the die comes in as a frequency (the fraction of initial phase space). Yes, sometimes nature doesn’t give you the die; that does not invalidate the fact that there exists a probability as an objective property of a physical process, as per frequentism (related to how the process maps initial phase space to final phase space); the methods employing subjectivity have to try to conform to this objective property as closely as possible (e.g., by trying to know more about how the system works). Bayesianism is not opposed to this, unless we are to speak of some terribly broken Bayesianism.
‘Probability of model given the data’ is not well defined,
It’s perfectly well-defined.
Nope. Only the change to the probability of the model given the data is well defined. The probability itself isn’t; you can pick an arbitrary starting point.
There’s some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer,
The notion of ‘true answer’ is frequentist....
edit: Recall that the original argument here was about the trope of Bayesianism being opposed to frequentism, etc. The point with Solomonoff induction is that once you declare something like this a source of priors, all the math you’ll be doing should be completely identical to frequentist math (where the frequencies are within Turing machines fed random tape, and the math is done as in my top-level post for the die), just as long as you don’t simply screw your math up. The point with the die example was that no Bayesian worth their salt objects to there being a property of the chaotic process, namely what fraction of the initial phase space gets mapped to where, because there really is this property.