Well, it should be noted that, if a theory based on a subset of the data predicts the whole data set, then that theory has a higher probability of being correct than a theory based on the whole data set.
But that’s exactly because you don’t trust the scientist who came up with the hypothesis while looking at the whole data set to discount correctly for the complexity of their hypothesis. This might happen either because you think they’re irrational, or because you’re worried about intellectual dishonesty—though in the latter case you should also worry about the scientist with the allegedly limited-data theory having snuck a peek at the full set, or having come up with enough overly specific theories that one of them was likely to survive the follow-up test.
As the comments to that post say, if you can actually look at the hypotheses in question, and you’re completely confident in your own judgement of simplicity, that judgement completely screens off how much data was used in formulating them.
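To make the screening-off point concrete, here’s a toy sketch (my own illustration, with made-up hypotheses, likelihoods and prior, nothing from the linked post): once the prior encodes your simplicity judgement, the posterior is computed from the hypotheses and the full data alone.

```python
# Toy sketch: with an explicit simplicity prior and explicit likelihoods, the
# posterior over hypotheses is a function of the hypotheses and the full data
# only -- how much of the data the proposer looked at never enters.

def posterior(hypotheses, data):
    """hypotheses: list of (name, prior, likelihood_fn). Returns normalized posteriors."""
    weights = {}
    for name, prior, likelihood in hypotheses:
        w = prior
        for observation in data:
            w *= likelihood(observation)
        weights[name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Two made-up hypotheses about coin flips (1 = heads), with a prior that
# penalizes the more complex one. All numbers are arbitrary.
data = [1, 1, 0, 1, 1, 1, 0, 1]
hypotheses = [
    ("simple: p(heads) = 0.5", 0.7, lambda x: 0.5),
    ("complex: p(heads) = 0.75", 0.3, lambda x: 0.75 if x == 1 else 0.25),
]
print(posterior(hypotheses, data))
```

The procedure that generated the hypotheses appears nowhere in the computation; it only matters to the extent that you don’t trust your own prior.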
The idea behind the scientific method is to design procedures that are robust to the scientist being biased or incompetent or even corrupt. Any approach that starts with “assume a perfect scientist” is not going to work in reality.
Science is a set of hacks to get usable modelling out of humans, accepting that
1. there are things that humans do which are critical to modelling reality, and which you do not understand to the point of being able to reimplement them, but
2. you also can’t just leave humans to do free-form theorizing, because that has been conclusively shown to lead to all kinds of problems.
The critical black box in this specific case is about how to judge a theory’s simplicity, and what the best way to build a prior from that is.
As long as either of these things is a black box to you, you won’t be able to do much better than using high-level heuristic hacks of the sort science is made out of. But that’s going to bite you every time you don’t have the luxury of being able to apply these hacks—say because you’re modelling (some aspect of) human history, and can’t rerun the experiment. Also, you won’t be able to build an AGI.
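For what it’s worth, here’s a crude sketch of what cashing out those two black boxes might look like. The compressed-length proxy and the 2^-length weighting below are my own stand-ins for the real (uncomputable) thing, not a claim about how science or any particular formalism actually does it:

```python
# Crude sketch: approximate "simplicity" by compressed description length,
# then build a prior proportional to 2**(-length). Both choices are
# placeholder approximations, used only to make the two black boxes concrete.
import zlib

def description_length(theory_text: str) -> int:
    """Proxy for complexity: number of bytes after compression."""
    return len(zlib.compress(theory_text.encode("utf-8")))

def simplicity_prior(theories: dict) -> dict:
    """theories: name -> plain-text description. Returns a normalized prior."""
    weights = {name: 2.0 ** (-description_length(text)) for name, text in theories.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

theories = {
    "T1": "all ravens are black",
    "T2": "all ravens are black except on Tuesdays in odd-numbered years, when some are green",
}
print(simplicity_prior(theories))
```

How much probability mass a given difference in simplicity should be worth is exactly the part that stays a judgement call.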
In addition, if you’re really worried about corruption, the holding-back-data-on-purpose scheme sets up an opportunity for great profits, like this:
1. Corrupt scientist takes out a loan for BIGNUM $.
2. Corrupt scientist pays this money to someone with access to the still-secret data.
3. Bribed data keeper gives corrupt scientist a copy of the data.
4. Corrupt scientist fits their hypothesis to the whole data set.
5. Corrupt scientist publishes hypothesis.
6. Full data set is released officially.
7. Hypothesis of corrupt scientist is verified to match whole data set. Corrupt scientist gains great prestige, and uses that to obtain sufficient money to pay off the loan from 1, and then some.
You could try to set up the data keeper organization so that a premature limited data release is unlikely even in the face of potentially large bribes, but that seems like a fairly tough problem (and are they even thinking about it seriously?). Data is very easy to copy; preventing it from being copied is hard. And in this case, more so than in most cases where you’re worried about leaks, figuring out that a leak has in fact happened might be extremely difficult—at least if you really are ignorant about what hypothesis simplicity looks like.
But that’s going to bite you every time you don’t have the luxury of being able to apply these hacks—say because you’re modelling (some aspect of) human history, and can’t rerun the experiment.
? History sounds like exactly the situation where “hold back half the data, hypothesise on the other half, then look at the whole” is the only reasonable way of going about this.
Also, you won’t be able to build an AGI.
I don’t follow that argument at all—in the worst-case scenario, you can brute-force it by scanning and modelling a human brain. But even if it were true, it’s not really an issue for social scientists and their ilk. And there the “look at half the data” approach would definitely improve their procedures. It would make science work for the “flawed but honest” crowd.
As for deliberately holding back half the data from other scientists (as opposed to one guy simply choosing to only look at half), that’s a different issue. I’ve got no really strong feelings on that. It could go either way.
It’s an OK hack for someone in the “flawed but honest” crowd, individually. But note that it really doesn’t scale to dealing with corruption (which was one of the problems I assumed in the post you replied to).
Extended to an entire field, this means that you may end up with N papers, all about the same data set, all proposing a different hypothesis that produces a good match on the set, and all of them claiming that their hypothesis was formulated using this procedure. IOW, you end up with unverifiable “trust us, we didn’t cheat” claims for each of those hypotheses. Which is not a good basis for arriving at a consensus in the field.
Re AI design, assuming you actually understand what you implemented (as opposed to just blindly copying algorithms from the human brain without understanding what they do), the reason this method would work is that you’ve successfully extracted the human built-in simplicity prior (and I don’t know how good that one is exactly, but it has to be a halfway workable approximation; otherwise humans couldn’t model reality at all).
As the comments to that post say, if you can actually look at the hypotheses in question, and you’re completely confident in your own judgement of simplicity, that judgement completely screens off how much data was used in formulating them.
I agree that it wouldn’t matter how much data we gave the scientists if they had fixed a method for turning data into a theory beforehand.
And I agree that such a method should settle on the simplest theory among all candidates. It should implement Occam’s razor.
But we shouldn’t expect the scientists to fix such a method before seeing the data. Occam’s razor is not enough. You first have to have a computationally feasible way to generate good candidate theories from which you choose the simplest one. And we have every reason to expect that cosmologists will eventually come up with better methods for turning cosmological data into good candidate theories. Therefore, it doesn’t make sense to force the cosmologists to bind themselves to a method now. They need the freedom to discover better methods than any that they’ve yet found.
The requirement of “computational feasibility” means that we can expect to have several candidate methods with no a priori way to judge confidently that one is better than the other. We will need recourse to empirical observations to compare the methods.
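To illustrate the kind of empirical comparison I mean, here is a minimal harness; the two “methods” in it are throwaway placeholders of mine, and the point is the shape of the test, not the methods themselves:

```python
# Minimal harness for comparing theory-generating methods empirically: each
# "method" maps a batch of data to a theory (here, a predictor), and the
# theory is scored on held-out observations the method never saw.
# The two methods below are deliberately silly placeholders.

def method_mean(train):
    """Theory: every future value equals the mean of the training data."""
    m = sum(train) / len(train)
    return lambda index: m

def method_last(train):
    """Theory: every future value equals the last observed value."""
    last = train[-1]
    return lambda index: last

def held_out_error(method, data, split):
    train, held_out = data[:split], data[split:]
    theory = method(train)
    return sum((theory(i) - y) ** 2 for i, y in enumerate(held_out, start=split))

data = [2.0, 2.1, 1.9, 2.0, 2.2, 2.1, 2.0, 1.9, 2.1, 2.0]
for method in (method_mean, method_last):
    print(method.__name__, held_out_error(method, data, split=5))
```

The same shape works whether the theory is produced by an algorithm or by a cosmologist; the held-out observations are what let us compare methods we couldn’t rank a priori.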
In this comment of mine to the post linked above, I showed that if a method produces a theory that predicts the whole data set from a subset, then that method is probably superior to a method that uses the whole data set. The proof goes through even if we assume that each method has a step where it applies Occam’s razor:
Define a method to be a map that takes in a batch of evidence and returns a theory. We have two assumptions:
ASSUMPTION 1: The theory produced by giving an input batch to a method will at least predict that input. That is, no matter how flawed a method of theory-construction is, it won’t contradict the evidence fed into it. More precisely,
p(M(B) predicts B) = 1.
[...]
ASSUMPTION 2: If a method M is known to be flawed, then its theories are less likely to make correct predictions of future observations. More precisely, if B2 is not contained in B1, then [...]
And I agree that such a method should settle on the simplest theory among all candidate theories. It should implement Occam’s razor.
It’s not quite that simple in practice. There’s a tradeoff here, between accuracy in retrospect and theory simplicity. The two extreme pathological cases are:
1. You demand absolute accuracy in retrospect, i.e. P(observed data | hypothesis) = 1. This is the limit case of overfitting, and yields a GLUT (giant lookup table), which makes no predictions about the future, or only completely useless ones.
2. You demand maximum simplicity. This is the limit case of underfitting, and yields a maximum-entropy distribution.
You want something in between those two cases. I don’t know where exactly, but you would have to figure out some way to determine that point if you were, say, building an AGI.
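Here’s a toy illustration of that tradeoff with made-up numbers. The parameter counting and the BIC-style penalty are my own rough accounting, not a claim about the right way to pick the in-between point:

```python
# Toy scoring of three "hypotheses" about a sequence of coin flips:
# retrospective log-likelihood minus a crude complexity charge
# (BIC-style: 0.5 * free_parameters * ln(n)). The charge is one arbitrary
# choice among many; picking it well is exactly the open question above.
import math

flips = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]  # 13 heads, 3 tails
n = len(flips)

def log_likelihood(p_heads):
    return sum(math.log(p_heads if f == 1 else 1.0 - p_heads) for f in flips)

candidates = {
    # GLUT: memorizes the exact sequence, so retrospective fit is perfect
    # (log-likelihood 0), but it needs one free parameter per flip and says
    # nothing useful about the next flip.
    "lookup table (overfit)": (0.0, n),
    # Maximum entropy: simplest possible, but fits the data poorly.
    "max-entropy p=0.5 (underfit)": (log_likelihood(0.5), 0),
    # One fitted parameter: the in-between option.
    "Bernoulli p=13/16": (log_likelihood(13 / 16), 1),
}
for name, (ll, k) in candidates.items():
    score = ll - 0.5 * k * math.log(n)
    print(f"{name:30s} logL={ll:7.2f}  params={k:2d}  score={score:7.2f}")
```

With these numbers the lookup table wins on retrospective fit, the maximum-entropy model wins on simplicity, and the one-parameter model wins once both are weighed together; where exactly that weighing should sit is the part I don’t know how to pin down.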
I can’t really follow your earlier post. Specifically, I can’t parse your use of “predicts”, which you seem to use as a boolean value. But theories don’t “predict” or “not predict” outcomes in any absolute sense; they just assign probabilities to outcomes. Please explain your use of the phrase.
Sorry, the earlier post was in the context of a toy problem in which predictions were boolean. I should have mentioned that. (I had made this assumption explicit in an earlier comment.)
My argument shows that, in the limiting case of boolean predictions, we should trust successful theories constructed using a subset of the data over theories constructed using all the data, even if all the theories were constructed using Occam’s razor. This at least strongly suggests the same possibility in more realistic cases where the theories assign probability distributions.
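For concreteness, here’s how that update looks with some made-up numbers. This is only the direction of the argument in the boolean toy setting, not the proof from the earlier comment:

```python
# Boolean toy setting: update on whether a theory-generating method is flawed,
# after its theory (built from batch B1 only) correctly predicted batch B2.
# All probabilities here are invented for illustration.
p_flawed = 0.5               # prior probability that the method is flawed
p_hit_given_ok = 0.8         # P(theory predicts B2 | method not flawed)
p_hit_given_flawed = 0.3     # P(theory predicts B2 | method flawed)

# A theory always "predicts" the batch it was built from (Assumption 1), so a
# theory built from the whole data set offers no such test.
p_hit = p_hit_given_ok * (1 - p_flawed) + p_hit_given_flawed * p_flawed
p_flawed_after = p_hit_given_flawed * p_flawed / p_hit
print(f"P(flawed): {p_flawed:.2f} before, {p_flawed_after:.2f} after a successful prediction of B2")
```

A successful out-of-sample prediction shifts weight toward the method, and hence its theory, being trustworthy, which is all the boolean argument needs.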
Ok, I think I get your earlier post now. I think you might be overcomplicating things here.
Sure, if you’re not confident about what the correct simplicity prior is, you can get real evidence about which theory is likely to be stronger by observing things like their ability to correctly predict the outcome of new experiments. And to the extent that this tells you something about the way the originating scientist generates theories, there should even be some shifting of probability mass regarding the power of other theories produced by the same scientist. But that’s quite a lot of indirection, and there are significant unknown factors that will dilute these shifts.
Attempting this is somewhat like trying to estimate the probability of a scientist being right about a famous problem in their field based on their prestige. There’s a signal, but it’s quite noisy.
If you know what simplicity looks like (and of course that’s uncomputable, but you can always approximate), and how much it’s worth in terms of probability mass, you can make a much better guess as to which hypothesis is stronger by just looking at the actual hypotheses.
Looking at things like “how many experimental results did this hypothesis actually predict correctly” is only informative to the extent that your understanding of simplicity and its value is lacking. Note that “lacking understanding of simplicity” isn’t meant to be especially disparaging; a good understanding of simplicity is hard. There’s a reason the scientific process includes an inelegant workaround instead.