necroreply: Back up to the actual use of the data, which is identification of tasty beers—an “inherently meaningful” confidence level is one which provides the most useful recommendations to the end user. This is reflected in the way the post describes BeerAdvocate changing their system—they had their confidence level set so high that only extremely popular beers could move significantly away from the average, and they concluded that this was reducing the value of their ratings.
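Concretely, the kind of shrinkage being described looks something like the sketch below (a minimal sketch with made-up numbers and a hypothetical prior_weight parameter, not BeerAdvocate's actual formula): with a heavy prior weight, a beer with only a handful of votes barely moves off the site-wide mean, which is exactly the behavior they decided was too conservative.

```python
def shrunk_rating(votes, prior_mean, prior_weight):
    """Pull a beer's raw average toward the site-wide mean.

    prior_weight acts like that many phantom votes at the mean: the
    larger it is, the more real votes a beer needs before its score
    can move away from prior_mean.
    """
    return (prior_weight * prior_mean + sum(votes)) / (prior_weight + len(votes))

site_mean = 3.8                 # hypothetical site-wide average
few_votes = [5, 5, 4, 5, 5]     # a little-known beer with rave reviews

# Heavy prior: the beer stays pinned near the mean (~3.85).
print(shrunk_rating(few_votes, site_mean, prior_weight=100))
# Light prior: the same votes move it substantially (~4.3).
print(shrunk_rating(few_votes, site_mean, prior_weight=5))
```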
Fair, but I think capturing that is possibly beyond the scope of their article. If you can come up with a good way to evaluate that beyond gut instinct and vague heuristics on how a specific data set “ought” to behave/look, I would love to hear it—it’s been an area I’ve had trouble with before :)
I can think of two possibilities right off the bat—there are probably others (customer satisfaction surveys?) that I’m not thinking of that would work:
1. Measure the ability of the scoring rubric to correlate with trusted expert rankings.
2. Measure the ability of the scoring rubric to predict future votes.
(Of course, 2 has the problem that it is basically measuring the variable that Bayesians maximize...)
Item 1 would only seem useful when you have sufficient trusted expert ranking to calibrate, but still need to use the votes to extrapolate elsewhere (and where you expect trusted experts to align with your audience—if experts routinely downvote dark ales, and your audience prefers them, you’re going to get a wonky heuristic). Basically, at that point, you’re JUST using votes as a way to predict and extrapolate expert rankings, and I’d expect there are usually better heuristics for that which don’t require user votes.
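Something like the following is what I have in mind for item 1 (a rough sketch: the data are made up, the shrinkage form is the same as in the earlier sketch, and it assumes scipy is available for the Spearman rank correlation). Try a range of prior weights and keep the one whose ranking best agrees with the experts.

```python
from scipy.stats import spearmanr

def shrunk_rating(votes, prior_mean, prior_weight):
    # Same shrinkage form as the earlier sketch.
    return (prior_weight * prior_mean + sum(votes)) / (prior_weight + len(votes))

# Hypothetical inputs: votes per beer, plus an expert score for each beer.
votes_by_beer = {
    "hyped": [5, 5],                      # two enthusiastic early votes
    "solid": [4, 5] * 20,                 # many votes, consistently good
    "lager": [3] * 8 + [4, 4],
}
expert_score = {"hyped": 3.5, "solid": 4.6, "lager": 3.2}

site_mean = 3.8
beers = sorted(votes_by_beer)

best_weight, best_corr = None, -2.0
for prior_weight in (1, 2, 5, 10, 25, 50, 100):
    rubric = [shrunk_rating(votes_by_beer[b], site_mean, prior_weight) for b in beers]
    experts = [expert_score[b] for b in beers]
    corr, _ = spearmanr(rubric, experts)   # rank agreement with the experts
    if corr > best_corr:
        best_weight, best_corr = prior_weight, corr

# A light prior over-ranks "hyped"; a heavier one matches the expert ordering.
print(best_weight, best_corr)
```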
Item 2 strikes me as clever and ideal, but I’d think you’d need quite a lot of data before you could actually calibrate it, so you’re stuck using 0.05 in the meantime.
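And a correspondingly rough sketch of item 2, under the same assumptions (made-up vote histories, same shrinkage form): score each beer from only its earliest votes at various prior weights, and keep whichever weight best predicts the votes that came in later.

```python
def shrunk_rating(votes, prior_mean, prior_weight):
    # Same shrinkage form as the earlier sketch.
    return (prior_weight * prior_mean + sum(votes)) / (prior_weight + len(votes))

# Hypothetical vote histories, in arrival order.
vote_history = {
    "hyped": [5, 5, 3, 3, 4, 3],           # early votes flatter it
    "solid": [4, 5, 4, 4, 5, 4, 5, 4],
    "lager": [3, 3, 4, 3, 3, 3],
}
site_mean = 3.8

def holdout_error(prior_weight, train_fraction=0.5):
    """Mean squared error of early-vote scores against the later votes."""
    total, count = 0.0, 0
    for votes in vote_history.values():
        split = max(1, int(len(votes) * train_fraction))
        early, late = votes[:split], votes[split:]
        predicted = shrunk_rating(early, site_mean, prior_weight)
        total += sum((v - predicted) ** 2 for v in late)
        count += len(late)
    return total / count

best_weight = min((1, 2, 5, 10, 25, 50), key=holdout_error)
print(best_weight, holdout_error(best_weight))
```

The catch is exactly the one above: with only a handful of votes per beer, the held-out half is too noisy to distinguish the candidate weights, so this only starts to work once you have a reasonable amount of history.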
(Customer satisfaction surveys, etc. also run into the “resource intensive” issue)
Item 1 would only seem useful when you have sufficient trusted expert ranking to calibrate, but still need to use the votes to extrapolate elsewhere [...]
Exactly. Remember, the whole point of this procedure is to tweak how much credibility you give to voters as a function of the number of voters you have—the only reason I mention experts is that they bypass the sample size problem.
(and where you expect trusted experts to align with your audience—if experts routinely downvote dark ales, and your audience prefers them, you’re going to get a wonky heuristic)
Okay, that’s a problem. I think it falls as a subset of the earlier problem of finding trusted expert rankings, however.
Item 2 strikes me as clever and ideal, but I’d think you’d need quite a lot of data before you could actually calibrate it, so you’re stuck using 0.05 in the meantime.
If you don’t have a lot of data, you’re not going to have much to offer your users anyway.