Thanks! I’ll think about the error-checking process; certainly there is some possibility of factual errors, but most of the work is in interpreting qualifiers like “ubiquitous” and “most” and mapping them to what actually happened in the world.
One way to gauge how reliable people’s judgements are: have multiple people rate each Kurzweil prediction and see how well their ratings agree. So far LWers have committed to checking at least 200 predictions, so if everyone pulls through, at least 28 predictions will end up with more than one rating. Those overlapping ratings could then be cross-checked prediction by prediction.
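To spell out where that 28 comes from, here is a quick back-of-the-envelope in Python; the 172-prediction total is an assumption (it is the figure implied by 200 committed ratings yielding at least 28 overlaps), not something stated above:

```python
total_predictions = 172   # assumed size of the Kurzweil prediction list
committed_ratings = 200   # ratings LWers have committed to so far

# Pigeonhole argument: once every prediction has one rating, each further
# committed rating must land on a prediction that already has one.
min_multiply_rated = max(0, committed_ratings - total_predictions)
print(min_multiply_rated)  # -> 28
```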
(I won’t volunteer to rate any statements myself because (1) I’m lazy; (2) I already have a mildly negative view of Kurzweil’s predictive ability, which might make me biased; and (3) I read your earlier post and re-rated the 10 Age of Spiritual Machines predictions in that post myself, so I’ve already been primed in that respect.)
One way to gauge how reliable people’s judgements are: have multiple people rate each Kurzweil prediction and see how well their ratings agree.
This is a good idea. It’s standard operating procedure (for measures which require a rater’s judgment) to have two raters for at least some of the items, and to report the agreement rate on those items (“inter-rater reliability”). Be sure to vary which raters overlap: for example, rather than giving gwern and bsterrett the same 10 predictions, have perhaps one prediction that they both rate, another that bsterrett & Tenoke both rate, and so on. That way the agreement rate tells you something about how much agreement there is among all of the raters, not just between particular pairs of raters.
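To make the overlap scheme concrete, here is a rough Python sketch; the fourth rater name and the prediction IDs are hypothetical, and percent agreement is just the simplest reliability figure (a chance-corrected statistic such as Cohen’s kappa would be a natural refinement):

```python
from collections import defaultdict

# Rater names from the thread plus a made-up fourth; placeholder prediction IDs.
raters = ["gwern", "bsterrett", "Tenoke", "fourth_rater"]
predictions = ["P%03d" % i for i in range(1, 173)]

def assign_with_overlaps(raters, predictions, per_rater=10):
    """Give each rater a block of predictions, plus one prediction shared with
    the *next* rater, so that overlaps are spread across different pairs."""
    it = iter(predictions)
    blocks = {r: [next(it) for _ in range(per_rater)] for r in raters}
    assignments = defaultdict(list)
    for i, r in enumerate(raters):
        assignments[r].extend(blocks[r])
        neighbour = raters[(i + 1) % len(raters)]
        assignments[r].append(blocks[neighbour][0])  # the shared item
    return dict(assignments)

def percent_agreement(ratings):
    """ratings: {prediction_id: {rater: grade}}. Returns the share of
    multiply-rated predictions on which all raters gave the same grade."""
    shared = [grades for grades in ratings.values() if len(grades) >= 2]
    if not shared:
        return None
    agree = sum(1 for grades in shared if len(set(grades.values())) == 1)
    return agree / len(shared)

assignments = assign_with_overlaps(raters, predictions)
# e.g. assignments["gwern"] has 11 items, one of which bsterrett also rates.
```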
In cases where the two raters disagree, you could just have a third rater rate the item and go with that rating, or you could do something more involved, like having the two original raters discuss it and try to reach a consensus.
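If you take the third-rater route, the resolution rule is simple enough to sketch (hypothetical, and assuming categorical grades such as "True"/"False"/"Partially true"):

```python
def resolve(grades):
    """grades: the grades collected so far for one prediction, in order.
    Returns the settled grade, or None if another rating is still needed."""
    if len(grades) >= 2 and grades[0] == grades[1]:
        return grades[0]      # the first two raters agree
    if len(grades) == 2:
        return None           # disagreement: send to a third rater
    if len(grades) >= 3:
        return grades[2]      # go with the third rater's call
    return None               # fewer than two ratings so far

# resolve(["True", "False"]) -> None (needs a tiebreaker)
# resolve(["True", "False", "False"]) -> "False"
```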