Top quote contributors by total karma score collected:
710 RichardKennaway
674 Rain
477 Eliezer_Yudkowsky
443 anonym
355 MichaelGR
279 RobinZ
264 Yvain
191 CronoDAS
129 gwern
128 Kaj_Sotala
108 ABranco
107 Rune
99 Morendil
99 Cyan
94 Unnamed
91 billswift
90 Kutta
88 wuwei
87 roland
84 NancyLebovitz
78 Jayson_Virissimo
77 Nic_Smith
73 sketerpot
71 ata
68 Tesseract
67 James_Miller
66 Matt_Duing
65 DSimon
64 Thomas
62 Lightwave
62 djcb
58 Kazuo_Thow
57 XiXiDu
57 Vladimir_Nesov
55 Kyre
54 michaelkeenan
54 komponisto
54 Apprentice
52 cousin_it
51 gjm
While you have the software open… :-)
Top average score? (total / number of quotes)
Top people quoted?
To properly do this, you have to do named entity recognition and normalization. I just collected the most frequent capitalized words, threw away the ones recognized by my morphological analyzer, and did a small amount of manual postprocessing. Note that Bacon, Wells and Hawking are recognized by my morphological analyzer.
16 Russell
12 Nietzsche
12 Feynman
11 Pratchett
10 Einstein
9 Chesterton
9 Asimov
8 Taleb
8 Scott
8 Johnson
8 Heinlein
8 Dennett
7 Wilson
7 Voltaire
7 Dawkins
6 Thoreau
6 Rochefoucauld
6 Neumann
6 Marx
6 Gould
6 Dijkstra
6 Binmore
5 Jaynes
5 Huxley
5 Galileo
5 Egan
5 Descartes
5 Darwin
5 Buffett
5 Ayn
5 Aristotle
4 Yudkowsky
4 Wittgenstein
4 Wilde
4 Thompson
4 Suzumiya
4 Simpson
4 Schopenhauer
4 Sagan
4 Rommel
4 Rollins
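For anyone wanting to try the same heuristic, here is a minimal sketch of the idea described above: count capitalized tokens and drop the ones a lexicon already knows. The `KNOWN` set is only a hypothetical stand-in for a real morphological analyzer, and the sample attribution lines are made up rather than taken from the thread data.

```python
import re
from collections import Counter


def frequent_unknown_capitalized(texts, known_words, top_n=10):
    """Count capitalized tokens whose lowercase form is not in known_words."""
    counts = Counter()
    for text in texts:
        # A capital letter followed by lowercase letters, e.g. "Russell".
        for token in re.findall(r"\b[A-Z][a-z]+\b", text):
            if token.lower() not in known_words:
                counts[token] += 1
    return counts.most_common(top_n)


# Hypothetical stand-in for a morphological analyzer: a tiny set of known word forms.
KNOWN = {"the", "a", "it", "is", "bacon", "wells", "hawking"}

# Made-up attribution lines, not real thread data.
sample = [
    "-- Bertrand Russell",
    "-- Russell",
    "-- Richard Feynman",
    "-- Francis Bacon",
]

result = frequent_unknown_capitalized(sample, KNOWN)
print(result)
```

In this toy run "Bacon" is filtered out because the stand-in lexicon knows "bacon" as an ordinary word, which is the same failure mode noted above. First names such as "Bertrand" survive the filter, so some manual postprocessing would still be needed.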
Cool :)
Top people quoted by total karma?
You definitely used up all your wishes. :) The above list reordered by total karma collected:
158 Russell
109 Pratchett
106 Asimov
101 Dennett
100 Chesterton
82 Buffett
81 Egan
79 Nietzsche
77 Feynman
72 Voltaire
66 Scott
66 Neumann
66 Descartes
61 Heinlein
59 Dijkstra
58 Marx
57 Aristotle
52 Darwin
49 Galileo
48 Einstein
46 Taleb
46 Binmore
45 Johnson
43 Jaynes
42 Rollins
39 Sagan
34 Wilde
34 Dawkins
28 Gould
25 Wilson
25 Rochefoucauld
23 Huxley
22 Ayn
15 Simpson
13 Wittgenstein
13 Schopenhauer
12 Yudkowsky
12 Thoreau
11 Thompson
8 Suzumiya
2 Rommel
OK. Who quoted Yudkowsky? Hopefully it was quotes from elsewhere. :)
Hacker News, for one—I don’t know where the other eight points may be from.
Edit: Six more points from Methods of Rationality
Yeah, I wasn’t precise enough on that second wish. Oh well, World Peace will have to wait.
The source code is open, too. :) Anyway:
Top average score:
54 in 1: michaelkeenan
23 in 1: Vlad
22.6667 in 3: Tesseract
22 in 1: DaveInNYC
20 in 1: CSmith
19.5 in 2: knb
19 in 1: Marcello
18.8 in 5: Unnamed
18.3333 in 3: Kyre
18.25 in 4: sketerpot
18 in 1: cata
17 in 1: MarcTheEngineer
16 in 3: Hariant
16 in 1: Tyrrell_McAllister
16 in 1: CaptainOblivious2
15.5294 in 17: Yvain
15.5 in 4: Lightwave
15 in 1: teageegeepea
15 in 1: Patrick
15 in 1: Nisan
15 in 1: loqi
15 in 1: Automaton
14 in 3: MichaelHoward
14 in 3: jaimeastorga2000
14 in 1: torekp
14 in 1: sparrowsfall
14 in 1: Sniffnoy
14 in 1: Shalmanese
14 in 1: Kobayashi
14 in 1: bogus
13.5 in 4: komponisto
13.5 in 4: Apprentice
13.5 in 2: JamesAndrix
13.2857 in 21: RobinZ
13.1481 in 27: MichaelGR
13 in 2: BenAlbahari
13 in 1: KatjaGrace
13 in 1: josht
12.8571 in 7: Kutta
12.5714 in 7: wuwei
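Both the totals list and this averages list come down to grouping (poster, score) records. A minimal sketch of that grouping follows; the records and the function name `summarize` are made up for illustration, and this is not the actual open-source code mentioned above.

```python
from collections import defaultdict


def summarize(records):
    """Map each poster to (total karma, number of quotes, average score)."""
    scores_by_poster = defaultdict(list)
    for poster, score in records:
        scores_by_poster[poster].append(score)
    return {
        poster: (sum(scores), len(scores), sum(scores) / len(scores))
        for poster, scores in scores_by_poster.items()
    }


# Made-up (poster, score) records, one per quote.
records = [("alice", 10), ("alice", 20), ("bob", 54), ("carol", 0)]

summary = summarize(records)

# Sort by average score, highest first, and print in the "avg in n: name" style.
by_average = sorted(summary, key=lambda p: summary[p][2], reverse=True)
for poster in by_average:
    total, count, average = summary[poster]
    print(f"{average:g} in {count}: {poster}")
```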
This reflects particularly well on Yvain, Robin and Michael, all of whom managed to be both prolific and reliable in providing value with their quotes. I’m trying to think of a suitable metric by which I can formalise my intuitive evaluation.
I consider quotes with 0 votes to be a net negative contribution, and each one also raises the chance that other quotes by the same poster are faux-wisdom: quotes that appear deep at first glance to a casual reader but wouldn’t stand up to scrutiny by someone who is paying close attention to actual meaning. So I would rate comments posted via an ‘accuracy by volume’ approach as even worse than the average suggests, because that approach signals a greater degree of superficiality bias.
Above considerations aside, volume does provide some degree of increased value. In considering the question “Which contributor’s page should I read in order to absorb the greatest improvement in quotey wisdom?” I may be better off with “16 in 5” than “22 in 2”. On the other hand, reading a “5 in 50” page may make me net sillier as I unconsciously absorb nonsense. Perhaps the ranking I’m looking for could be something as trivial as “Sum - Count * 4”.
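To see what “Sum - Count * 4” would do, here is a quick sketch that applies it to a handful of posters. The total and count figures are the ones reported in the lists elsewhere in this thread. The penalty of 4 per quote is just the value floated above, and the function name `penalized_score` is made up for illustration.

```python
# (total karma, number of quotes) as reported in the lists in this thread.
posters = {
    "RichardKennaway": (710, 81),
    "Rain": (674, 54),
    "Eliezer_Yudkowsky": (477, 47),
    "MichaelGR": (355, 27),
    "RobinZ": (279, 21),
    "Yvain": (264, 17),
    "michaelkeenan": (54, 1),
}


def penalized_score(total, count, penalty=4):
    """Total karma minus a fixed cost per quote posted."""
    return total - count * penalty


ranking = sorted(
    posters,
    key=lambda name: penalized_score(*posters[name]),
    reverse=True,
)

for name in ranking:
    print(penalized_score(*posters[name]), name)
```

With this penalty a single 54-point quote ends up far below the prolific posters, so the metric leans much harder on volume than the plain average does.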
I think a good metric is this: Assuming we independently draw from the observed distribution of achieved karma scores, what is the probability that someone gets at least as much karma as Yvain when she posts as many quotes as Yvain? You can calculate this by iterated convolution. The assumption of total independence heavily favors Yvain, but I am fine with that.
I loaded the actual observed distribution, and calculated this score:
0.00008 (12.48 in 54): Rain
0.00066 (15.53 in 17): Yvain
0.00128 (13.15 in 27): MichaelGR
0.00174 (54.00 in 1): michaelkeenan
0.00312 (13.29 in 21): RobinZ
0.00766 (22.67 in 3): Tesseract
0.00836 (18.80 in 5): Unnamed
0.01499 (18.25 in 4): sketerpot
0.02368 (10.15 in 47): Eliezer_Yudkowsky
0.02473 (18.33 in 3): Kyre
0.03460 (19.50 in 2): knb
0.03831 (15.50 in 4): Lightwave
0.04265 (23.00 in 1): Vlad
0.04817 (16.00 in 3): Hariant
0.05266 (12.86 in 7): Kutta
0.05396 (22.00 in 1): DaveInNYC
0.06051 (12.57 in 7): wuwei
0.06789 (20.00 in 1): CSmith
0.07663 (13.50 in 4): Apprentice
0.07663 (13.50 in 4): komponisto
0.08094 (19.00 in 1): Marcello
0.08622 (14.00 in 3): jaimeastorga2000
0.08622 (14.00 in 3): MichaelHoward
0.09554 (11.38 in 8): billswift
0.10009 (18.00 in 1): cata
0.11401 (17.00 in 1): MarcTheEngineer
0.12449 (8.77 in 81): RichardKennaway
0.12763 (12.00 in 4): SilasBarta
0.13055 (16.00 in 1): CaptainOblivious2
0.13055 (16.00 in 1): Tyrrell_McAllister
0.13092 (13.50 in 2): JamesAndrix
0.13828 (12.33 in 3): Randaly
0.14534 (15.00 in 1): Automaton
0.14534 (15.00 in 1): loqi
0.14534 (15.00 in 1): Nisan
0.14534 (15.00 in 1): Patrick
0.14534 (15.00 in 1): teageegeepea
0.14695 (13.00 in 2): BenAlbahari
0.15183 (10.83 in 6): DSimon
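For readers who want to reproduce this kind of score, here is a minimal sketch of the iterated convolution, assuming independent draws from the empirical score distribution as described above. The function names and the toy score list are made up for illustration; the real input would be the full list of observed quote scores.

```python
from collections import Counter


def convolve(a, b):
    """Distribution of the sum of two independent integer-valued variables."""
    out = {}
    for score_a, prob_a in a.items():
        for score_b, prob_b in b.items():
            key = score_a + score_b
            out[key] = out.get(key, 0.0) + prob_a * prob_b
    return out


def p_value(all_scores, n_quotes, total):
    """P(sum of n_quotes independent draws from all_scores >= total)."""
    counts = Counter(all_scores)
    n = len(all_scores)
    single = {score: c / n for score, c in counts.items()}
    dist = {0: 1.0}  # distribution of an empty sum
    for _ in range(n_quotes):
        dist = convolve(dist, single)
    return sum(prob for score, prob in dist.items() if score >= total)


# Toy data, not the real thread scores: four quotes scoring 0, 1, 2 and 3.
toy_scores = [0, 1, 2, 3]

# Two draws reach a total of 5 or more in 3 of the 16 equally likely cases.
p = p_value(toy_scores, n_quotes=2, total=5)
print(p)
```

Using dictionaries keyed by score means negative scores need no special handling, at the cost of being slower than an array-based convolution.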
I don’t quite understand the methodology—how do you determine the karma distribution for each poster? And how is the list sorted?
I am afraid I don’t understand either of your questions. I work with the karma distribution only in the quotes domain. It doesn’t have to be determined, I collected all the data myself. The list is sorted by p-value.
We have the total list of quotes, with scores and posters. We know that Kutta scored 90 points from 7 quotes. Our null hypothesis is that he randomly selected 7 quotes from the total set of 1138 quotes. The p-value is the probability that he could achieve at least 90 points by this process. If his actual method yields better scores than random drawing, then the p-value will be low.
I have a very low opinion of classical frequentist statistics, but it seemed very suitable for this task. I am sure that there is already a name for this method I reinvented. Of course, the null hypothesis is ridiculous, so we shouldn’t assign much meaning to these numbers. It is just one of the many ways we can solve this ranking task.
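In symbols, this might be written as follows. The notation is introduced here only for illustration: f is the empirical distribution of the 1138 quote scores, the X_i are independent draws from f, n is the number of quotes a poster made, and s is the total karma they collected. Strictly speaking, picking 7 distinct quotes from the set is drawing without replacement, so treating the draws as independent is an approximation, though with n this small next to 1138 the difference should be small.

```latex
p(n, s) \;=\; \Pr\!\left[\sum_{i=1}^{n} X_i \ge s\right]
        \;=\; \sum_{t \ge s} f^{*n}(t),
\qquad
f^{*n} = f^{*(n-1)} * f, \quad f^{*0} = \delta_0 .
```

For Kutta, n = 7 and s = 90.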
Okay, that makes sense—the number is the probability that they could have picked up as many points as they did by picking randomly from the set of all quotes. I understand now.
That’s brilliant. I like the theory and the ranking matches about what my intuitive manual ranking would have been too.
If I were to venture a suggestion: statistical significance may be relevant to your valuation of high-average high-number posters like Yvain, MichaelGR, and myself over higher-average low-number posters like michaelkeenan. If poorly-selected quotes nevertheless have a small but significant probability of being highly ranked (but a simultaneous large probability of being low-ranked) and most quoters select poorly, someone with only one high-rated quote is not much likelier to be a good selector of quotes than not. In contrast, someone with many quotes, most of which are highly regarded, could be expected to be unusually discerning, as the probability of this result by chance is low.
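A toy Bayesian calculation can make this argument concrete. Every number below is hypothetical and chosen only for illustration: a 20% prior that a poster is a good selector, a 50% chance that a good selector's quote is highly rated, and a 10% chance for a poor selector.

```python
from math import comb


def posterior_good(prior_good, p_high_good, p_high_poor, n_quotes, n_high):
    """Posterior probability of being a good selector, given that
    n_high of n_quotes were highly rated (binomial likelihoods)."""
    def likelihood(p_high):
        return (
            comb(n_quotes, n_high)
            * p_high ** n_high
            * (1 - p_high) ** (n_quotes - n_high)
        )

    good = prior_good * likelihood(p_high_good)
    poor = (1 - prior_good) * likelihood(p_high_poor)
    return good / (good + poor)


# Hypothetical parameters, not estimated from the thread data.
PRIOR_GOOD, P_HIGH_GOOD, P_HIGH_POOR = 0.2, 0.5, 0.1

one_hit = posterior_good(PRIOR_GOOD, P_HIGH_GOOD, P_HIGH_POOR, n_quotes=1, n_high=1)
many_hits = posterior_good(PRIOR_GOOD, P_HIGH_GOOD, P_HIGH_POOR, n_quotes=10, n_high=8)

print(f"one quote, one hit: {one_hit:.3f}")
print(f"ten quotes, eight hits: {many_hits:.5f}")
```

Under these made-up numbers a single highly rated quote leaves the poster only slightly more likely than not to be a good selector, while eight hits out of ten make it close to certain.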
I garnered much more karma than I thought I did from the quotes; must be all the low-ranked ones since I don’t have all that many highly ranked quotes.