I scraped the last few hundred pages of comments on Main and Discussion, and made a simple application for pulling the highest TF-IDF-scoring words for any given user.
I’ll provide these values for the first ten respondents who want them. [Edit: that’s ten]
EDIT: some meta-information—the corpus comprises 23.8 MB, and spans the past 400 comment pages on Main and Discussion (around six months and two and a half months respectively). The most prolific contributor is gwern with ~780kB. Eliezer clocks in at ~280kB.
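For the curious, the scoring step amounts to something like the following (a minimal sketch in Python; my actual scorer is in Java, and the tokenizer and exact IDF formula here are illustrative rather than exactly what I ran):

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def top_tfidf(documents, user, n=11):
    # documents: dict mapping username -> that user's concatenated comments.
    tokenized = {u: tokenize(t) for u, t in documents.items()}
    num_docs = len(tokenized)
    # Document frequency: how many users' documents contain each word.
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))
    tf = Counter(tokenized[user])
    scores = {w: c * math.log(num_docs / df[w]) for w, c in tf.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]
```

The lists below are just the first eleven entries of that sorted output.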
What about for the site overall?
This was my eventual plan, but I haven’t settled on a general corpus to compare it to yet.
Can you comment on your methodology—tools, wget scripts or what?
Scraping is done with python and lxml, and the scoring is done in Java. It came about as I needed to brush up on my Java for work, and was looking for an extensible project.
I also didn’t push it to my personal repo, so all requests will have to wait until I’m back at work.
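In outline, the scraping step looks like this (a sketch only; the HTTP client, URL handling, and CSS selectors below are placeholders rather than the real ones):

```python
import lxml.html
import requests  # assumption: any HTTP client would do here

def scrape_comment_page(url):
    # Fetch one comment page and return (author, comment text) pairs.
    tree = lxml.html.fromstring(requests.get(url).content)
    results = []
    for node in tree.cssselect("div.comment"):            # placeholder selector
        author = node.cssselect("a.author")[0].text_content()
        body = node.cssselect("div.comment-body")[0].text_content()
        results.append((author.strip(), body.strip()))
    return results
```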
Yes please. I have no idea what they will look like.
suffering → 25.000
god → 24.508
does → 24.383
causal → 21.584
np → 21.259
utility → 20.470
agi → 20.470
who → 20.169
pill → 19.353
bayesian → 18.965
u1 → 17.567
The word ‘who’ seems to come up a lot for contributors at the more prolific end of the scale. I don’t have a satisfactory explanation for why this should be the case. Your contribution comprises ~170kB of plain text.
If I’m counting the replies correctly, nine respondents have requested them so far. I’d like my word values. Thank you!
political → 28.733
power → 27.093
moldbug → 26.135
structural → 24.192
he → 24.082
reactionary → 23.480
blog → 21.973
good → 21.373
social → 20.470
his → 20.470
very → 20.169
Your contribution is ~167kB.
May I have mine? Thanks.
moral → 35.017
thread → 34.250
bob → 25.163
preferences → 24.383
eu → 23.739
column → 23.537
matrix → 23.419
mugging → 22.367
pascals → 21.479
lord → 19.515
eg → 19.266
Your contribution to the corpus is ~100kB.
An alternative would be to ask people for donations to Against Malaria Foundation or your preferred charity.
I’d like mine, please.
gvrq → 9.457
puppies → 8.784
cute → 7.141
creprag → 7.119
gb → 6.901
rewind → 6.305
fvatyr → 5.100
deck → 4.838
stuff → 4.816
vf → 4.739
boom → 4.221
As mentioned to other respondents, rot13 really messes with TF-IDF. I’m still not sure of the best way to deal with this.
If someone uses rot13, that is highly informative. Is there any principled reason to be happy with quoted words showing up, but not with rot13? Anyhow, I think the problem with rot13 for TF-IDF is that it seems like a lower-level feature than words. In particular, it is wasteful for it to show up more than once if you’re only doing a top 11.
In some sense, I think the reason that the low-level feature of rot13 is mixing with the high-level feature of words is that you’ve jumped to the high level by fiat. Before looking at word frequency, you should look at letter frequency. With a sufficiently large corpus, rot13 should show up already there. I doubt that the corpus is big enough to detect the small usage by people here, but I think it might show up in bigrams or trigrams. I don’t have a concrete suggestion, but when you look at bigrams, you should use both corpus bigrams and document letter frequencies to decide which document bigrams are surprising.
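Concretely, the kind of check I have in mind might look like this (a rough sketch; it compares against raw corpus bigram probabilities, whereas the fuller version would also correct for the document's own letter frequencies):

```python
import math
from collections import Counter

def char_bigrams(text):
    letters = "".join(c for c in text.lower() if c.isalpha())
    return Counter(letters[i:i + 2] for i in range(len(letters) - 1))

def surprising_bigrams(doc_text, corpus_text, n=10):
    # Bigrams far more common in the document than the corpus predicts.
    doc, corpus = char_bigrams(doc_text), char_bigrams(corpus_text)
    doc_total, corpus_total = sum(doc.values()), sum(corpus.values())
    scores = {}
    for bg, count in doc.items():
        p_doc = count / doc_total
        # Add-one smoothing so corpus-unseen bigrams don't blow up.
        p_corpus = (corpus[bg] + 1) / (corpus_total + 26 * 26)
        scores[bg] = count * math.log(p_doc / p_corpus)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]
```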
You’ve already surmised why rot13 words are undesirable. Just to check, are you suggesting I use n-gram frequency to identify rot13 words, or replace TF-IDF with some sort of n-gram frequency metric instead?
You could use TF-IDF on n-grams. That’s what I was thinking. But when I said to combine the local n-gram frequencies and the global (n+1)-gram frequencies to get a prediction of local (n+1)-gram frequencies to compare against, you might say it’s too complicated to continue calling it TF-IDF.
If all you want to do is recognize rot13 words, then a dictionary and/or bigram frequencies sound pretty reasonable. But don’t just eliminate rot13 words from the top-11 list; also include some kind of score of how much people use rot13. For example, you could turn every word into a 0 or 1, depending on whether it is rot13, and run TF-IDF on that. But it would be better to score each word and aggregate the scores, rather than thresholding.
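As a sketch of that 0/1 version (the dictionary path is a placeholder; any English wordlist works):

```python
import codecs

def load_dictionary(path="/usr/share/dict/words"):
    # Placeholder path; substitute whatever wordlist you have.
    with open(path) as f:
        return {line.strip().lower() for line in f}

def rot13_fraction(words, dictionary):
    # A word "looks like rot13" if it isn't in the dictionary but its
    # rot13 decoding is. Aggregating the 0/1 flags gives a usage score.
    flags = [w not in dictionary and codecs.decode(w, "rot13") in dictionary
             for w in words]
    return sum(flags) / len(flags) if flags else 0.0
```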
What I was suggesting was a complicated (and unspecified) approach that does not assume knowledge of rot13 ahead of time. The point is to identify strange letter frequencies and bigrams as signs of a different language, and then not treat words as significant when they are rare merely because they are part of the other language. I think this would work if someone wrote 50/50 rot13, but if the individual used just a little rot13 that happened to repeat the same word a lot, it probably wouldn’t work. (cf. “phyg”)
There are two problems here: distinguishing individuals, and communicating to a human how the computer distinguishes them. Even if you accept that my suggestion would be a good thing for the computer to do, there’s the second step of conveying to the human the claim that it has identified another language that the individual is using. The computer could report unusual letter frequencies or bigrams, but that wouldn’t mean much to the human. It could use the unusual frequencies to generate text, but that would be gibberish. It could find words in the corpus that score highly by the individual’s bigrams and low by the corpus bigrams.
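In sketch form, that last idea might be (function and argument names here are hypothetical; the bigram counts come from whatever models you fit):

```python
import math

def bigram_logprob(word, bigram_counts, total):
    # Log-probability of a word under a character-bigram model
    # (add-one smoothing over a 26x26 bigram space).
    w = word.lower()
    return sum(math.log((bigram_counts.get(w[i:i + 2], 0) + 1)
                        / (total + 26 * 26))
               for i in range(len(w) - 1))

def exemplar_words(vocab, user_bigrams, corpus_bigrams, n=10):
    # Corpus words that the user's bigram model likes and the corpus
    # model doesn't: candidates to show a human the "other language".
    u_total = sum(user_bigrams.values())
    c_total = sum(corpus_bigrams.values())
    scored = sorted(((bigram_logprob(w, user_bigrams, u_total)
                      - bigram_logprob(w, corpus_bigrams, c_total), w)
                     for w in vocab if len(w) > 1),
                    reverse=True)
    return [w for _, w in scored[:n]]
```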
mine, please.
sats → 22.952
htt → 22.810
sat → 22.157
princeton → 21.356
mathematicians → 17.903
crack → 16.812
harvard → 16.661
delete → 16.563
proofs → 15.745
graph → 15.565
regressions → 15.301
Your contribution to the corpus comprises ~77kB of plain text.
I’d like mine, please!
because → 41.241
p → 38.129
should → 34.016
sat → 33.974
much → 33.113
cholesterol → 33.056
evidence → 32.444
iq → 32.092
comments → 31.454
scores → 30.690
clear → 28.899
Your contribution comprises ~284kB of plain text, and is the thirteenth-largest in the corpus.
Thanks!
Interestingly, the only one of those that I recognize as clearly one of my verbal quirks is “clear,” which I use a lot in “it’s not clear to me that …”, but it barely made it onto the list. I participate in most of the discussions on intelligence testing, so it’s no surprise that “sat,” “iq,” and “scores” are high. “Cholesterol” seems likely to be an artifact from a single detailed conversation about it, and then apparently I like words like “because,” “should,” and “much” more than normal, which is not that surprising given my general verbosity. I know I use the word “evidence” more than the general population, but am surprised I use it that much more than LW, and “comments” is unclear. Probably meta-discussion?
Most occurrences of “comments” seem to be in the context of moderator actions. There are 44 occurrences in your contribution to the corpus, which is around 50,000 words.
As for “evidence”, there are 70 occurrences in 50,000 words. So on average, every 715th word you say in comments is “evidence”.
Ooh, go on then.
phd → 34.505
teleology → 25.661
maitzens → 20.402
neutron → 19.191
fusion → 17.502
causal → 17.267
argument → 16.222
turtle → 16.137
greenhouse → 15.736
p1 → 15.353
might → 15.353
Your contribution comprises ~116kB.
Haha, I should’ve foreseen “maitzens”, “causal”, “argument” & “turtle” showing up there. (I’m lucky your corpus didn’t go back far enough to capture this never-ending back-and-forth, otherwise my top 10 would probably be nothing but “HIV”, “AIDS”, “cases”, “CDC”, “Duesberg”, “CD4”, and such.) Thanks for running the numbers.
Sure, why not? Thanks!
x → 98.136
confidence → 87.600
value → 66.797
agree → 65.843
endorse → 63.750
ok → 60.507
said → 59.640
evidence → 54.869
say → 54.185
bamboozled → 53.497
values → 53.122
Your contribution comprises ~420kB of plain text, and is the fifth-largest in the corpus.
Cool! This (judging the relevance of words in documents in a corpus and analogous problems) is a subject I muse about sometimes. Thanks for introducing me to TF-IDF.
I’d like my top scoring words please.
comte → 17.852
m1 → 12.664
grumble → 9.813
altruism → 8.787
rotating → 8.442
olive → 8.150
comtes → 8.025
m → 7.383
workshop → 7.157
egoistic → 6.916
happiness → 6.475
Your contribution comprises ~21kB of plain text.
Curious to hear mine.
intelligence → 17.119
machine → 15.353
environments → 15.052
reference → 13.546
machines → 12.304
views → 12.253
legg → 12.252
friedman → 11.417
papers → 10.792
we → 10.536
exercises → 9.532
Your contribution to the corpus amounts to ~47kB of plain text. For reference, Eliezer is ~190kB and gwern is ~515kB. The scores are unadjusted for document size, and not amazingly meaningful outside of this specific context.
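If I wanted scores that were comparable across users, the natural tweak would be to use relative rather than raw term frequency. Something like this (a sketch of the adjustment, not what the current code does):

```python
import math

def normalized_tfidf(term_count, doc_length, num_docs, doc_freq):
    # As in the original formula, but with term frequency as a fraction
    # of the document, so scores are comparable across document sizes.
    return (term_count / doc_length) * math.log(num_docs / doc_freq)
```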
Huh, that seems different from what I’d have expected—but then again, I’m not sure of what I would have expected. Thanks.
I’ve just fixed a bug in my scraper that was causing it to abandon 25% of the corpus. This has ended up tripling your contribution. Some new values for you:
agi → 37.328
intelligence → 22.367
moral → 21.010
agis → 20.087
eea → 18.647
takeoff → 17.500
credences → 17.108
machine → 16.902
our → 16.222
environments → 15.919
deer → 15.761
This retains a similar “flavour” to the previous set (AGI and ev-psych). The best way I’ve found to interpret it is “what sort of words describe what I use Less Wrong to talk about?”
As an interesting side-note, rot13 really messes with TF-IDF.
Okay, that feels like it makes more sense. I’m a little confused about the “deer”, though.
Blame this comment.
Hah, okay.
You’re not distinguishing original from quoted text, then?
It’s not obvious to me that I should. TF-IDF is about identifying key terms in a document. Quoted text counts towards that.
That depends on what “the document” is. Everything appearing in a posting by a given author, or all of the text written by a given author?
“The document” is my wild sample that I’ve gone out and caught. TF-IDF tells me what it’s broadly about. For this purpose, quoted text provides useful information.
If I want to infer personal facts about the author (beyond “what are the key terms in the posts they write”), it would make sense to weight original text higher than quoted text, but it would also make sense to use something other than TF-IDF for that purpose.