I scraped the last few hundred pages of comments on Main and Discussion, and made a simple application for pulling the highest TF-IDF-scoring words for any given user.
I’ll provide these values for the first ten respondents who want them. [Edit: that’s ten]
EDIT: some meta-information—the corpus comprises 23.8 MB, and spans the past 400 comment pages on Main and Discussion (around six months and two and a half months respectively). The most prolific contributor is gwern with ~780kB. Eliezer clocks in at ~280kB.
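For the curious, the scoring step amounts to something like the following (a minimal sketch in Python; my actual scorer is in Java, and the tokenizer and exact IDF formula here are illustrative rather than exactly what I ran):

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def top_tfidf(documents, user, n=11):
    # documents: dict mapping username -> that user's concatenated comments.
    tokenized = {u: tokenize(t) for u, t in documents.items()}
    num_docs = len(tokenized)
    # Document frequency: how many users' documents contain each word.
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))
    tf = Counter(tokenized[user])
    scores = {w: c * math.log(num_docs / df[w]) for w, c in tf.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]
```

The lists below are just the first eleven entries of that sorted output.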
What about for the site overall?
This was my eventual plan, but I haven’t settled on a general corpus to compare it to yet.
Can you comment on your methodology—tools, wget scripts or what?
Scraping is done with python and lxml, and the scoring is done in Java. It came about as I needed to brush up on my Java for work, and was looking for an extensible project.
I also didn’t push it to my personal repo, so all requests will have to wait until I’m back at work.
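In outline, the scraping step looks like this (a sketch only; the HTTP client, URL handling, and CSS selectors below are placeholders rather than the real ones):

```python
import lxml.html
import requests  # assumption: any HTTP client would do here

def scrape_comment_page(url):
    # Fetch one comment page and return (author, comment text) pairs.
    tree = lxml.html.fromstring(requests.get(url).content)
    results = []
    for node in tree.cssselect("div.comment"):            # placeholder selector
        author = node.cssselect("a.author")[0].text_content()
        body = node.cssselect("div.comment-body")[0].text_content()
        results.append((author.strip(), body.strip()))
    return results
```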
Yes please. I have no idea what they will look like.
suffering → 25.000
god → 24.508
does → 24.383
causal → 21.584
np → 21.259
utility → 20.470
agi → 20.470
who → 20.169
pill → 19.353
bayesian → 18.965
u1 → 17.567
The word ‘who’ seems to come up a lot for contributors at the more prolific end of the scale. I don’t have a satisfactory explanation for why this should be the case. Your contribution comprises ~170kB of plain text.
If I’m counting the replies correctly, nine respondents have requested them so far. I’d like my word values. Thank you!
political → 28.733
power → 27.093
moldbug → 26.135
structural → 24.192
he → 24.082
reactionary → 23.480
blog → 21.973
good → 21.373
social → 20.470
his → 20.470
very → 20.169
Your contribution is ~167kB.
May I have mine? Thanks.
moral → 35.017
thread → 34.250
bob → 25.163
preferences → 24.383
eu → 23.739
column → 23.537
matrix → 23.419
mugging → 22.367
pascals → 21.479
lord → 19.515
eg → 19.266
Your contribution to the corpus is ~100kB.
An alternative would be to ask people for donations to Against Malaria Foundation or your preferred charity.
I’d like mine, please.
gvrq → 9.457
puppies → 8.784
cute → 7.141
creprag → 7.119
gb → 6.901
rewind → 6.305
fvatyr → 5.100
deck → 4.838
stuff → 4.816
vf → 4.739
boom → 4.221
As mentioned to other respondents, rot13 really messes with TF-IDF. I’m still not sure of the best way to deal with this.
If someone uses rot13, that is highly informative. Is there any principled reason to be happy with quoted words showing up, but not with rot13? Anyhow, I think the problem with rot13 for TF-IDF is that it seems like a lower-level feature than words. In particular, it is wasteful for it to show up more than once if you’re only doing a top 11.
In some sense, I think the reason that the low-level feature of rot13 is mixing with the high-level feature of words is that you’ve jumped to the high level by fiat. Before looking at word frequency, you should look at letter frequency. With a sufficiently large corpus, rot13 should show up already there. I doubt that the corpus is big enough to detect the small usage by people here, but I think it might show up in bigrams or trigrams. I don’t have a concrete suggestion, but when you look at bigrams, you should use both corpus bigrams and document letter frequencies to decide which document bigrams are surprising.
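Concretely, the kind of check I have in mind might look like this (a rough sketch; it compares against raw corpus bigram probabilities, whereas the fuller version would also correct for the document's own letter frequencies):

```python
import math
from collections import Counter

def char_bigrams(text):
    letters = "".join(c for c in text.lower() if c.isalpha())
    return Counter(letters[i:i + 2] for i in range(len(letters) - 1))

def surprising_bigrams(doc_text, corpus_text, n=10):
    # Bigrams far more common in the document than the corpus predicts.
    doc, corpus = char_bigrams(doc_text), char_bigrams(corpus_text)
    doc_total, corpus_total = sum(doc.values()), sum(corpus.values())
    scores = {}
    for bg, count in doc.items():
        p_doc = count / doc_total
        # Add-one smoothing so corpus-unseen bigrams don't blow up.
        p_corpus = (corpus[bg] + 1) / (corpus_total + 26 * 26)
        scores[bg] = count * math.log(p_doc / p_corpus)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]
```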
You’ve already surmised why rot13 words are undesirable. Just to check, are you suggesting I use n-gram frequency to identify rot13 words, or replace TF-IDF with some sort of n-gram frequency metric instead?
You could use TF-IDF on n-grams. That’s what I was thinking. But when I said to combine the local n-gram frequencies and the global (n+1)-gram frequencies to get a prediction of local (n+1)-gram frequencies to compare against, you might say it’s too complicated to continue calling it TF-IDF.
If all you want to do is recognize rot13 words, then a dictionary and/or bigram frequencies sound pretty reasonable. But don’t just eliminate rot13 words from the top-11 list; also include some kind of score of how much people use rot13. For example, you could turn every word into a 0 or 1, depending on whether it is rot13, and run TF-IDF on that. But it would be better to score each word and aggregate the scores, rather than thresholding.
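As a sketch of that 0/1 version (the dictionary path is a placeholder; any English wordlist works):

```python
import codecs

def load_dictionary(path="/usr/share/dict/words"):
    # Placeholder path; substitute whatever wordlist you have.
    with open(path) as f:
        return {line.strip().lower() for line in f}

def rot13_fraction(words, dictionary):
    # A word "looks like rot13" if it isn't in the dictionary but its
    # rot13 decoding is. Aggregating the 0/1 flags gives a usage score.
    flags = [w not in dictionary and codecs.decode(w, "rot13") in dictionary
             for w in words]
    return sum(flags) / len(flags) if flags else 0.0
```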
What I was suggesting was a complicated (and unspecified) approach that does not assume knowledge of rot13 ahead of time. The point is to identify strange letter frequencies and bigrams as signs of a different language, and then not treat words as significant when they are rare merely because they are part of the other language. I think this would work if someone wrote 50/50 rot13, but if the individual used just a little rot13 that happened to repeat the same word a lot, it probably wouldn’t work. (cf. “phyg”)
There are two problems here: distinguishing individuals, and communicating to a human how the computer distinguishes them. Even if you accept that my suggestion would be a good thing for the computer to do, there’s the second step of conveying to the human the claim that it has identified another language that the individual is using. The computer could report unusual letter frequencies or bigrams, but that wouldn’t mean much to the human. It could use the unusual frequencies to generate text, but that would be gibberish. It could find words in the corpus that score highly by the individual’s bigrams and low by the corpus bigrams.
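In sketch form, that last idea might be (function and argument names here are hypothetical; the bigram counts come from whatever models you fit):

```python
import math

def bigram_logprob(word, bigram_counts, total):
    # Log-probability of a word under a character-bigram model
    # (add-one smoothing over a 26x26 bigram space).
    w = word.lower()
    return sum(math.log((bigram_counts.get(w[i:i + 2], 0) + 1)
                        / (total + 26 * 26))
               for i in range(len(w) - 1))

def exemplar_words(vocab, user_bigrams, corpus_bigrams, n=10):
    # Corpus words that the user's bigram model likes and the corpus
    # model doesn't: candidates to show a human the "other language".
    u_total = sum(user_bigrams.values())
    c_total = sum(corpus_bigrams.values())
    scored = sorted(((bigram_logprob(w, user_bigrams, u_total)
                      - bigram_logprob(w, corpus_bigrams, c_total), w)
                     for w in vocab if len(w) > 1),
                    reverse=True)
    return [w for _, w in scored[:n]]
```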
mine, please.
sats → 22.952
htt → 22.810
sat → 22.157
princeton → 21.356
mathematicians → 17.903
crack → 16.812
harvard → 16.661
delete → 16.563
proofs → 15.745
graph → 15.565
regressions → 15.301
Your contribution to the corpus comprises ~77kB of plain text.
I’d like mine, please!
because → 41.241
p → 38.129
should → 34.016
sat → 33.974
much → 33.113
cholesterol → 33.056
evidence → 32.444
iq → 32.092
comments → 31.454
scores → 30.690
clear → 28.899
Your contribution comprises ~284kB of plain text, and is the thirteenth-largest in the corpus.
Thanks!
Interestingly, the only one of those that I recognize as clearly one of my verbal quirks is “clear,” which I use a lot in “it’s not clear to me that …”, but it barely made it onto the list. I participate in most of the discussions on intelligence testing, so it’s no surprise that “sat,” “iq,” and “scores” are high. “Cholesterol” seems likely to be an artifact from a single detailed conversation about it, and then apparently I like words like “because,” “should,” and “much” more than normal, which is not that surprising given my general verbosity. I know I use the word “evidence” more than the general population, but am surprised I use it that much more than LW, and “comments” is unclear. Probably meta-discussion?
Most occurrences of “comments” seem to be in the context of moderator actions. There are 44 occurrences in your contribution to the corpus, which is around 50,000 words.
As for “evidence”, there are 70 occurrences in 50,000 words. So on average, every 715th word you say in comments is “evidence”.
Ooh, go on then.
phd → 34.505
teleology → 25.661
maitzens → 20.402
neutron → 19.191
fusion → 17.502
causal → 17.267
argument → 16.222
turtle → 16.137
greenhouse → 15.736
p1 → 15.353
might → 15.353
Your contribution comprises ~116kB.
Haha, I should’ve foreseen “maitzens”, “causal”, “argument” & “turtle” showing up there. (I’m lucky your corpus didn’t go back far enough to capture this never-ending back-and-forth, otherwise my top 10 would probably be nothing but “HIV”, “AIDS”, “cases”, “CDC”, “Duesberg”, “CD4”, and such.) Thanks for running the numbers.
Sure, why not? Thanks!
x → 98.136
confidence → 87.600
value → 66.797
agree → 65.843
endorse → 63.750
ok → 60.507
said → 59.640
evidence → 54.869
say → 54.185
bamboozled → 53.497
values → 53.122
Your contribution comprises ~420kB of plain text, and is the fifth-largest in the corpus.
Cool! This (judging the relevance of words in documents in a corpus and analogous problems) is a subject I muse about sometimes. Thanks for introducing me to TF-IDF.
I’d like my top scoring words please.
comte → 17.852
m1 → 12.664
grumble → 9.813
altruism → 8.787
rotating → 8.442
olive → 8.150
comtes → 8.025
m → 7.383
workshop → 7.157
egoistic → 6.916
happiness → 6.475
Your contribution comprises ~21kB of plain text.
Curious to hear mine.
intelligence → 17.119
machine → 15.353
environments → 15.052
reference → 13.546
machines → 12.304
views → 12.253
legg → 12.252
friedman → 11.417
papers → 10.792
we → 10.536
exercises → 9.532
Your contribution to the corpus amounts to ~47kB of plain text. For reference, Eliezer is ~190kB and gwern is ~515kB. The scores are unadjusted for document size, and not amazingly meaningful outside of this specific context.
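If I wanted scores that were comparable across users, the natural tweak would be to use relative rather than raw term frequency. Something like this (a sketch of the adjustment, not what the current code does):

```python
import math

def normalized_tfidf(term_count, doc_length, num_docs, doc_freq):
    # As in the original formula, but with term frequency as a fraction
    # of the document, so scores are comparable across document sizes.
    return (term_count / doc_length) * math.log(num_docs / doc_freq)
```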
Huh, that seems different from what I’d have expected—but then again, I’m not sure of what I would have expected. Thanks.
I’ve just fixed a bug in my scraper that was causing it to abandon 25% of the corpus. This has ended up tripling your contribution. Some new values for you:
agi → 37.328
intelligence → 22.367
moral → 21.010
agis → 20.087
eea → 18.647
takeoff → 17.500
credences → 17.108
machine → 16.902
our → 16.222
environments → 15.919
deer → 15.761
This retains a similar “flavour” to the previous set (AGI and ev-psych). The best way I’ve found to interpret it is “what sort of words describe what I use Less Wrong to talk about?”
As an interesting side-note, rot13 really messes with TF-IDF.
Okay, that feels like it makes more sense. I’m a little confused about the “deer”, though.
Blame this comment.
Hah, okay.
You’re not distinguishing original from quoted text, then?
It’s not obvious to me that I should. TF-IDF is about identifying key terms in a document. Quoted text counts towards that.
That depends on what “the document” is. Everything appearing in a posting by a given author, or all of the text written by a given author?
“The document” is my wild sample that I’ve gone out and caught. TF-IDF tells me what it’s broadly about. For this purpose, quoted text provides useful information.
If I want to infer personal facts about the author (beyond “what are the key terms in the posts they write”), it would make sense to weight original text higher than quoted text, but it would also make sense to use something other than TF-IDF for that purpose.