Kaj_Sotala comments on Open Thread, June 2-15, 2013

Kaj_Sotala 6 Jun 2013 9:04 UTC
0 points
Curious to hear mine.
- sixes_and_sevens 6 Jun 2013 9:54 UTC
  2 points
  Parent
  intelligence → 17.119
  machine → 15.353
  environments → 15.052
  reference → 13.546
  machines → 12.304
  views → 12.253
  legg → 12.252
  friedman → 11.417
  papers → 10.792
  we → 10.536
  exercises → 9.532
  
  Your contribution to the corpus amounts to ~47kB of plain text. For reference, Eliezer is ~190kB and gwern is ~515kB. The scores are unadjusted for document size and not amazingly meaningful outside of this specific context.
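
For readers wondering what "unadjusted for document size" means in practice, here is a minimal sketch. It assumes the common textbook form of TF-IDF (raw term count times log(N / document frequency)); the thread does not say which variant was actually used, and the function name, the `normalise` flag and the toy corpus are all hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(docs, normalise=False):
    """Score each term in each document as tf * idf.

    docs: dict mapping an author name to a list of tokens (hypothetical layout).
    normalise: if True, divide raw counts by document length.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        length = len(tokens) if normalise else 1
        scores[name] = {
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

# Toy corpus (made up): "long" is "short" repeated three times.
docs = {
    "short": "agi takeoff agi".split(),
    "long": "agi takeoff agi".split() * 3,
    "other": "deer moral".split(),
}
raw = tfidf_scores(docs)
adjusted = tfidf_scores(docs, normalise=True)
```

With raw counts, the longer document scores three times higher on the same term even though it is "about" exactly the same thing; dividing by document length removes that effect. This is one reason scores like the ones above are hard to compare across authors with very different amounts of text.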
  - Kaj_Sotala 6 Jun 2013 10:43 UTC
    0 points
    Parent
    Huh, that seems different from what I’d have expected—but then again, I’m not sure of what I would have expected. Thanks.
    - sixes_and_sevens 6 Jun 2013 10:59 UTC
      4 points
      Parent
      I’ve just fixed a bug in my scraper that was causing it to abandon 25% of the corpus. This has ended up tripling your contribution. Some new values for you:
      
      agi → 37.328
      intelligence → 22.367
      moral → 21.010
      agis → 20.087
      eea → 18.647
      takeoff → 17.500
      credences → 17.108
      machine → 16.902
      our → 16.222
      environments → 15.919
      deer → 15.761
      
      This retains a similar “flavour” to the previous set (AGI and ev-psych). The best way I’ve found to interpret it is “what sort of words describe what I use Less Wrong to talk about?”
      
      As an interesting side-note, rot13 really messes with TF-IDF.
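
The comment does not say why rot13 causes trouble. One plausible reason, offered here as an assumption, is that rot13 turns very common words into tokens that appear in only a handful of documents, so they receive a high inverse document frequency. A small sketch with a made-up corpus and a hypothetical `idf` helper:

```python
import codecs
import math

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing term)."""
    containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / containing)

# Toy corpus (made up): every comment uses "the"; one hides its text in rot13.
plain = ["the agi takeoff", "the moral deer", "the machine"]
spoiler = codecs.encode("the ending", "rot_13")  # "gur raqvat"
docs = [text.split() for text in plain + [spoiler]]

common = idf("the", docs)   # "the" appears in 3 of 4 documents
encoded = idf("gur", docs)  # "gur" appears in 1 of 4 documents
```

Here `encoded` is larger than `common`, so an author who posts a lot of rot13 spoilers could end up with "gur" ranked as one of their key terms.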
      - Kaj_Sotala 6 Jun 2013 12:24 UTC
        4 points
        Parent
        Okay, that feels like it makes more sense. I’m a little confused about the “deer”, though.
        sixes_and_sevens 6 Jun 2013 12:28 UTC
        6 points
        Parent
        Blame this comment.
        Kaj_Sotala 6 Jun 2013 14:35 UTC
        2 points
        Parent
        Hah, okay.
        Richard_Kennaway 6 Jun 2013 14:35 UTC
        0 points
        Parent
        You’re not distinguishing original from quoted text, then?
        sixes_and_sevens 6 Jun 2013 15:37 UTC
        0 points
        Parent
        It’s not obvious to me that I should. TF-IDF is about identifying key terms in a document. Quoted text counts towards that.
        Richard_Kennaway 6 Jun 2013 16:14 UTC
        0 points
        Parent
        
        > TF-IDF is about identifying key terms in a document. Quoted text counts towards that.
        
        That depends on what “the document” is. Everything appearing in a posting by a given author, or all of the text written by a given author?
        sixes_and_sevens 6 Jun 2013 16:48 UTC
        0 points
        Parent
        “The document” is my wild sample that I’ve gone out and caught. TF-IDF tells me what it’s broadly about. For this purpose, quoted text provides useful information.
        
        If I want to infer personal facts about the author (beyond “what are the key terms in the posts they write”), it would make sense to weight original text higher than quoted text, but it would also make sense to use something other than TF-IDF for that purpose.
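
The weighting idea in the last comment could be sketched roughly as follows. This only illustrates down-weighting quoted text when counting terms; it is not what the scraper in this thread does. It assumes quoted lines can be recognised by a leading `>` (Markdown style), and the function name and the `quote_weight` value are hypothetical.

```python
from collections import Counter

def weighted_term_counts(comment, quote_weight=0.25):
    """Count terms, giving quoted lines (those starting with '>') a reduced weight."""
    counts = Counter()
    for line in comment.splitlines():
        stripped = line.strip()
        is_quote = stripped.startswith(">")
        weight = quote_weight if is_quote else 1.0
        # Drop the quote marker before tokenising.
        for token in stripped.lstrip("> ").lower().split():
            counts[token] += weight
    return counts

comment = "> deer deer deer\nI mostly write about agi"
counts = weighted_term_counts(comment)
```

The weighted counts can then stand in for raw term frequencies in whatever scoring is applied afterwards. Setting `quote_weight` to 1.0 recovers the behaviour defended in the thread, where quoted text counts in full.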