Curious to hear mine.
intelligence → 17.119
machine → 15.353
environments → 15.052
reference → 13.546
machines → 12.304
views → 12.253
legg → 12.252
friedman → 11.417
papers → 10.792
we → 10.536
exercises → 9.532
Your contribution to the corpus amounts to ~47kB of plain text. For reference, Eliezer is ~190kB and gwern is ~515kB. The scores are unadjusted for document size and not amazingly meaningful outside of this specific context.
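To make "unadjusted for document size" concrete, here is a minimal sketch of raw-count TF-IDF with one "document" per author. The tokenizer, the `log(N / df)` weighting, and the toy corpus are all assumptions for illustration; the thread does not say which exact variant produced the scores above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(docs):
    """Return {doc_name: {term: score}} using raw term counts times log(N / df).

    Raw counts mean a longer document gets proportionally larger scores,
    so values are not comparable across documents of different sizes.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(counts)
    df = Counter()  # number of documents each term appears in
    for c in counts.values():
        df.update(c.keys())
    return {
        name: {t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
        for name, c in counts.items()
    }

def top_terms(scores, k=3):
    """Highest-scoring terms first, ties broken alphabetically."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical toy corpus: everything one author posted is one document.
corpus = {
    "author_a": "agi takeoff agi intelligence the the the",
    "author_b": "statistics the the dataset statistics statistics",
    "author_c": "the deer the forest",
}
scores = tf_idf(corpus)
print(top_terms(scores["author_a"]))
```

A word that every author uses (here "the") scores zero, while a word concentrated in one author's text rises to the top. Doubling an author's text doubles every one of their scores, which is why the raw numbers say little outside this particular corpus.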
Huh, that seems different from what I’d have expected—but then again, I’m not sure of what I would have expected. Thanks.
I’ve just fixed a bug in my scraper that was causing it to abandon 25% of the corpus. This has ended up tripling your contribution. Some new values for you:
agi → 37.328
intelligence → 22.367
moral → 21.010
agis → 20.087
eea → 18.647
takeoff → 17.500
credences → 17.108
machine → 16.902
our → 16.222
environments → 15.919
deer → 15.761
This retains a similar “flavour” to the previous set (AGI and ev-psych). The best way I’ve found to interpret it is “what sort of words describe what I use Less Wrong to talk about?”
As an interesting side-note, rot13 really messes with TF-IDF.
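One plausible reason, sketched below under assumed toy data: rot13 turns very common words into strings that almost nobody else in the corpus uses, so they pick up a high inverse document frequency. The `idf` helper and the example sentences are hypothetical, not taken from the actual scraper.

```python
import codecs
import math

def idf(term, docs):
    """log(N / df) over a list of token lists; assumes the term occurs somewhere."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

plain_docs = [
    "the ending of the story is that the butler did it".split(),
    "the story is long".split(),
    "the butler did the dishes".split(),
]

# A spoiler hidden with rot13: "the" becomes "gur", "butler" becomes "ohgyre".
spoiler = codecs.encode("the butler did it", "rot_13").split()
docs = plain_docs + [spoiler]

print(spoiler)
print(idf("the", docs), idf("gur", docs))
```

In this toy corpus "the" appears in three of four documents and is nearly weightless, while its rot13 form "gur" appears in only one and gets the maximum possible weight.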
Okay, that feels like it makes more sense. I’m a little confused about the “deer”, though.
Blame this comment.
Hah, okay.
You’re not distinguishing original from quoted text, then?
It’s not obvious to me that I should. TF-IDF is about identifying key terms in a document. Quoted text counts towards that.
That depends on what “the document” is. Everything appearing in a posting by a given author, or all of the text written by a given author?
“The document” is my wild sample that I’ve gone out and caught. TF-IDF tells me what it’s broadly about. For this purpose, quoted text provides useful information.
If I want to infer personal facts about the author (beyond “what are the key terms in the posts they write”), it would make sense to weight original text higher than quoted text, but it would also make sense to use something other than TF-IDF for that purpose.
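As a rough illustration of weighting original text higher than quoted text, the sketch below scales down term counts on quoted lines. It assumes quotes are marked with a leading ">" as in Markdown, and the `quote_weight` value of 0.25 is an arbitrary placeholder; neither detail comes from the scraper discussed here.

```python
import re
from collections import Counter

def weighted_counts(comment, quote_weight=0.25):
    """Count terms, scaling those on '>'-quoted lines by quote_weight."""
    counts = Counter()
    for line in comment.splitlines():
        stripped = line.lstrip()
        # Assumption: a line starting with '>' is quoted, anything else is original.
        weight = quote_weight if stripped.startswith(">") else 1.0
        for token in re.findall(r"[a-z']+", stripped.lower()):
            counts[token] += weight
    return counts

comment = "> deer deer deer in the forest\nI mostly write about agi and takeoff"
counts = weighted_counts(comment)
print(counts["deer"], counts["agi"])
```

Setting `quote_weight=1.0` recovers the behaviour described above, where quoted text counts fully towards the document's key terms. The weighted counts could feed TF-IDF or any other scoring method.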