You’re not distinguishing original from quoted text, then?
It’s not obvious to me that I should. TF-IDF is about identifying key terms in a document. Quoted text counts towards that.
That depends on what “the document” is. Everything appearing in a posting by a given author, or all of the text written by a given author?
“The document” is my wild sample that I’ve gone out and caught. TF-IDF tells me what it’s broadly about. For this purpose, quoted text provides useful information.
If I want to infer personal facts about the author (beyond “what are the key terms in the posts they write”), it would make sense to weight original text higher than quoted text, but it would also make sense to use something other than TF-IDF for that purpose.
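To make the "weight original text higher than quoted text" idea concrete, here is a rough sketch of what that could look like if one stayed with TF-IDF anyway. Everything specific in it is an assumption for illustration: that quoted lines start with ">", the `QUOTE_WEIGHT` value of 0.5, the simple tokenizer, and the function names. Setting the weight to 1.0 gives the "quoted text counts like any other text" behaviour described above.

```python
import math
import re
from collections import Counter

# Hypothetical down-weighting factor for quoted lines (1.0 = treat quotes as ordinary text).
QUOTE_WEIGHT = 0.5


def weighted_term_counts(post, quote_weight=QUOTE_WEIGHT):
    """Count terms in one post, scaling counts on quoted lines by quote_weight.

    Assumes (for illustration only) that a quoted line starts with '>'.
    """
    counts = Counter()
    for line in post.splitlines():
        is_quoted = line.lstrip().startswith(">")
        weight = quote_weight if is_quoted else 1.0
        for token in re.findall(r"[a-z0-9']+", line.lower()):
            counts[token] += weight
    return counts


def tf_idf(posts, quote_weight=QUOTE_WEIGHT):
    """Return one {term: score} dict per post, using weighted term frequencies."""
    term_counts = [weighted_term_counts(p, quote_weight) for p in posts]
    n_docs = len(term_counts)

    # Document frequency: number of posts in which each term appears at all.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    scores = []
    for counts in term_counts:
        total = sum(counts.values()) or 1.0  # guard against empty posts
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    posts = ["> cats are great\nI prefer dogs", "dogs bark loudly"]
    for weight in (1.0, 0.5):
        print(weight, tf_idf(posts, quote_weight=weight)[0])
```

With a weight below 1.0, terms that only appear in quoted lines get a lower score relative to the author's own words, which is the direction you would want for inferring things about the author rather than about the post.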