gvrq → 9.457
puppies → 8.784
cute → 7.141
creprag → 7.119
gb → 6.901
rewind → 6.305
fvatyr → 5.100
deck → 4.838
stuff → 4.816
vf → 4.739
boom → 4.221
As mentioned to other respondents, rot13 really messes with TF-IDF. I’m still not sure of the best way to deal with this.
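For concreteness, here is a minimal sketch (my own construction, not the exact pipeline used here) of the kind of per-author TF-IDF ranking that could produce a list like the one above; the tokenizer and the smoothing of the IDF term are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # crude lowercase word tokenizer; the real pipeline may differ
    return re.findall(r"[a-z']+", text.lower())

def top_tfidf_words(author_docs, corpus_docs, k=11):
    """Rank an author's words by TF-IDF against the whole corpus."""
    author_counts = Counter(w for doc in author_docs for w in tokenize(doc))
    n_docs = len(corpus_docs)
    # document frequency: how many corpus documents contain each word
    df = Counter()
    for doc in corpus_docs:
        df.update(set(tokenize(doc)))
    total = sum(author_counts.values())
    scores = {}
    for word, count in author_counts.items():
        tf = count / total
        idf = math.log((1 + n_docs) / (1 + df[word]))  # smoothed IDF
        scores[word] = tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```

Under a scheme like this, rot13'd tokens such as "gvrq" get a near-maximal IDF simply because almost no other document contains them, which is exactly why they crowd the top of the list.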
If someone uses rot13, that is highly informative. Is there any principled reason to want quoted words to show up, but not rot13? Anyhow, I think the problem with rot13 for TF-IDF is that it seems like a lower-level feature than words. In particular, it is wasteful for it to show up more than once if you're only doing a top 11.
In some sense, I think the reason the low-level feature of rot13 is mixing with the high-level feature of words is that you've jumped to the high level by fiat. Before looking at word frequency, you should look at letter frequency. With a sufficiently large corpus, rot13 should show up already there. I doubt the corpus is big enough to detect the small usage by people here, but I think it might show up in bigrams or trigrams. I don't have a concrete suggestion, but when you look at bigrams, you should use both corpus bigrams and document letter frequencies to decide which document bigrams are surprising.
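As a rough illustration of the letter-frequency step, here is a hedged sketch (my own construction, not a concrete proposal from the comment) that compares a document's letter distribution against the corpus distribution; a large divergence is a hint that something like rot13 is in play.

```python
import math
from collections import Counter

def letter_freqs(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in counts}

def letter_surprise(doc_text, corpus_freqs, alpha=1e-4):
    """KL divergence of the document's letter distribution from the corpus's.

    A document written partly in rot13 shifts mass onto letters that are
    rare in normal English (q, x, z, ...), which inflates this score.
    """
    doc = letter_freqs(doc_text)
    score = 0.0
    for c, p in doc.items():
        q = corpus_freqs.get(c, 0.0) + alpha  # smooth unseen letters
        score += p * math.log(p / q)
    return score
```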
You’ve already surmised why rot13 words are undesirable. Just to check, are you suggesting I use n-gram frequency to identify rot13 words, or replace TF-IDF with some sort of n-gram frequency metric instead?
You could use TF-IDF on n-grams. That's what I was thinking. But when I said to combine the local n-gram frequencies and the global (n+1)-gram frequencies to get a prediction of local (n+1)-gram frequencies to compare against, you might say it's too complicated to keep calling it TF-IDF.
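One way to read that suggestion, sketched under my own assumption about the combination rule: predict a document's bigram frequencies from its own letter frequencies and the corpus's conditional bigram probabilities, then flag the bigrams whose observed frequency most exceeds the prediction.

```python
import math
from collections import Counter

def bigram_counts(text):
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter(zip(letters, letters[1:]))

def surprising_bigrams(doc_text, corpus_text, alpha=1.0):
    """Score each document bigram by how far it exceeds the frequency
    predicted from document letter frequencies and corpus conditionals."""
    doc_letters = Counter(c for c in doc_text.lower() if c.isalpha())
    n_letters = sum(doc_letters.values()) or 1

    corpus_bi = bigram_counts(corpus_text)
    corpus_first = Counter()
    for (a, _), n in corpus_bi.items():
        corpus_first[a] += n

    doc_bi = bigram_counts(doc_text)
    n_bi = sum(doc_bi.values()) or 1

    scores = {}
    for (a, b), n in doc_bi.items():
        observed = n / n_bi
        # corpus conditional P(b | a), smoothed over the alphabet
        cond = (corpus_bi[(a, b)] + alpha) / (corpus_first[a] + 26 * alpha)
        predicted = (doc_letters[a] / n_letters) * cond
        scores[(a, b)] = math.log(observed / predicted)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```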
If all you want to do is recognize rot13 words, then a dictionary and/or bigram frequencies sound pretty reasonable. But don't just eliminate rot13 words from the top-11 list; also include some kind of score of how much the person uses rot13. For example, you could turn every word into 0 or 1, depending on whether it is rot13, and run TF-IDF on that. But it would be better to score each word and aggregate the scores, rather than thresholding.
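A minimal sketch of the score-and-aggregate version, under my own assumptions: score each word by how much more English-like its rot13 decoding looks under a letter-bigram model (a `bigram_logprob` table built from the corpus is assumed), then aggregate per user instead of thresholding each word at 0/1.

```python
import codecs
import math

def word_loglik(word, bigram_logprob, floor=math.log(1e-6)):
    """Log-likelihood of a word under an English letter-bigram model."""
    w = word.lower()
    return sum(bigram_logprob.get((a, b), floor) for a, b in zip(w, w[1:]))

def rot13_score(word, bigram_logprob):
    """Positive when the rot13 decoding looks more English-like than the word itself."""
    decoded = codecs.decode(word, "rot13")
    return word_loglik(decoded, bigram_logprob) - word_loglik(word, bigram_logprob)

def user_rot13_usage(words, bigram_logprob):
    """Aggregate the per-word scores into one rot13-usage score for a user."""
    scores = [rot13_score(w, bigram_logprob) for w in words if len(w) > 2]
    return sum(max(s, 0.0) for s in scores) / max(len(scores), 1)
```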
What I was suggesting was a complicated (and unspecified) approach that does not assume knowledge of rot13 ahead of time. The point is to identify strange letter frequencies and bigrams as signs of a different language, and then not treat words as significant when they are rare only because they belong to that other language. I think this would work if someone wrote 50/50 rot13, but if the individual used just a little rot13 that happened to repeat the same word a lot, it probably wouldn't work. (cf. "phyg")
There are two problems here: distinguishing individuals, and communicating to a human how the computer distinguishes them. Even if you accept that my suggestion would be a good thing for the computer to do, there's the second step of describing to the human the claim that it has identified another language that the individual is using. The computer could report unusual letter frequencies or bigrams, but that wouldn't mean much to the human. It could use the unusual frequencies to generate text, but that would be gibberish. It could find words in the corpus that score highly by the individual's bigrams and low by the corpus bigrams.
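That last idea can be sketched directly: score each real word in the corpus by the gap between its log-likelihood under the individual's letter-bigram model and under the corpus model, and show the human the words where the two models disagree most (the model details here are assumptions).

```python
import math

def loglik(word, bigram_logprob, floor=math.log(1e-6)):
    w = word.lower()
    return sum(bigram_logprob.get((a, b), floor) for a, b in zip(w, w[1:]))

def explanatory_words(corpus_vocab, user_bigrams, corpus_bigrams, k=10):
    """Corpus words that the individual's bigram model likes much more than
    the corpus model does: real words whose spelling resembles the user's
    unusual letter statistics, so a human can see what the model noticed."""
    scored = [(w, loglik(w, user_bigrams) - loglik(w, corpus_bigrams))
              for w in corpus_vocab if len(w) > 2]
    return sorted(scored, key=lambda kv: -kv[1])[:k]
```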