You’ve already surmised why rot13 words are undesirable. Just to check, are you suggesting I use n-gram frequency to identify rot13 words, or replace TF-IDF with some sort of n-gram frequency metric instead?
You could use TF-IDF on n-grams. That’s what I was thinking. But when I said to combine the local n-gram frequencies and the global n+1-gram frequencies to get a prediction of local n+1-gram frequencies to compare against, you might say it’s too complicated to continue calling it TF-IDF.
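To make that combination concrete, here is a rough sketch of one way to read it. This is my assumption, not something spelled out above: predict each local (n+1)-gram as the local frequency of its first n characters times the global probability of the last character given those n characters, then compare with what the individual actually wrote. The names predict_local and excess are made up for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def predict_local(local_text, global_text, n):
    """Predicted local (n+1)-gram frequencies.

    Assumed reading: P_pred(g) = P_local(first n chars of g)
                                 * P_global(last char | first n chars).
    """
    local_n = Counter(ngrams(local_text, n))
    total_local = sum(local_n.values())
    global_n1 = Counter(ngrams(global_text, n + 1))
    # How often each n-gram occurs globally as the start of an (n+1)-gram.
    global_prefix = Counter()
    for gram, count in global_n1.items():
        global_prefix[gram[:n]] += count
    predicted = {}
    for gram, count in global_n1.items():
        prefix = gram[:n]
        if local_n[prefix]:
            predicted[gram] = (local_n[prefix] / total_local) * (count / global_prefix[prefix])
    return predicted


def excess(local_text, global_text, n):
    """Observed minus predicted frequency for each local (n+1)-gram."""
    observed = Counter(ngrams(local_text, n + 1))
    total = sum(observed.values())
    predicted = predict_local(local_text, global_text, n)
    return {g: c / total - predicted.get(g, 0.0) for g, c in observed.items()}
```

Large positive values from excess mark (n+1)-grams the individual uses more than their own n-gram habits plus global transitions would explain. Those, rather than raw counts, would be the candidates for a TF-IDF-like ranking.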
If all you want to do is recognize rot13 words, then a dictionary and/or bigram frequencies sound pretty reasonable. But don’t just eliminate rot13 words from the top 11 list; also include some kind of score of how much people use rot13. For example, you could turn every word into 0 or 1, depending on whether it is rot13, and run TF-IDF on that. But it would be better to score each word and aggregate the scores, rather than thresholding.
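Here is a minimal sketch of "score each word and aggregate", assuming a character-bigram model trained on some English reference text. The word score is how much more English-like the word looks after rot13 decoding than before; the user score is just the mean. The function names, the add-one smoothing, and the toy reference sentence are all my own assumptions, and a real reference corpus would need to be far larger.

```python
import codecs
import math
from collections import Counter


def bigrams(word):
    """Character bigrams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words):
    """Bigram counts and their total, from a list of reference words."""
    counts = Counter(g for w in words for g in bigrams(w))
    return counts, sum(counts.values())


def avg_log_prob(word, model, vocab=28 * 28):
    """Mean add-one-smoothed log-probability of the word's bigrams."""
    counts, total = model
    grams = bigrams(word)
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)


def rot13_score(word, model):
    """Positive when the rot13-decoded word looks more English than the word itself."""
    decoded = codecs.encode(word, "rot_13")
    return avg_log_prob(decoded, model) - avg_log_prob(word, model)


def user_rot13_score(words, model):
    """Aggregate by averaging word scores instead of thresholding each word."""
    return sum(rot13_score(w, model) for w in words) / len(words)


# Toy reference text, for illustration only.
english = train("the cult of culture is a difficult result to consult".split())
print(rot13_score("phyg", english) > 0)  # "phyg" decodes to "cult"
```

Because decoding twice gives back the original, the score of a word is exactly the negative of the score of its rot13 image, so ordinary English pulls a user's average down and rot13 pulls it up.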
What I was suggesting was a complicated (and unspecified) approach that does not assume knowledge of rot13 ahead of time. The point is to identify strange letter frequencies and bigrams as signs of a different language, and then not to treat words as significant when they are rare only because they belong to that other language. I think this would work if someone wrote 50/50 rot13, but if the individual used just a little rot13 that happened to repeat the same word a lot, it probably wouldn’t work. (cf. “phyg”)
There are two problems here: to distinguish individuals, and to communicate to a human how the computer distinguishes them. Even if you accept that my suggestion would be a good thing for the computer to do, there’s the second step of describing to the human the claim that it has identified another language that the individual is using. The computer could report unusual letter frequencies or bigrams, but that wouldn’t mean much to the human. It could use the unusual frequencies to generate text, but that would be gibberish. It could find words in the corpus that score highly by the individual’s bigrams and low by the corpus bigrams.
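That last option is easy to sketch. Assuming simple add-one-smoothed character-bigram models (my choice, nothing above requires it), rank every corpus word by its average log-probability under the individual's bigrams minus its average log-probability under the whole corpus's bigrams, and show the human the top few. The name most_distinctive and the toy data are hypothetical.

```python
import math
from collections import Counter


def bigrams(word):
    """Character bigrams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words):
    """Bigram counts and their total, from a list of words."""
    counts = Counter(g for w in words for g in bigrams(w))
    return counts, sum(counts.values())


def avg_log_prob(word, model, vocab=28 * 28):
    """Mean add-one-smoothed log-probability of the word's bigrams."""
    counts, total = model
    grams = bigrams(word)
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)


def most_distinctive(user_words, corpus_words, k=5):
    """Corpus words that fit the user's bigrams well and the corpus's badly."""
    user_model = train(user_words)
    corpus_model = train(corpus_words)

    def score(word):
        return avg_log_prob(word, user_model) - avg_log_prob(word, corpus_model)

    return sorted(set(corpus_words), key=score, reverse=True)[:k]


# Toy data: the corpus is everyone's text, including this user's.
user = "gur phyg vf abg gur guvat".split()
corpus = "the cult is not the thing".split() + user
print(most_distinctive(user, corpus))
```

On this toy data the five words returned are the user's rot13 words, which a human can at least recognize as a pattern. It does not need to know about rot13, but it inherits the weakness mentioned above: a user who writes only a little of the other language leaves too few bigrams to stand out.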