It’s not obvious that ‘uncommon’ tokens are good, or that selecting for them is a good approach.
They could also just be unlikely or garbage, and your screening method for filtering for ‘uncommon’ tokens may ensure that they are garbage, or otherwise sabotage your model. (This is the ‘mammogram screening problem’: even if you have a good filter, if you run it across trillions of tokens, you will wind up throwing out many good tokens and keeping many bad tokens. There are a number of LLM-related papers about the horrifically bad data you can wind up compiling if you neglect data cleaning, particularly in multilingual translation when you’re trying to scrape rare languages off the general Internet.)
Nor are good datapoints necessarily made up of uncommon tokens: there are zero uncommon tokens in my ‘microwave’ example.
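To make the ‘mammogram screening problem’ concrete, here is a rough base-rate sketch. All the numbers are made up for illustration (a trillion tokens, 1% of them ‘good’, a filter that is right 99% of the time in both directions), and `screen` is a hypothetical helper, not anything from a real pipeline.

```python
def screen(total, prevalence, sensitivity, specificity):
    """Base-rate arithmetic for a filter that tries to keep 'good' tokens.

    All inputs are illustrative assumptions:
      total       - number of tokens screened
      prevalence  - fraction of tokens that are actually good
      sensitivity - P(filter keeps token | token is good)
      specificity - P(filter discards token | token is bad)
    """
    good = total * prevalence
    bad = total - good
    kept_good = good * sensitivity        # true positives
    lost_good = good - kept_good          # good tokens thrown out
    kept_bad = bad * (1 - specificity)    # garbage that slips through
    precision = kept_good / (kept_good + kept_bad)
    return {
        "kept_good": kept_good,
        "lost_good": lost_good,
        "kept_bad": kept_bad,
        "precision": precision,
    }


# Hypothetical scenario: 1 trillion tokens, 1% good, a 99%/99% filter.
result = screen(total=1e12, prevalence=0.01, sensitivity=0.99, specificity=0.99)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

Under those assumed numbers, the ‘99% accurate’ filter throws away on the order of a hundred million good tokens, and roughly half of what it keeps is garbage, because the bad tokens outnumber the good ones so heavily to begin with.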
(Data pruning & active learning are hard.)