I’m at ~50-50 on large amounts of machine-translated data being present in the dataset.
Having worked in Chinese academia myself, “run Google Translate on the dataset” just seems like something we’d be extremely likely to do. It’s a hard-to-explain gut feeling. I’ll try poking around in the tokenizer to see whether “uncommon Chinese phrases that would only appear in machine-translated COT” are present as tokens. (I think this is unlikely to be true even if they did do it, though.)
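The tokenizer check above could look something like this minimal sketch. It scans a token→id vocabulary for surface forms containing suspect phrases; the vocabulary here is a toy stand-in (in practice it would be loaded from the model’s tokenizer file), and the example phrases are purely illustrative, not actual markers of machine translation:

```python
# Hypothetical sketch: search a tokenizer vocabulary for Chinese phrases
# that might be characteristic of machine-translated English COT
# (e.g. overly literal renderings of English discourse markers).

# Illustrative placeholder phrases -- NOT verified translation artifacts.
SUSPECT_PHRASES = ["让我们", "总而言之"]

def find_suspect_tokens(vocab, phrases):
    """Return {phrase: [tokens containing it]} for phrases found in vocab."""
    hits = {}
    for token in vocab:
        for phrase in phrases:
            if phrase in token:
                hits.setdefault(phrase, []).append(token)
    return hits

# Toy vocabulary standing in for a real tokenizer's token->id map.
toy_vocab = {"让我们": 0, "让我们看看": 1, "你好": 2, "等等": 3}
print(find_suspect_tokens(toy_vocab, SUSPECT_PHRASES))
```

Even a hit here would only be weak evidence, since common phrases can enter a vocabulary from many sources; an absence of hits would say little either way.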
I’ve done a cursory internet search, and it seems that there aren’t many native Chinese COT datasets, at least compared to English ones—and one of the first results on Google is a machine-translated English dataset.
I also vaguely remember o1 having better Chinese grammar in its COT, but I’m having trouble finding many examples. I think this is the easiest piece of evidence to check—if other (non-Chinese-origin) LLMs consistently use good Chinese grammar in their COT, that would shift my probabilities considerably.