Julien Chaumond on X: “Qwen QwQ switching to Chinese when it needs to _really think_ about something, then switching back to English, is pretty cool @Alibaba_Qwen https://t.co/jpTIHWyXim”
This is extremely weird: no one actually writes like this in Chinese. “等一下” is far more common than “等待一下”, which reads like a word-for-word rendering of the English “wait a moment”; 等待 is the formal verb “to wait”, not the colloquial interjection. The use of “所以” instead of “因此” and other tics may also indicate the use of machine-translated COT from English during training.
The funniest answer would be “COT as seen in English GPT-4/o1 logs is correlated with generating quality COT. Chinese text is also correlated with highly rated COT. Therefore, using the grammar and structure of English GPT-4 COT, but with Chinese tokens, elicits the best COT”.
I don’t see why there would necessarily be machine-translated inner-monologues, though.
If they are doing the synthesis or stitching-together of various inner-monologues with pivot phrases like “wait” or “no, that’s wrong”, they could simply do that with the ‘native’ Chinese versions of every problem rather than go through a risky, lossy translation pass. Chinese-language-centric LLMs are not so bad at this point that you need to juice them with translations of English corpora, are they?
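(Purely to illustrate what “do it natively” would look like: a toy sketch of stitching self-corrections together directly in Chinese, with made-up pivot phrases and fragments rather than anything from an actual pipeline.)

```python
import random

# Illustrative pivot phrases: "wait a moment", "no, that doesn't work", "let me think again"
PIVOTS_ZH = ["等一下，", "不对，这样不行。", "再想想："]

def stitch(fragments: list[str]) -> str:
    """Join partial attempts at one problem, inserting a pivot phrase between
    them to mimic o1-style self-correction, written natively in Chinese."""
    out = [fragments[0]]
    for frag in fragments[1:]:
        out.append(random.choice(PIVOTS_ZH))
        out.append(frag)
    return "".join(out)

# Toy usage: a wrong first attempt at 2x + 4 = 10, then the correction.
print(stitch(["先假设 x = 2，代入 2x + 4 = 10，得到 8 ≠ 10。", "重新解方程，得 x = 3。"]))
```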
Or if they are using precanned English inner-monologues from somewhere, why do they need to translate at all? You would think it would be easy for multilingual models like current LLMs, which switch back and forth between languages so easily, to train on o1-style English inner-monologues and then ‘zero-shot’ generate o1-style Chinese inner-monologues on demand. Maybe the weirdness is due to that instead: it’s being very conservative and imitating the o1-style English in Chinese as literally as possible.
I’m at ~50-50 on large amounts of machine-translated COT being present in the dataset.
Having worked in Chinese academia myself, I find “just run Google Translate on the dataset” to be exactly the kind of thing we’re extremely likely to do. It’s a hard-to-explain gut feeling. I’ll try poking around in the tokenizer to see whether uncommon Chinese phrases that would only appear in machine-translated COT are present as tokens. (Though I think such phrases are unlikely to show up as dedicated tokens even if they did do it.)
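A first pass could look something like this (a minimal sketch, assuming the Hugging Face transformers tokenizer for QwQ; the checkpoint name and phrase list are only illustrative, and whole-phrase BPE merges are at best a crude proxy for training-corpus frequency):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; swap in whichever QwQ/Qwen tokenizer you want to probe.
tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

phrases = {
    "等待一下": "translationese 'wait a moment'",
    "等一下": "natural 'wait a moment'",
    "因此": "'therefore' (formal)",
    "所以": "'so' (colloquial)",
}

for phrase, gloss in phrases.items():
    ids = tok.encode(phrase, add_special_tokens=False)
    # Fewer IDs means the phrase (or most of it) was merged into a single BPE token,
    # which loosely tracks how common it was in the tokenizer-training corpus.
    print(f"{phrase} ({gloss}): {len(ids)} token(s) -> {ids}")
```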
I’ve done a cursory internet search, and it seems there aren’t many native Chinese COT datasets, at least compared to English ones; indeed, one of the first results on Google is an English COT dataset machine-translated into Chinese.
I also vaguely remember o1 using better Chinese grammar in its COT, but I’m having trouble finding many examples. I think this is the easiest piece of evidence to check: if other (non-Chinese-origin) LLMs consistently use good Chinese grammar in their COT, that would shift my probabilities considerably.