The use of “所以” (the colloquial ‘so’) instead of “因此” (the more formal ‘therefore’) and other tics may also indicate the use of machine-translated COT from English during training.
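(As a crude way to quantify that tic, one could count the relative frequency of the two connectives across a sample of COT transcripts. A minimal sketch below; the sample strings are made-up placeholders, not real model output.)

```python
# Rough frequency check for "translationese" connectives in COT samples.
# The texts below are placeholder strings standing in for real COT transcripts.
from collections import Counter

cot_samples = [
    "所以答案是 42。",        # "so the answer is 42" -- placeholder
    "因此我们可以得出结论。",  # "therefore we can conclude" -- placeholder
]

counts = Counter()
for text in cot_samples:
    counts["所以"] += text.count("所以")
    counts["因此"] += text.count("因此")

total = sum(counts.values()) or 1
for word, n in counts.items():
    print(f"{word}: {n} ({n / total:.0%} of connective uses)")
```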
I don’t see why there would necessarily be machine-translated inner-monologues, though.
If they are doing the synthesis or stitching-together of various inner-monologues with pivot phrases like “wait” or “no, that’s wrong”, they could simply do that with the ‘native’ Chinese versions of every problem rather than go through a risky, lossy translation pass. Chinese-language-centric LLMs are not so bad at this point that you need to juice them with translations of English corpora... are they?
Or if they are using precanned English inner-monologues from somewhere, why would they need to translate at all? You would think it would be easy for multilingual models like current LLMs, which switch back and forth between languages so easily, to train on o1-style English inner-monologues and then ‘zero-shot’ generate o1-style Chinese inner-monologues on demand. Maybe the weirdness is due to that instead: the model is being very conservative and imitating the o1-style English in Chinese as literally as possible.
I’m at ~50-50 on large amounts of machine-translated data being present in the dataset.
Having worked in Chinese academia myself, I find “use Google Translate on the dataset” exactly the kind of thing we would be likely to do. It’s a hard-to-explain gut feeling. I’ll try poking around in the tokenizer to see if “uncommon Chinese phrases that would only appear in machine-translated COT” are present as tokens. (I think this is unlikely to be true even if they did do it, however.)
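For concreteness, here’s roughly what I mean by “poking around in the tokenizer”: a minimal sketch using Hugging Face transformers. The model id and the phrase list are placeholders/assumptions, not claims about any particular tokenizer’s contents.

```python
# Sketch of the tokenizer check: do suspect "translationese" phrases get their
# own vocab entries, or show up inside existing ones?
from transformers import AutoTokenizer

MODEL_ID = "some-org/some-chinese-llm"  # hypothetical; substitute the model under discussion

# Phrases that (hypothetically) smell like machine-translated English rather
# than natively written Chinese reasoning.
SUSPECT_PHRASES = ["所以", "因此", "等一下", "让我们"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

for phrase in SUSPECT_PHRASES:
    ids = tokenizer.encode(phrase, add_special_tokens=False)
    # If a whole phrase maps to a single vocab entry, it was frequent enough in
    # the tokenizer's training corpus to earn its own token -- weak evidence,
    # but worth eyeballing.
    print(f"{phrase!r}: {len(ids)} token(s) -> {ids}")

# Complementary check: scan the vocabulary for entries that decode to strings
# containing a suspect phrase. Note that byte-level BPE tokens may not decode
# cleanly one at a time, so misses here prove little.
vocab = tokenizer.get_vocab()  # token string -> id
for tok, idx in vocab.items():
    decoded = tokenizer.convert_tokens_to_string([tok])
    if any(p in decoded for p in SUSPECT_PHRASES):
        print(f"vocab entry {idx}: {decoded!r}")
```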
I’ve done a cursory internet search, and it seems there aren’t many native Chinese COT datasets, at least compared to English ones; in fact, one of the first results on Google is an English dataset machine-translated into Chinese.
I also vaguely remember o1’s COT having better Chinese grammar, but I’m having trouble finding many examples. I think this is the easiest piece of evidence to check: if other (non-Chinese-origin) LLMs consistently use good Chinese grammar in their COT, that would shift my probabilities considerably.