(Random thought I had and figured this was the right place to set it down:) Given how centrally important token-based word embeddings are to the current LLM paradigm, how plausible is it that (put loosely) “doing it all in Chinese” (instead of English) is actually just plain a more powerful/less error-prone/generally better background assumption?
Associated helpful intuition pump: LLM word tokenization is like a logographic writing system, where each word corresponds to a character of the logography. There need be no particular correspondence between the form of the token and the pronunciation/“alphabetical spelling”/other things about the word, though it might have some connection to the meaning of the word—and it often makes just as little sense to be worried about the number of grass radicals in “草莓” as it does to worry about the number of r’s in a “strawberry” token.
(And yes, I am aware that in Mandarin Chinese, there’s lots of multi-character words and expressions!)
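To make the intuition pump concrete, here’s a minimal toy sketch of the point. The vocabulary, IDs, and greedy longest-match scheme below are all made up for illustration (real LLM tokenizers use learned subword vocabularies, e.g. BPE); the point is just that once a word maps to a single opaque integer, the model’s input no longer contains its characters—just as a logograph doesn’t spell out its pronunciation.

```python
# Toy illustration (hypothetical vocabulary and IDs, not a real tokenizer):
# a tokenizer maps surface strings to opaque integer IDs.
vocab = {"straw": 0, "berry": 1, "strawberry": 2, "草莓": 3}

def tokenize(text: str, vocab: dict) -> list:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return ids

# The model sees only ID sequences, not characters:
print(tokenize("strawberry", vocab))  # [2] — one opaque symbol, like a logograph
print(tokenize("草莓", vocab))        # [3] — same situation for the Chinese word
```

From the ID `2` alone there is no way to count the r’s, just as the ID `3` carries no record of its grass radicals—any such knowledge has to be learned indirectly.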