Unfortunately, it is hardly possible to answer this question empirically using data from human languages. Large text dumps of, say, English and Chinese contain a lot of “Rosetta Stone” content: bilingual documents, common expressions, translations into related third languages like Japanese, literal English-Chinese dictionaries, etc. Since LLMs require a substantial amount of training text, it is not feasible to reliably filter out all this translation content.
I don’t think this is clear. I think you might be able to train an LLM on a conlang created after the data cutoff, for instance.
As far as human languages go, I bet it works OK for big LLMs.
I don’t think this was a statement about whether it’s possible in principle, but about whether it’s actually feasible in practice. I’m not aware of any conlangs, before the cutoff date or not, that have a training corpus large enough for the LLM to be trained to the same extent that major natural languages are.
Esperanto is certainly the most widespread conlang, but (1) it is very strongly related to European languages, (2) it long predates the cutoff date for any LLM, (3) all of its training corpora that I am aware of contain a great many references to other languages and their cross-translations, and (4) its largest corpora are still less than 0.1% of the size of those available for most common natural languages.