OpenAI’s GPT-4 is a Large Language Model (LLM) that can generate coherent constructed languages, or “conlangs,” which we propose be called “genlangs” when generated by Artificial Intelligence (AI). The genlangs created by ChatGPT for this research (Voxphera, Vivenzia, and Lumivoxa) each have unique features, appear facially coherent, and plausibly “translate” into English.
This study investigates whether genlangs created by ChatGPT follow Zipf’s law. Zipf’s law approximately holds across all natural and artificially constructed human languages. According to Zipf’s law, the frequency of a word in a text corpus is inversely proportional to its rank in the frequency table. This means that the most frequent word appears about twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. We hypothesize that Zipf’s law will hold for genlangs because (1) genlangs created by ChatGPT fundamentally operate in the same way as human language with respect to the semantic usefulness of certain tokens, and (2) ChatGPT has been trained on corpora of text that include many different languages, all of which exhibit Zipf’s law to varying degrees. Through statistical linguistics, we aim to understand whether LLM-based languages statistically look human. Our findings indicate that genlangs adhere closely to Zipf’s law, supporting the hypothesis that genlangs created by ChatGPT exhibit similar statistical properties to natural and artificial human languages.
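As a rough illustration of the rank-frequency relationship described above, here is a minimal Python sketch that counts words and estimates the slope of log(frequency) against log(rank). It is not the study's method: the function names `rank_frequency` and `zipf_slope` are hypothetical, the letters-only tokenizer is an assumption that may not suit a genlang with apostrophes or other word-internal symbols, and an ordinary least-squares fit on a log-log plot is only a crude check. A slope near -1 is what the classic form of Zipf's law predicts.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent.

    Assumption: a "word" is any run of Unicode letters, lowercased.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [count for _, count in Counter(words).most_common()]


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Needs at least two distinct ranks, otherwise the variance is zero.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Idealized counts with frequency exactly proportional to 1/rank.
ideal = [1200 // rank for rank in range(1, 7)]
print(zipf_slope(ideal))
```

On real text the fit is usually dominated by the long tail of rare words, so the estimate depends on corpus size and on how many ranks are included.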
If the GPT ‘genlangs’ can be translated into English reliably, and we know English definitely follows Zipf’s law (and there is also all the evidence for a ‘neuralinga’/‘interlingua’ from NMT), it seems like the genlangs would have to follow Zipf’s law too. If the genlangs are just thin skins by GPT over existing natural languages it knows, then why wouldn’t they follow Zipf’s law?
It seems like it’d be more relevant to show that genlangs which don’t have any viable translation into an existing Zipfian language follow Zipf’s law anyway.
I think that instead of considering random words as a baseline reference (Fig. 2), you should take the alphabet plus the space symbol, generate a random i.i.d. sequence of them, and then index words in that text. This won’t give a uniform distribution over words. It is total gibberish, but I expect it would follow Zipf’s law the same, based on these references I found on Wikipedia:
Wentian Li (1992), “Random Texts Exhibit Zipf’s-Law-Like Word Frequency Distribution”
V. Belevitch (1959), “On the statistical laws of linguistic distributions”
I’d also show an example of the “ChatGPT gibberish” produced.
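The random-character baseline suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `random_text_counts` is hypothetical, the alphabet is taken to be the 26 lowercase ASCII letters, every symbol (including the space) is drawn with equal probability, and the seed is fixed purely for reproducibility.

```python
import random
import string
from collections import Counter


def random_text_counts(n_chars, seed=0, alphabet=string.ascii_lowercase):
    """Draw i.i.d. symbols from alphabet + space and count the resulting words.

    Returns word counts sorted from most to least frequent.
    """
    rng = random.Random(seed)
    symbols = alphabet + " "
    # Each character is sampled independently and uniformly.
    text = "".join(rng.choices(symbols, k=n_chars))
    # Splitting on whitespace drops the empty strings between adjacent spaces.
    words = text.split()
    return [count for _, count in Counter(words).most_common()]


counts = random_text_counts(200_000)
# Print the first few (rank, count) pairs of the rank-frequency table.
for rank, count in enumerate(counts[:10], start=1):
    print(rank, count)
```

Short words are far more probable than long ones here, so the word distribution is not uniform even though the characters are. Plotting these counts against rank on log-log axes next to the genlang curves would show how much of the Zipf-like shape this kind of gibberish already reproduces.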