Instead of using random words as the baseline reference (Fig. 2), I think you should take the alphabet plus the space symbol, generate a random i.i.d. sequence of these symbols, and then index the words in the resulting text. This won’t give a uniform distribution over words, since shorter words come out more likely than longer ones. The text is total gibberish, but I expect it would follow Zipf’s law all the same, based on these references I found on Wikipedia:
Wentian Li (1992), “Random Texts Exhibit Zipf’s-Law-Like Word Frequency Distribution”
V. Belevitch (1959), “On the statistical laws of linguistic distributions”
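To make the suggested baseline concrete, here is a minimal sketch of what I have in mind. The function name, the text length, the lowercase Latin alphabet, and the seed are all placeholders of mine, not anything taken from the manuscript, and the ranks printed at the end are arbitrary.

```python
import random
from collections import Counter


def random_text_word_counts(n_chars=200_000,
                            alphabet="abcdefghijklmnopqrstuvwxyz",
                            seed=0):
    """Count the 'words' in an i.i.d. random character sequence.

    Each character is drawn uniformly from the alphabet plus the space
    symbol (an assumption; non-uniform symbol probabilities work too).
    """
    rng = random.Random(seed)
    symbols = alphabet + " "
    text = "".join(rng.choices(symbols, k=n_chars))
    # str.split() with no argument splits on runs of whitespace,
    # so consecutive spaces never produce empty words.
    return Counter(text.split())


counts = random_text_word_counts()
ranked = counts.most_common()  # (word, frequency) pairs, most frequent first

# Print frequency at a few ranks; on a log-log plot of frequency against
# rank one would then check for the roughly linear Zipf-like trend.
for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        word, freq = ranked[rank - 1]
        print(rank, repr(word), freq)
```

Plotting `ranked` as frequency against rank on log-log axes, next to the curves in Fig. 2, would show directly whether this gibberish baseline is distinguishable from the real text.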
I’d also show an example of the “ChatGPT gibberish” produced.