Idan Arye comments on Strong Evidence is Common

Idan Arye 14 Mar 2021 18:58 UTC
7 points
Isn’t that the information density for sentences? With all the conjunctions, and with the limited number of different words that can appear in any given position of a sentence, it’s not that surprising we only get 1.1 bits per letter. But names should be more information-dense: maybe not the full 4.7 bits per letter (because some names just don’t make sense), but at least 2 bits per letter, maybe even 3?
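For reference, the 4.7 figure is what you get if each of the 26 letters is equally likely and independent of its neighbours, which is an idealised assumption rather than anything measured. A quick check:

```python
import math

# Upper bound on bits per letter: 26 equally likely, independent letters
# (an idealised assumption; real text and real names fall well below this).
bits_per_letter_uniform = math.log2(26)
print(f"{bits_per_letter_uniform:.2f} bits per letter")
```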
I don’t know where to find (or how to handle) a big list of full names, so I’m settling for the (probably partial) lists of first names from https://www.galbithink.org/names/us200.htm (picked because the plaintext format is easy to process). I wrote a small script: https://gist.github.com/idanarye/fb75e5f813ddbff7d664204607c20321
When I run it on the list of female names from the 1990s I get this:
$ ./names_entropy.py https://www.galbithink.org/names/s1990f.txt
Entropy per letter: 1.299113499617074
Any of the 5 rarest name are 1:7676.4534883720935
Bits for rarest name: 12.906224226276189
Rarest name needs to be 10 letters long
Rarest names are between 4 and 7 letters long
#1 Most frequent name is Christin, which is 8 letters long
Christin is worth 5.118397576228959 bits
Christin would needs to be 4 letters long
#2 Most frequent name is Mary, which is 4 letters long
Mary is worth 5.380839995073667 bits
Mary would needs to be 5 letters long
#3 Most frequent name is Ashley, which is 6 letters long
Ashley is worth 5.420441711983749 bits
Ashley would needs to be 5 letters long
#4 Most frequent name is Jesse, which is 5 letters long
Jesse is worth 5.4899422055346445 bits
Jesse would needs to be 5 letters long
#5 Most frequent name is Alice, which is 5 letters long
Alice is worth 5.590706018293878 bits
Alice would needs to be 5 letters long
And when I run it on the list of male names from the 1990s I get this:
$ ./names_entropy.py https://www.galbithink.org/names/s1990m.txt
Entropy per letter: 1.3429318549784128
Any of the 11 rarest name are 1:14261.4
Bits for rarest name: 13.799827993443198
Rarest name needs to be 11 letters long
Rarest names are between 4 and 8 letters long
#1 Most frequent name is John, which is 4 letters long
John is worth 5.004526222833823 bits
John would needs to be 4 letters long
#2 Most frequent name is Michael, which is 7 letters long
Michael is worth 5.1584658860672485 bits
Michael would needs to be 4 letters long
#3 Most frequent name is Joseph, which is 6 letters long
Joseph is worth 5.4305677416620135 bits
Joseph would needs to be 5 letters long
#4 Most frequent name is Christop, which is 8 letters long
Christop is worth 5.549228103371756 bits
Christop would needs to be 5 letters long
#5 Most frequent name is Matthew, which is 7 letters long
Matthew is worth 5.563161441124633 bits
Matthew would needs to be 5 letters long
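The "bits" and "needs to be N letters long" lines above are consistent with taking log2 of the odds and then dividing by the entropy per letter, rounded up. This is inferred from the printed numbers, not copied from the linked gist, and `letters_needed` is a name made up for this sketch:

```python
import math

def letters_needed(bits, entropy_per_letter):
    """Smallest whole number of letters that can carry `bits` at the given density."""
    return math.ceil(bits / entropy_per_letter)

# Values copied from the female-names output above.
entropy_f = 1.299113499617074
rarest_bits_f = math.log2(7676.4534883720935)  # odds of 1:N are worth log2(N) bits
print(rarest_bits_f, letters_needed(rarest_bits_f, entropy_f))
print("Christin:", letters_needed(5.118397576228959, entropy_f))
print("Mary:", letters_needed(5.380839995073667, entropy_f))

# Values copied from the male-names output above.
entropy_m = 1.3429318549784128
rarest_bits_m = math.log2(14261.4)
print(rarest_bits_m, letters_needed(rarest_bits_m, entropy_m))
print("John:", letters_needed(5.004526222833823, entropy_m))
```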
So the information density is about 1.3 bits per letter. Higher than 1.1, but not nearly as high as I expected. But the rarest names in these lists are only about 1:7.7k (female) to 1:14k (male), not 1:1m like OP’s estimate. Then again, I’m only looking at given names, and surnames tend to be more diverse. But that would also give them higher entropy, so instead of figuring out how to scale everything, let’s just go with the given names, which I have numbers for (for simplicity, assume the lists I found are complete).
So the rare names are about half as long as the number of letters required to represent them. The frequent names are anywhere between the number of letters required to represent them and twice that amount. I guess that is to be expected, since names are not optimized to be an ideal representation, after all. But my point is that the amount of evidence needed here is not orders of magnitude bigger than the amount of information you gain from hearing the name.
Actually, because of what entropy is supposed to represent (the average information content of a draw from the distribution), on average the amount of information needed is exactly the amount of information contained in the name.
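To make that last point concrete, here is a toy sketch. The names and counts are made up for illustration and do not come from the lists above. Each name carries -log2(p) bits, and the entropy is just the probability-weighted average of those values:

```python
import math

# Hypothetical name counts, for illustration only (not from the galbithink lists).
counts = {"Ann": 8, "Beth": 4, "Cleo": 2, "Dora": 1, "Edna": 1}
total = sum(counts.values())

def surprisal_bits(name):
    """Bits of information gained from hearing this name: -log2(p)."""
    return -math.log2(counts[name] / total)

# Entropy is the probability-weighted average of the per-name surprisals.
entropy = sum((c / total) * surprisal_bits(n) for n, c in counts.items())

for name in counts:
    print(f"{name}: {surprisal_bits(name):.2f} bits")
print(f"Entropy (average bits per name): {entropy:.3f}")
```

A frequent name like the hypothetical "Ann" carries less than the average and a rare one carries more, but weighted by how often each is heard, the average comes out to the entropy by definition.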