A study that found that English has about 1.1 bits of information per letter, if you already know the message is in English. (XKCD “What If” linked to the original)

Isn’t that the information density for sentences? With all the conjunctions, and with the limited number of different words that can appear at each position in a sentence, it’s not that surprising we only get 1.1 bits per letter. But names should be more information dense: maybe not the full 4.7 bits (because some names just don’t make sense), but at least 2 bits per letter, maybe even 3?

I don’t know where to find (or how to handle) a big list of full names, so I’m settling for the (probably partial) lists of first names from https://www.galbithink.org/names/us200.htm (picked because the plaintext format is easy to process). I wrote a small script: https://gist.github.com/idanarye/fb75e5f813ddbff7d664204607c20321
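For reference, here is a minimal sketch of the kind of calculation that produces the numbers below (made-up counts, and not the actual script from the gist): a name’s worth in bits is its self-information, -log2 of its frequency, and the per-letter density is the distribution’s entropy divided by the expected name length.

    from math import log2

    # Made-up counts, purely for illustration; the real script reads the
    # frequency tables from galbithink.org.
    counts = {"Mary": 400, "Ashley": 350, "Zipporah": 1}
    total = sum(counts.values())

    for name, count in counts.items():
        p = count / total    # chance that a random person on the list has this name
        bits = -log2(p)      # self-information: the "worth X bits" figure
        print(f"{name}: 1:{total / count:.1f}, worth {bits:.2f} bits")

    # Entropy is the probability-weighted average of those per-name bits;
    # dividing by the expected name length gives bits per letter.
    entropy = sum(-(c / total) * log2(c / total) for c in counts.values())
    avg_len = sum((c / total) * len(n) for n, c in counts.items())
    print(f"{entropy / avg_len:.2f} bits per letter")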
When I run it on the list of female names from the 1990s I get this:

Any of the 5 rarest names are 1:7676.4534883720935
Bits for rarest name: 12.906224226276189
Rarest name needs to be 10 letters long
Rarest names are between 4 and 7 letters long

#1 Most frequent name is Christin, which is 8 letters long. Christin is worth 5.118397576228959 bits. Christin would need to be 4 letters long.
#2 Most frequent name is Mary, which is 4 letters long. Mary is worth 5.380839995073667 bits. Mary would need to be 5 letters long.
#3 Most frequent name is Ashley, which is 6 letters long. Ashley is worth 5.420441711983749 bits. Ashley would need to be 5 letters long.
#4 Most frequent name is Jesse, which is 5 letters long. Jesse is worth 5.4899422055346445 bits. Jesse would need to be 5 letters long.
#5 Most frequent name is Alice, which is 5 letters long. Alice is worth 5.590706018293878 bits. Alice would need to be 5 letters long.
And when I run it on the list of male names from the 1990s I get this:
Any of the 11 rarest names are 1:14261.4
Bits for rarest name: 13.799827993443198
Rarest name needs to be 11 letters long
Rarest names are between 4 and 8 letters long

#1 Most frequent name is John, which is 4 letters long. John is worth 5.004526222833823 bits. John would need to be 4 letters long.
#2 Most frequent name is Michael, which is 7 letters long. Michael is worth 5.1584658860672485 bits. Michael would need to be 4 letters long.
#3 Most frequent name is Joseph, which is 6 letters long. Joseph is worth 5.4305677416620135 bits. Joseph would need to be 5 letters long.
#4 Most frequent name is Christop, which is 8 letters long. Christop is worth 5.549228103371756 bits. Christop would need to be 5 letters long.
#5 Most frequent name is Matthew, which is 7 letters long. Matthew is worth 5.563161441124633 bits. Matthew would need to be 5 letters long.
So the information density is about 1.3 bits per letter. Higher than 1.1, but not nearly as high as I expected. But the rarest names in these lists are about 1:14k, not 1:1m like OP’s estimate. Then again, I’m only looking at given names; surnames tend to be more diverse. But that would also give them higher entropy, so instead of trying to figure out how to scale everything, let’s just go with the given names, which I have numbers for (for simplicity, assume the lists I found are complete).

So: the rare names are about half as long as the number of letters required to represent them, and the frequent names are anywhere between the number of letters required to represent them and twice that amount. I guess that is to be expected; names are not optimized to be an ideal representation, after all. But my point is that the amount of evidence needed here is not orders of magnitude bigger than the amount of information you gain from hearing the name.
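To make that concrete with the numbers above (assuming the “would need to be” figures are just the bits divided by the ~1.3 bits-per-letter density, rounded up): Christin carries about 5.12 bits, which takes ceil(5.12 / 1.3) = 4 letters at that density, yet the name itself is 8 letters long; the rarest female names carry about 12.9 bits, which would take ceil(12.9 / 1.3) = 10 letters, yet the names themselves are only 4 to 7 letters long.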
Actually, due to what entropy is supposed to represent, on average the amount of information needed is exactly the amount of information contained in the name.
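In symbols: the entropy of the name distribution is H = Σ p_i · (-log2 p_i), which is just the per-name “worth X bits” figures averaged with the probability of actually hearing each name. So if identifying someone takes evidence equal to their name’s self-information, the expected amount of evidence needed is exactly H, the average information a name carries.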