Excellent introduction. Your examples were all very intuitive.
For those who are reading, one way to get an intuition for the difference between binary strings and bits is to look at data compression. To begin with, it’s easy to create a code like ASCII, where every character is represented by a binary string of length 8 (usually referred to as 8 “bits” or one byte), allowing up to $2^8 = 256$ unique characters. This type of code will allow you to represent a text document in English that’s 1024 characters in length with exactly 1 kB of information.
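(If it helps, here’s a quick Python sketch of that arithmetic; the sample text is just a placeholder.)

```python
# Fixed-width encoding: every character costs 8 bits regardless of how
# predictable it is, so 1024 characters always take 1024 bytes.
text = ("the quick brown fox jumps over the lazy dog " * 24)[:1024]  # placeholder document

ascii_bytes = text.encode("ascii")  # one byte (8 bits) per character
print(len(ascii_bytes))             # 1024 bytes
print(len(ascii_bytes) * 8)         # 8192 bits of storage
```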
Except that’s not quite right. The 1 kB is only how much storage space it takes up in computer memory, not how much information the document actually contains.
In fact, each English character carries something closer to 1.1 bits of actual information, so an optimal compression algorithm could theoretically get that document down to around 141 bytes (about 1126 bits). This is because not all characters occur with equal frequencies in all contexts; which character comes next in a sequence turns out to be quite predictable. Every time a predictive model reaches a point where its confusion about the next token/state is distributed 50-50, it needs to be given one additional bit of information to resolve this confusion. When things are more predictable, it will need fewer bits, and when things are less predictable, it will need more. As you said:
It turns out that for a given probability distribution p(x) over states, the encoding that minimizes average entropy uses strings that have one bit for every halving that it takes to get to p(x)
What you wrote as $S(x) = \log \frac{1}{p(x)}$ is typically referred to as the “surprise” associated with state $x$, proportional to how many bits are necessary to resolve the confusion of a predictive model.
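To make that concrete, here’s a small Python sketch with a made-up distribution over four states (the numbers are purely for illustration):

```python
import math

# A made-up distribution over four states: the more probable a state,
# the less surprising it is, and the fewer bits an optimal code spends on it.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

for state, prob in p.items():
    surprise = math.log2(1 / prob)  # S(x) = log2(1/p(x)), in bits
    print(state, prob, surprise)    # a: 1 bit, b: 2 bits, c and d: 3 bits

# Average surprise = entropy = the best achievable average code length.
entropy = sum(prob * math.log2(1 / prob) for prob in p.values())
print(entropy)  # 1.75 bits per symbol
```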
One example, getting back to data compression, is Huffman coding. Using just character frequency as the predictive model $p(x)$, the surprise for each character will be equal to $\log \frac{1}{p(x)}$. Huffman coding approximates this by giving shorter bit strings to more frequent characters and ensuring that no character’s bit string is a prefix of any other character’s bit string. As you described (without mentioning Huffman coding):
This is how you can use different-sized labels without having an additional “all done” symbol. If the bits known so far match a whole label, then they are not a prefix of any other label. Therefore they could not match any other label, and so you know the bits must refer to the label they already match so far. And using different-sized labels in your “prefix code” lets you reduce your expected entropy in cases where the states are not equally likely.
Using this method, you could compress the English document above down to something like 681 bytes (about 5448 bits), assuming an average of 5.32 bits per character. This is not quite as much compression as is possible for English, since next-character predictions can be made more certain by looking at the context (kind of like what GPT is trained to do), but it’s easier to think about.
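For anyone who wants to play with this, here’s a rough Python sketch of Huffman coding over single-character frequencies. The sample text is a placeholder, so the exact bits-per-character figure will differ from the 5.32 above; the point is just that frequent characters get short codes, and that the prefix property you described makes greedy decoding unambiguous.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code from single-character frequencies in `text`."""
    freq = Counter(text)
    # Each heap entry: (weight, tie-breaker, {char: bitstring-so-far}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, code1 = heapq.heappop(heap)
        w2, i, code2 = heapq.heappop(heap)
        # Prepend 0 to one subtree's codes and 1 to the other's, then merge.
        merged = {ch: "0" + bits for ch, bits in code1.items()}
        merged.update({ch: "1" + bits for ch, bits in code2.items()})
        heapq.heappush(heap, (w1 + w2, i, merged))
    return heap[0][2]

sample = "this is just a placeholder sentence standing in for the document"
code = huffman_code(sample)
encoded = "".join(code[ch] for ch in sample)
print(len(encoded) / len(sample), "bits per character on average")

# Prefix property: decode greedily by matching the bits read so far.
inverse = {bits: ch for ch, bits in code.items()}
decoded, buf = [], ""
for bit in encoded:
    buf += bit
    if buf in inverse:
        decoded.append(inverse[buf])
        buf = ""
print("".join(decoded) == sample)  # True
```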
Also, just a couple of minor errors:
In your “The first 31 binary strings in lexical order” figure, you’re missing a white square at the top of the fourth 3-bit string.
“diving by W” should be “dividing by W”. I know spell check would miss that one.
I didn’t notice any other errors. Again, great article.
Nice catches. I love that somebody double-checked all the binary strings. :)