To understand entropy you need to understand two things: expected values and self-information.
Expected value is what happens on average: a sum of outcomes weighted by how likely they are. The expected value of a roll of a fair six-sided die is 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 3.5.
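To make the weighted sum concrete, here is a minimal sketch in Python (the variable names are just illustrative):

```python
# Expected value: each outcome weighted by its probability, then summed.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6  # a fair die

expected_value = sum(x * p for x, p in zip(outcomes, probs))
print(expected_value)  # 3.5 (up to floating-point rounding)
```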
Self-information is, roughly, how many bits of information you gain if something random becomes certain. For example, a fair coin comes up heads with probability 0.5 and tails with probability 0.5. So if you learn that this fair coin actually came up tails, you gain a number of bits equal to the self-information of that outcome, which is -log_2(0.5) = log_2(1/0.5) = 1.
Why people decided that this specific measure is the one we should use is a good question, and the answer is not so obvious. Information theorists justify it axiomatically: if we pick this measure, it obeys some nice properties we intuitively want it to obey, such as “we want this number to be higher for unlikely events” and “we want this number to be additive for independent events.” This is why we get a minus sign and a log (the base does not matter as long as it is larger than 1, but people like base 2).
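Here is a small numerical check of both properties, as a Python sketch (the helper name self_information is just mine, not a library function):

```python
import math

def self_information(p):
    # Bits you learn when an outcome of probability p actually happens.
    return -math.log2(p)

print(self_information(0.5))    # 1.0 bit: one fair coin flip
print(self_information(0.25))   # 2.0 bits: rarer events carry more information

# Additivity for independent events: the info of "both happen" is the sum of the infos.
print(self_information(0.5 * 0.25))                     # 3.0
print(self_information(0.5) + self_information(0.25))   # 3.0
```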
Entropy is just the expected self-information, so once you understand the above two, you understand entropy.
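In code, that is just the probability-weighted sum of self-information. A minimal sketch (again, the function name is illustrative):

```python
import math

def entropy(probs):
    # Expected self-information: sum over outcomes of p * (-log2 p).
    return sum(p * -math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin
print(entropy([1.0]))       # 0.0 bits: a certain outcome tells you nothing new
```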
Once you understand entropy, the reason entropy is maximized by the uniform distribution is related to why the area of a figure with a given perimeter is maximized by a circle. Area is also a sum (of tiny pie slices of the figure), just like entropy is a sum. For area, the constraint is that the perimeter is a given number; for entropy, the constraint is that the probabilities must sum to one. You can think of both constraints as “normalization constraints”: things have to sum to some fixed number.
In both cases, the sum is maximized if individual pieces are as equal to each other as allowed.
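A quick numerical illustration of that claim (a sketch, assuming base-2 logs as above):

```python
import math

def entropy(probs):
    return sum(p * -math.log2(p) for p in probs if p > 0)

# Both distributions satisfy the same constraint (they sum to one),
# but the more equal the pieces are, the larger the sum.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over four outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits: same constraint, less equal
```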
Does the information theory definition of entropy actually correspond to the physics definition of entropy? I understand what entropy means in physics, but the information theory definition of the term seemed fundamentally different to me. Is it, or does one actually correspond to the other in some way that I’m not seeing?
Shannon’s definition of entropy corresponds very closely to the definition of entropy used in statistical mechanics. It’s slightly more general and devoid of “physics baggage” (macrostates and so on).
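One way to see the correspondence: the Gibbs entropy S = -k_B * sum_i p_i ln p_i is Shannon’s formula with a natural log and a constant factor of k_B in front. A hedged sketch (the formula and the value of Boltzmann’s constant are standard; the code itself is just illustrative):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K

def shannon_entropy_bits(probs):
    return sum(p * -math.log2(p) for p in probs if p > 0)

def gibbs_entropy(probs):
    # Same sum, natural log, scaled by k_B; differs from Shannon's H only by a constant factor.
    return -K_B * sum(p * math.log(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]
print(shannon_entropy_bits(probs))                 # 1.5 bits
print(gibbs_entropy(probs) / (K_B * math.log(2)))  # 1.5 again: only the units changed
```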
Analogy: the Ising model of interacting spins vs. undirected graphical models (Markov random fields). The former comes with a lot of baggage like “magnetization, external field, energy.” The latter is just a statistical model of conditional independence on a graph. The Ising model is a special case (in fact the earliest developed case, dating back to the 1920s) of a Markov random field.
Physicists have a really good nose for models.