Math-wise, you pick whichever distribution maximizes the integral from negative infinity to infinity of p(x)log(p(x)) with respect to x, multiplied by −1. In symbols, that's −∫ p(x) log p(x) dx.
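To make that concrete, here's a minimal sketch of the discrete analogue (a sum instead of an integral). The `entropy_bits` helper is just a name I've made up for illustration; it assumes the probabilities sum to 1.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits (log base 2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # by convention, 0 * log(0) counts as 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))       # fair coin: 1.0 bit
print(entropy_bits([1.0, 0.0]))       # certain outcome: 0.0 bits, nothing left to learn
```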
The crux of why this makes sense is that this value can be interpreted as the amount of information you expect to learn from hearing that x happened. Or, more straightforwardly, it's how much you expect to not know about a particular variable/event. If you use log base 2, it's measured in the average number of yes/no questions needed to efficiently learn what happened. For an explanation of why that's true, these articles are excellent.
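A quick sanity check of that yes/no-questions reading, as a sketch: with 8 equally likely outcomes, each question can at best halve the set of possibilities, so you need log2(8) = 3 questions, which is exactly the entropy in bits.

```python
import math

# Halve a set of 8 equally likely candidates until one remains:
# the number of questions needed equals the entropy in bits.
n = 8
questions = 0
while n > 1:
    n //= 2          # each yes/no question rules out half the candidates
    questions += 1

print(questions, math.log2(8))   # 3 3.0
```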
The reason you want to maximize this value is that choosing a lower-entropy distribution amounts to assuming information you don't actually have. Say the maximum entropy distribution has 5 bits of entropy, and some other candidate has 4. If you choose the 4-bit one, you're basically making up information by acting as though you need one fewer yes/no question than you actually do.
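Here's a minimal sketch putting numbers on that 5-bit vs 4-bit gap. The setup is mine, purely for illustration: 32 equally likely outcomes give 5 bits, while assuming (without evidence) that only 16 of them are possible gives 4 bits, so the missing bit is one yes/no question you've answered for free.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits (redefined here so this snippet runs on its own)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Max-entropy choice: all 32 outcomes equally likely -> 5 bits.
uniform32 = np.full(32, 1/32)

# A lower-entropy choice: pretend only 16 of the 32 outcomes are possible -> 4 bits.
assume16 = np.concatenate([np.full(16, 1/16), np.zeros(16)])

print(entropy_bits(uniform32))   # 5.0
print(entropy_bits(assume16))    # 4.0
```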