Computing the description length using the entropy of a feature activation’s probability distribution is flexible enough to distinguish between different types of distributions. For example, a binary distribution has an entropy of at most one bit, while distributions spread out over more values have larger entropies.
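As a minimal sketch of what this could look like in practice (assuming the activations have already been discretized into bins; the function name and example bin probabilities are purely illustrative):

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # zero-probability bins contribute nothing
    return float(-np.sum(probs * np.log2(probs)))

# A binary (on/off) feature has an entropy of at most one bit.
print(entropy_bits([0.9, 0.1]))   # ~0.47 bits
print(entropy_bits([0.5, 0.5]))   # exactly 1 bit

# An activation spread over more discretized values has a larger
# entropy, i.e. a longer description length.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2 bits
```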
Yep, that’s completely true. Thanks for the reminder!