I have a minor disagreement, which I think supports your general point. There is definitely a type of compression going on in the algorithm; it’s just that the key insight is not simply to “minimize entropy” but to make the outputs of the encoder behave like the observed data. Indeed, one of the major insights of information theory is that the encoding scheme should capture the properties of the distribution over the messages (and hence over the alphabet).
Namely, in Hinton’s algorithm the outputs of the encoder are fed through a logistic function and then the cross-entropy is minimized (essentially the KL divergence). It seems that he’s providing something like a reparameterization of a probability mass function for pixel intensities, which is a logistic distribution when conditioned on the “deeper” nodes. Minimizing that KL divergence makes the model distribution statistically indistinguishable from the distribution over the data intensities: the KL divergence is the expected log-likelihood ratio, so driving it toward zero drives down the power of even the uniformly most powerful test for telling the two distributions apart.
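To make that decomposition concrete, here is a minimal NumPy sketch (not Hinton’s actual code; the single logit and the Bernoulli pixel model are illustrative assumptions on my part). For binary intensities, the cross-entropy between the data and the logistic output splits into the data entropy plus the KL divergence, so the optimization effectively only “sees” the KL term:

```python
# Minimal sketch: cross-entropy between Bernoulli data p and a logistic
# model output q decomposes as H(p) + KL(p || q). H(p) is a constant of
# the data, so minimizing cross-entropy is minimizing the KL divergence.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
p = 0.3                                   # true Bernoulli intensity of a pixel
pixels = rng.binomial(1, p, size=10_000)  # observed binary data

logit = 0.5                # a single "deeper node" activation (illustrative)
q = sigmoid(logit)         # model probability after the logistic function

cross_entropy = -np.mean(pixels * np.log(q) + (1 - pixels) * np.log(1 - q))
entropy_p = -(p * np.log(p) + (1 - p) * np.log(1 - p))
kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Up to sampling noise, cross_entropy ≈ entropy_p + kl.
print(cross_entropy, entropy_p + kl)
```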
Minimizing entropy blindly would mean the neural network’s nodes give constant output, which is very compressive but utterly useless.
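As a trivial illustration of that degenerate case (a purely hypothetical encoder, not anything from the paper): a constant code has zero output entropy yet zero mutual information with its input, so nothing can be decoded from it.

```python
# Degenerate case: a constant "code" has zero entropy (maximally compressive)
# but zero mutual information with the data, so it is useless for reconstruction.
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.5, size=10_000)  # data
z = np.zeros_like(x)                   # constant encoder output: H(z) = 0
# I(x; z) = H(z) - H(z | x) = 0 - 0 = 0: the code tells us nothing about x.
```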