In a related matter, if I want to study the information theory of hierarchical Bayesian models, what definition of information should I be reading up on? Shannon entropy seems to only apply to discrete distributions; mutual differential entropy seems like the thing I want, but has “weird” properties like occasionally being negative; K-L divergence seems more like a tool for hypothesizing things about out-of-model error than for examining links between random variables.
What I really want to do is: set up a statistical model, and ask a well-defined, well-informed theoretical question about “How much information do I get about random variable Y if I measure random variable X, where the relationships between their distributions are specified by this model I’ve got here?” Having asked the theoretical question, I want to maybe derive some maths (an analytical expression/equation for the question I asked would be nice), and then run an empirical experiment where I construct such a model and show how the equation holds numerically.
The ultimate goal is to find out precisely how the “Blessing of Abstraction” mentioned in Probabilistic Models of Cognition is happening, and characterize it as a general feature of hierarchical modelling.
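For concreteness, here is a minimal sketch (in Python, with a toy Gaussian hierarchy I'm making up purely for illustration) of the kind of numerical check I have in mind: a shared latent variable θ generates both X and Y, the mutual information is known analytically, and a simple sample-based estimate should reproduce it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hierarchical model (purely illustrative):
#   theta ~ N(0, tau^2)            -- shared latent variable
#   X | theta ~ N(theta, sigma^2)
#   Y | theta ~ N(theta, sigma^2)  -- conditionally independent of X given theta
tau, sigma, n = 2.0, 1.0, 200_000

theta = rng.normal(0.0, tau, size=n)
x = theta + rng.normal(0.0, sigma, size=n)
y = theta + rng.normal(0.0, sigma, size=n)

# Analytic answer: (X, Y) are jointly Gaussian with correlation
# rho = tau^2 / (tau^2 + sigma^2), so I(X; Y) = -0.5 * log(1 - rho^2).
rho = tau**2 / (tau**2 + sigma**2)
mi_analytic = -0.5 * np.log(1 - rho**2)

# Empirical check: estimate the correlation from samples and plug it
# into the same Gaussian mutual-information formula.
rho_hat = np.corrcoef(x, y)[0, 1]
mi_empirical = -0.5 * np.log(1 - rho_hat**2)

print(f"analytic  I(X;Y) = {mi_analytic:.4f} nats")
print(f"empirical I(X;Y) = {mi_empirical:.4f} nats")
```

With τ = 2 and σ = 1 the analytic value is about 0.51 nats, so a reasonable sample size should land within a couple of decimal places of that.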
Short answer: use differential entropy and differential mutual information.
Differential entropy and Shannon entropy are both instances of a more general concept: Shannon entropy applies to discrete distributions, differential entropy to absolutely continuous ones.
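For reference, the two definitions side by side (one natural reading of the "more general concept" is entropy taken with respect to a reference measure: the counting measure gives the Shannon case, the Lebesgue measure the differential case):

$$H(X) = -\sum_x p(x)\log p(x), \qquad h(X) = -\int p(x)\log p(x)\,dx.$$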
KL-divergence is really more about approximation than about dependence between variables; it would also be strange to use it for your purposes, since it is not symmetric. That said, KL-divergence is tightly connected to mutual information:
I(X; Y) = KL(p(x,y) || p(x)p(y))
Differential mutual information is a measure of dependence between variables; differential entropy is a measure of… what? In the discrete case we have the encoding interpretation, but it breaks down in the continuous case. The fact that differential entropy can be negative shouldn't bother you, because its interpretation is unclear anyway.
As for (differential) mutual information, it can’t be negative, as you can see from the formula above (KL-divergence is non-negative). Nothing weird occurs here.
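If you want to see both points numerically, here is a minimal sketch in Python (the bivariate Gaussian and the particular correlation are just an illustrative choice, picked because everything there has a closed form). It estimates KL(p(x,y) || p(x)p(y)) by Monte Carlo, averaging log p(x,y) − log p(x) − log p(y) over samples from the joint, and compares the result to the exact Gaussian mutual information −½ log(1 − ρ²); the estimate should come out non-negative and very close to the closed form.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Bivariate standard Gaussian with correlation rho (illustrative choice).
rho = 0.7
joint = multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, rho], [rho, 1.0]])

# Monte Carlo estimate of KL(p(x,y) || p(x)p(y)):
# average log p(x,y) - log p(x) - log p(y) under samples from the joint.
samples = joint.rvs(size=500_000, random_state=1)
log_joint = joint.logpdf(samples)
log_product = norm.logpdf(samples[:, 0]) + norm.logpdf(samples[:, 1])
mi_monte_carlo = np.mean(log_joint - log_product)

# Closed form for a bivariate Gaussian: I(X; Y) = -0.5 * log(1 - rho^2).
mi_exact = -0.5 * np.log(1 - rho**2)

print(f"Monte Carlo I(X;Y) = {mi_monte_carlo:.4f} nats")
print(f"exact       I(X;Y) = {mi_exact:.4f} nats")
```

With ρ = 0.7 the exact value is about 0.34 nats.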
Is LessWrong interested in Bayesian machine learning introductory articles?
I would be interested in that; don't know about anyone else.
Brilliant! Thank you so much!
Seems to me to fall squarely within the core of things LW is likely to be interested in.
I certainly would be interested.
Yes, ideally with a how-to on applying the concepts in Python or R.
The phrase, “DO IT!”, spoken in an excited tone, comes to mind.