We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon’s information theory that takes into account the modeling power and computational constraints of the observer. The resulting predictive V-information encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon’s mutual information and in violation of the data processing inequality, V-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, V-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate predictive V-information is more effective than mutual information for structure learning and fair representation learning.
A Theory of Usable Information Under Computational Constraints
h/t Simon Pepin Lehalleur
Can somebody explain to me what’s happening in this paper?
My reading is that their definition of conditional predictive entropy is the naive generalization of Shannon’s conditional entropy when the way you condition on data is restricted to functions from a particular class. The corresponding generalization of mutual information then measures how much more predictable some quantity Y becomes given evidence X, compared to no evidence.
For example, the goal of public-key cryptography cannot be to make the mutual information between the plaintext and the pair (public key, ciphertext) zero while keeping the mutual information between the ciphertext and the plaintext, given the private key, maximal, since that is impossible.
Cryptography instead assumes that everyone involved can only condition their probability distributions using polynomial-time functions of the data they have. Under that assumption, you can make the plaintext essentially unpredictable to someone holding only the public key and the ciphertext, while keeping it fully predictable to someone holding the private key and the ciphertext.
More mathematically, they assume you can only map your data to conditional probability distributions using functions from a fixed set V, with the property that for any distribution you are able to output given the right data, you also have the option of simply outputting that distribution without looking at the data. In other words, if you can represent it, you can output it unconditionally. This corresponds to equation (1) in the paper, sketched below.
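If I’m reading their Definition 1 correctly, that condition says roughly the following, with f[x] denoting the distribution over Y that the predictor f outputs when given x (and f[∅] its output when given nothing):

$$\forall f \in \mathcal{V},\ \forall P \in \operatorname{range}(f),\ \exists f' \in \mathcal{V} \ \text{such that}\ f'[x] = P \ \text{for all}\ x \in \mathcal{X} \cup \{\varnothing\}$$

That is, any distribution some predictor in V can produce from data can also be produced by a predictor in V that ignores the data entirely.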
The conditional Shannon entropy of a random variable Y given X is
$$H(Y \mid X) = -\iint p(x, y) \log p(y \mid x)\, dx\, dy$$
Thus, the predictive entropy of a random variable Y given X, when we can only condition using functions in V, would be
$$H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \; -\iint p(x, y) \log f(y \mid x)\, dx\, dy$$
where $f(y \mid x) = f[x](y)$, if we’d like to use the notation of the paper.
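For completeness, the no-evidence term used below, H_V(Y|∅), is (in my paraphrase of their Definition 2) the same infimum with the predictor given no input at all:

$$H_{\mathcal{V}}(Y \mid \varnothing) = \inf_{f \in \mathcal{V}} \; -\int p(y) \log f[\varnothing](y)\, dy$$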
And using this we can define predictive information, which, as said before, answers the question “how much more predictable is Y after we get the information X, compared to no information?” by
$$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X)$$
which they also show can be estimated well from data by the naive sampling method (i.e. replacing the expectations in Definition 2 with empirical averages over samples), with PAC-style guarantees.
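To make the estimator concrete, here is a minimal sketch (my own illustration, not code from the paper) of what it could look like when V is the class of logistic-regression predictors; the function name, the train/test split, and the use of scikit-learn are all assumptions on my part:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def empirical_v_information(X, y, seed=0):
    """Naive estimate of I_V(X -> Y) with V = logistic-regression predictors.

    H_V(Y|X) is approximated by the held-out mean negative log-likelihood of a
    fitted logistic model; H_V(Y|no evidence) by that of the best data-ignoring
    predictor, which under log-loss is just the empirical class frequencies.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    # H_V(Y | X): fit f in V by maximum likelihood, evaluate -E[log f[x](y)].
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    h_y_given_x = log_loss(y_te, clf.predict_proba(X_te), labels=clf.classes_)

    # H_V(Y | no evidence): best constant predictor = empirical class frequencies.
    classes, counts = np.unique(y_tr, return_counts=True)
    marginal = counts / counts.sum()
    h_y = log_loss(y_te, np.tile(marginal, (len(y_te), 1)), labels=classes)

    # Predictive V-information: reduction in predictive entropy (in nats).
    return h_y - h_y_given_x
```

Calling empirical_v_information(X, y) on a feature matrix and labels returns an estimate of how many nats of uncertainty about the labels a logistic model can remove by looking at the features; evaluating on held-out data rather than the training set is what keeps the estimate in the spirit of the generalization guarantees the paper cares about.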