Can somebody explain to me what's happening in this paper?
My reading is that their definition of conditional predictive entropy is the natural generalization of Shannon's conditional entropy when the way you condition on data is restricted to functions from a particular class. The corresponding generalization of mutual information then measures how much more predictable some piece of information Y becomes given evidence X, compared to no evidence.
For example, the goal of public-key cryptography cannot be to make the mutual information between the plaintext and the pair (public key, ciphertext) zero while maintaining maximal mutual information between the ciphertext and the plaintext given the private key, since this is information-theoretically impossible.
Cryptography instead assumes that everyone involved can only condition their probability distributions using polynomial-time algorithms applied to the data they have. In that setting you can minimize the predictability of the plaintext after receiving the public key and ciphertext, while maximizing its predictability after receiving the private key and ciphertext.
More mathematically, they assume that the functions from your data to your conditional probability distributions must come from a set V, with the property that for any probability distribution you are able to output given the right data, you also have the choice of simply outputting that distribution without looking at the data at all. In other words, if you can represent it, you can output it. This corresponds to equation (1).
The conditional Shannon entropy of a random variable Y given X is
H(Y|X) = −∫∫ p(x,y) log p(y|x) dx dy
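As a quick numerical illustration of this formula (using a made-up joint distribution, just to fix ideas):

```python
import math

# Hypothetical joint distribution p(x, y) over binary X and Y,
# chosen only to illustrate the formula above.
p = {(0, 0): 0.4, (0, 1): 0.1,
     (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x), then H(Y|X) = -sum_{x,y} p(x,y) log p(y|x),
# with p(y|x) = p(x,y) / p(x).
px = {x: sum(v for (xi, _), v in p.items() if xi == x) for x in (0, 1)}
h_y_given_x = -sum(v * math.log(v / px[x]) for (x, _), v in p.items())

print(f"H(Y|X) = {h_y_given_x:.4f} nats")  # about 0.5004 nats
```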
Thus, the predictive entropy of a random variable Y given X, when you can only condition using functions in V, would be
H_V(Y|X) = inf_{f∈V} −∫∫ p(x,y) log f(y|x) dx dy
where f(y|x) denotes f[x](y), if we'd like to use the notation of the paper.
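To make the restriction concrete, here is a toy sketch of my own (not from the paper): take Y = X exactly, but let V contain only constant Bernoulli predictors, so every f[x] ignores x. Then H_V(Y|X) collapses to the marginal entropy H(Y) even though Shannon's H(Y|X) = 0; the information in X exists but is not usable by this family.

```python
import math

# Toy setup (my own, for illustration): X uniform on {0,1} and Y = X,
# so the Shannon conditional entropy H(Y|X) is exactly 0.
# V = constant Bernoulli(q) predictors: f[x](y) never depends on x.
# A grid search stands in for the infimum over V.
grid = [i / 100 for i in range(1, 100)]

def neg_log_lik(q):
    # -sum_{x,y} p(x,y) log f[x](y); only (0,0) and (1,1) carry mass 0.5,
    # contributing log(1-q) and log(q) respectively.
    return -(0.5 * math.log(1 - q) + 0.5 * math.log(q))

h_v = min(neg_log_lik(q) for q in grid)
print(f"H_V(Y|X) = {h_v:.4f} nats")  # ln 2 ≈ 0.6931, the marginal entropy of Y
```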
Using this we can define predictive information, which, as said before, answers the question "how much more predictable is Y after we get the information X, compared to no information?":
I_V(X→Y) = H_V(Y|∅) − H_V(Y|X)
which they also show can be well estimated empirically by the naive sampling method (i.e. replacing the expectations in Definition 2 with averages over samples).
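As a minimal sketch of that naive estimator (my own toy example with made-up parameters, not the paper's experiment): let Y be the sign of X flipped with 25% noise, and let V contain Bernoulli predictors whose parameter may depend only on sign(x). Since the parameter grid allows the two branches to coincide, every constant predictor is in V, which is exactly the "if you can represent it, you can output it" property needed for H_V(Y|∅) to make sense:

```python
import math, random

random.seed(0)

# Simulated data (hypothetical setup): Y = sign(X) flipped with 25% noise.
n = 2000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [(x > 0) != (random.random() < 0.25) for x in xs]

# V: Bernoulli predictors whose parameter may depend on sign(x).
# Allowing q_neg == q_pos keeps every constant predictor inside V,
# so H_V(Y|no evidence) is just the infimum over the constants.
grid = [i / 20 for i in range(1, 20)]

def empirical_nll(q_neg, q_pos):
    # Naive sampling estimate of -E[log f(y|x)]: average over the data.
    s = 0.0
    for x, y in zip(xs, ys):
        q = q_pos if x > 0 else q_neg
        s += -math.log(q if y else 1.0 - q)
    return s / n

h_given_x = min(empirical_nll(qn, qp) for qn in grid for qp in grid)
h_given_none = min(empirical_nll(q, q) for q in grid)

# I_V is nonnegative here by construction: the constants are a subset of V.
print(f"I_V(X -> Y) ≈ {h_given_none - h_given_x:.3f} nats")
```

With these parameters the estimate lands near ln 2 − H(Bernoulli(0.75)) ≈ 0.13 nats, the gap between the marginal entropy of Y and the residual noise entropy.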