Doing some simulations on a similar problem (1-d, with p(x) = x), I’m getting results indicating that this isn’t working well at all. The entropy reduction from piling a large fraction of the samples into one bin seems to override the entropy reduction from making the probabilities more skewed, at least in this case. I am surprised, and a bit perplexed.
EDIT: I was hoping that a better measure, such as the mutual information I(x; y) between bin (x) and probability of being the same function (y), would work. But this boils down to the measure I suggested last time: I(x; y) = H(y) − H(y|x). H(y) is fixed just by the distribution of same vs. not, and H(y|x) = −∑_x p(x) ∫ p(y|x) log p(y|x) dy, so maximizing the mutual information is the same as minimizing the measure I suggested last time.
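Spelling that out for the binary same/not-same case (my notation: p_i is the probability of “same” in bin i and q_i = 1 − p_i), the conditional entropy is just the occupancy-weighted per-bin binary entropy:

\[
I(X;Y) = H(Y) - H(Y \mid X), \qquad
H(Y \mid X) = \sum_i p(\mathrm{bin}\ i)\,\bigl(-p_i \log p_i - q_i \log q_i\bigr),
\]

and since H(Y) does not depend on the binning, maximizing I(X;Y) is exactly minimizing H(Y|X).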
What’s the similar problem? “1-d, with p(x) = x” doesn’t mean much to me. It sounds like you’re looking for bins on the region 1 to d, with p(x) = x. I think that if you used the −p log p − q log q entropy, it would work fine.
“1-d”, meaning one-dimensional. n bins between 0 and 1, samples drawn uniformly in the space X = [0,1], with probability p(x) = x of being considered the same.
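For concreteness, a minimal sketch of that kind of simulation (not the exact code I ran; the particular binnings compared and the helper name conditional_entropy are just illustrative):

import numpy as np

rng = np.random.default_rng(0)

def conditional_entropy(edges, n_samples=100_000):
    """Estimate H(y|x) for bins given by `edges` on [0, 1], where each sample
    x ~ Uniform[0, 1] is 'same' (y = 1) with probability p(x) = x."""
    x = rng.uniform(0.0, 1.0, n_samples)
    y = rng.uniform(0.0, 1.0, n_samples) < x   # y = 1 with probability x
    bins = np.digitize(x, edges[1:-1])         # bin index for each sample
    h = 0.0
    for b in range(len(edges) - 1):
        mask = bins == b
        w = mask.mean()                        # p(bin) estimated from occupancy
        if w == 0:
            continue
        p_same = y[mask].mean()                # p(y = 1 | bin)
        for p in (p_same, 1.0 - p_same):
            if p > 0:
                h -= w * p * np.log(p)
    return h

print(conditional_entropy(np.linspace(0, 1, 5)))                  # 4 equal-width bins
print(conditional_entropy(np.array([0, 0.9, 0.95, 0.99, 1.0])))   # one huge bin

The comparison of interest is equal-width bins versus a deliberately lopsided binning that puts most of the mass into one bin.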