The variance of the multivariate uniform distribution U([0,1]×[0,1]) is largest along the direction x_1 + x_2, which is exactly the direction along which we would want a AND b to be represented.
The variance is actually the same in all directions. One can sanity-check by integration that the variance is 1/12 both along an axis and along the diagonal.
In fact, there’s nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are independent and have equal variance.[1]
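As a quick numerical sanity check of this, one can verify that the projection of U([0,1]^2) onto an axis and onto the diagonal have the same variance. This is a minimal sketch assuming numpy; the sample size and the particular directions are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from U([0,1] x [0,1]): two independent coordinates, each with variance 1/12.
x = rng.uniform(0.0, 1.0, size=(1_000_000, 2))

# For a unit direction v, Var(v . X) = v^T Cov(X) v. With independent coordinates
# of equal variance s^2, Cov(X) = s^2 * I, so this equals s^2 regardless of v.
axis = np.array([1.0, 0.0])
diagonal = np.array([1.0, 1.0]) / np.sqrt(2.0)

print(np.var(x @ axis))      # ~ 1/12 ~ 0.0833
print(np.var(x @ diagonal))  # ~ 1/12 ~ 0.0833
```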
The diagram in the post showing that “and” is linearly represented works if the features are represented discretely (so that there are exactly 4 points for 2 binary features, instead of a distribution for each combination). As soon as you start defining features with thresholds like DanielVarga did, the argument stops going through in general, and the claim can become false.
The stuff about unimodality doesn’t seem relevant to me, and in fact seems directionally wrong.

[1] I have a not-fully-verbalized proof which I don’t have time to write out.
Thanks, you’re totally right about the equal variance thing—I had stupidly thought that the projection of U([0,1]^2) onto y = x would be uniform on [−1/√2, 1/√2] (obviously false!).
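To spell out why the uniform-projection intuition fails (a small sketch under the same numpy assumption as above): the centered projection onto the diagonal is a scaled sum of two independent uniforms, so its density is triangular on [−1/√2, 1/√2] rather than flat, even though its variance is still 1/12.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(1_000_000, 2))

# Centered projection onto the line y = x: ((x1 - 1/2) + (x2 - 1/2)) / sqrt(2).
proj = (x - 0.5).sum(axis=1) / np.sqrt(2.0)

# Histogram counts rise toward the middle and fall off toward the ends
# (a triangular density), instead of being flat as a uniform would be.
counts, _ = np.histogram(proj, bins=9, range=(-1 / np.sqrt(2), 1 / np.sqrt(2)))
print(counts / counts.max())
print(np.var(proj))  # still ~ 1/12
```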
The case of a fully discrete distribution (supported in this case on four points) seems like a very special case of something more general, where a “more typical” special case would be something like:
if a, b are both false, then sample from N(0, Σ)
if a is true and b is false, then sample from N(μ_a, Σ)
if a is false and b is true, then sample from N(μ_b, Σ)
if a and b are true, then sample from N(μ_a + μ_b, Σ)
for some μ_a, μ_b ∈ R^n and covariance matrix Σ. In general, I don’t really expect the class-conditional distributions to be Gaussian, nor for the class-conditional covariances to be independent of the class. But I do expect something broadly like this, where the distributions are concentrated around their class-conditional means with probability falling off as you move further from the class-conditional mean (hence unimodality), and that the class-conditional variances are not too big relative to the distance between the clusters.
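As a concrete illustration of this picture (a minimal sketch, not anything from the post: the dimension, means, covariance, and threshold below are made-up assumptions, and numpy is assumed), one can sample from the four class-conditional Gaussians and check that a single direction with a threshold reads off a AND b when the class-conditional variance is small relative to the spacing between cluster means:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4            # ambient dimension (arbitrary for this sketch)
sigma = 0.1      # class-conditional std dev, small relative to the cluster spacing
m = 10_000       # samples per class

mu_a = np.zeros(n); mu_a[0] = 1.0
mu_b = np.zeros(n); mu_b[1] = 1.0

def sample(mean):
    # Class-conditional distribution N(mean, sigma^2 * I).
    return mean + sigma * rng.standard_normal((m, n))

clusters = {
    (False, False): sample(np.zeros(n)),
    (True, False):  sample(mu_a),
    (False, True):  sample(mu_b),
    (True, True):   sample(mu_a + mu_b),
}

# Candidate linear readout for "a AND b": project onto mu_a + mu_b and put the
# threshold halfway between the "one feature on" projections and the "both on" one.
d = (mu_a + mu_b) / np.linalg.norm(mu_a + mu_b)
threshold = (max(d @ mu_a, d @ mu_b) + d @ (mu_a + mu_b)) / 2

for (a, b), points in clusters.items():
    frac_fired = np.mean(points @ d > threshold)
    print(a, b, frac_fired)  # close to 1.0 only for the (True, True) cluster
```

With a larger sigma relative to the cluster spacing, the clusters overlap and the same linear readout stops working, which is the sense in which the “variances not too big” condition matters.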
Given that longer explanation, does the unimodality thing still seem directionally wrong?
Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn’t precisely the right well-behavedness desideratum, but I retract the “directionally wrong”.