I’ll say that a model linearly represents a binary feature f if there is a linear probe out of the model’s latent space which is accurate for classifying f.
If a model linearly represents features a and b, then it automatically linearly represents a∧b and a∨b.
I think I misunderstand your definition. Let feature a be represented by x_1 > 0.5, and let feature b be represented by x_2 > 0.5. Let x_i be iid uniform [0, 1]. Isn’t that a counterexample to (a and b) being linearly representable?
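To make the counterexample concrete, here is a minimal simulation sketch of this setup; the sample size, the variable names, and the use of scikit-learn’s LogisticRegression as the probe are my own choices, purely for illustration. The individual features a and b each get an essentially perfect linear probe, while the probe fitted for a∧b stays clearly below 100% accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100_000, 2))  # x_1, x_2 iid uniform on [0, 1]
a = X[:, 0] > 0.5
b = X[:, 1] > 0.5

def probe_accuracy(labels):
    # Fit a linear probe (logistic regression) on the 2-d "latent space" and
    # report its accuracy on the same data.
    clf = LogisticRegression(max_iter=1_000).fit(X, labels)
    return clf.score(X, labels)

print(probe_accuracy(a))      # ~1.0: a is linearly represented
print(probe_accuracy(b))      # ~1.0: b is linearly represented
print(probe_accuracy(a & b))  # clearly below 1.0: the AND region is a corner of the square, not a half-plane
```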
Thanks, you’re correct that my definition breaks in this case. I will say that this situation is a bit pathological for two reasons:
The mode of a uniform distribution doesn’t coincide with its mean.
The variance of the multivariate uniform distribution U([0,1]×[0,1]) is largest along the direction x_1 + x_2, which is exactly the direction we would want to use to represent a AND b.
I’m not sure exactly which assumptions should be imposed to avoid pathologies like this, but maybe something of the form: we are working with boolean features f whose class-conditional distributions D(⋅|f),D(⋅|¬f) satisfy properties like
D(⋅|f),D(⋅|¬f) are unimodal, and their modes coincide with their means
The variance of D(⋅|f),D(⋅|¬f) along any direction is not too large relative to the difference of the means E_{x∼D(⋅|f)}[x] − E_{x∼D(⋅|¬f)}[x]
The variance of the multivariate uniform distribution U([0,1]×[0,1]) is largest along the direction x_1 + x_2, which is exactly the direction we would want to use to represent a AND b.
The variance is actually the same in all directions. One can sanity-check by integration that the variance is 1/12 both along the axes and along the diagonal.
In fact, there’s nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are independent and have equal variance.[1]
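A quick Monte Carlo sanity check of both points (a sketch with arbitrary names of my own): the variance of U([0,1]×[0,1]) comes out to roughly 1/12 ≈ 0.083 along a coordinate axis and along the diagonal alike.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1_000_000, 2))  # samples from U([0,1]×[0,1])

def variance_along(direction):
    # Variance of the projection of X onto the given direction (normalized to unit length).
    u = np.asarray(direction, dtype=float)
    u /= np.linalg.norm(u)
    return (X @ u).var()

print(variance_along([1.0, 0.0]))  # ~0.083 (= 1/12) along the x_1 axis
print(variance_along([1.0, 1.0]))  # ~0.083 along the diagonal x_1 + x_2 as well
```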
The diagram in the post showing that “and” is linearly represented works if the features are represented discretely (so that there are exactly 4 points for 2 binary features, instead of a distribution for each combination). As soon as you start defining features with thresholds like DanielVarga did, the argument stops going through in general, and the claim can become false.
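For the discrete picture, a minimal check with four illustrative points standing in for the diagram (my own coordinates, not the post’s): a probe along x_1 + x_2 with threshold 1.5 classifies a∧b exactly.

```python
import numpy as np

# One point per combination of the two binary features.
points = np.array([[0.0, 0.0],   # neither a nor b
                   [1.0, 0.0],   # a only
                   [0.0, 1.0],   # b only
                   [1.0, 1.0]])  # a and b
and_labels = np.array([False, False, False, True])

# Linear probe: project onto the direction (1, 1) and threshold at 1.5.
predictions = points @ np.array([1.0, 1.0]) > 1.5
print(np.array_equal(predictions, and_labels))  # True
```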
The stuff about unimodality doesn’t seem relevant to me, and in fact seems directionally wrong.

[1] I have a not-fully-verbalized proof which I don’t have time to write out.
Thanks, you’re totally right about the equal variance thing; I had stupidly thought that the projection of U([0,1]²) onto the line y = x would be uniform on [−1/√2, 1/√2] (obviously false!).
The case of a fully discrete distribution (supported in this case on four points) seems like a very special case of something more general, where a “more typical” special case would be something like:
if a and b are both false, then sample from N(0, Σ)
if a is true and b is false, then sample from N(μ_a, Σ)
if a is false and b is true, then sample from N(μ_b, Σ)
if a and b are both true, then sample from N(μ_a + μ_b, Σ)
for some μ_a, μ_b ∈ R^n and covariance matrix Σ. In general, I don’t really expect the class-conditional distributions to be Gaussian, nor for the class-conditional covariances to be independent of the class. But I do expect something broadly like this, where the distributions are concentrated around their class-conditional means, with probability falling off as you move further from the class-conditional mean (hence unimodality), and where the class-conditional variances are not too big relative to the distance between the clusters.
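As an illustration of this picture (a sketch under the Gaussian assumptions above; the dimension, means, noise scales, and sample sizes are arbitrary choices of mine): when the class-conditional spread is small relative to the separation between cluster means, a fitted linear probe recovers a∧b with near-perfect accuracy, and the accuracy degrades as Σ grows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_per_class, dim = 5_000, 16
mu_a = rng.normal(size=dim)
mu_b = rng.normal(size=dim)

def and_probe_accuracy(noise_scale):
    # Sample equally from the four class-conditional Gaussians
    # N(0, Σ), N(μ_a, Σ), N(μ_b, Σ), N(μ_a + μ_b, Σ) with Σ = noise_scale² · I,
    # then fit a linear probe for the label a AND b.
    a = np.repeat([False, True, False, True], n_per_class)
    b = np.repeat([False, False, True, True], n_per_class)
    X = np.outer(a, mu_a) + np.outer(b, mu_b) + noise_scale * rng.normal(size=(4 * n_per_class, dim))
    clf = LogisticRegression(max_iter=1_000).fit(X, a & b)
    return clf.score(X, a & b)

print(and_probe_accuracy(0.1))  # ~1.0: clusters well separated, a AND b is (nearly) linearly separable
print(and_probe_accuracy(2.0))  # noticeably lower: large class-conditional variance breaks the probe
```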
Given that longer explanation, does the unimodality thing still seem directionally wrong?
Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn’t precisely the right well-behavedness desideratum, but I retract the “directionally wrong”.