Sorry for the stupid question, but when you say f:X→Y is “(almost) linear”, I presume that means f(x1+x2)≈f(x1)+f(x2) and f(kx)≈kf(x). But what are X and Y here? Activations? Weights? Weight-changes-versus-initialization? Some-kind-of-semantic-meaning-encoded-as-a-high-dimensional-vector-of-pattern-matches-against-every-possible-abstract-concept? More than one of the above? Something else?
For an image-classification network, if we remove the softmax nonlinearity from the very end, then X would be the input image in pixel space and Y would be the class logits. Then f(x1+x2)≈f(x1)+f(x2) would mean that an image containing two objects leads to an ambiguous classification (high log-probability for both classes), and f(kx)≈kf(x) would mean higher class certainty when the image has higher contrast (scaling the logits by k is equivalent to softmax temperature 1/k). I guess that kind of makes sense, but yeah, I think for real neural networks this will only be linear-ish at best.
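If anyone wants to poke at this empirically, here's a minimal sketch: feed a pretrained classifier x1, x2, and x1+x2 and compare the raw logits. The model choice (torchvision's resnet18) and the input file names are just illustrative assumptions, and I'm skipping the usual mean/std input normalization since that step is affine rather than linear.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Any pretrained classifier works; resnet18 is an illustrative choice.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # pixels scaled to [0, 1]
])

def logits(x):
    with torch.no_grad():
        return model(x.unsqueeze(0)).squeeze(0)

x1 = preprocess(Image.open("dog.jpg"))  # hypothetical input files
x2 = preprocess(Image.open("cat.jpg"))

# Additivity: compare f(x1 + x2) with f(x1) + f(x2).
lhs, rhs = logits(x1 + x2), logits(x1) + logits(x2)
print("additivity cos sim:",
      torch.nn.functional.cosine_similarity(lhs, rhs, dim=0).item())

# Homogeneity: compare f(2 * x1) with 2 * f(x1).
print("homogeneity cos sim:",
      torch.nn.functional.cosine_similarity(
          logits(2 * x1), 2 * logits(x1), dim=0).item())
```

Cosine similarity near 1 would support the linear-ish story; my guess is the numbers come out well short of that.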
Well, averaging / adding two images in pixel space usually gives a thing that looks like two semi-transparent images overlaid, as opposed to “an image with two objects”.
If both images have the main object near the middle of the image or taking up most of the space (which is usually the case for single-class photos taken by humans), then yes. Otherwise, summing two images with small, off-center items will just look like a low-contrast, noisy image of two items.
Either way, though, I would expect this to result in class-label ambiguity. However, in some cases of semi-transparent object overlay, the overlay may end up mixing features in such a jumbled way that neither of the “true” classes is discernible. This would be a case where the almost-linearity of the network breaks down.
Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I’m just making stuff up here.
Yes, this is exactly right. This is precisely the kind of linearity I am talking about, not the input→output mapping, which is clearly nonlinear. The idea is that hidden inside the network is a linear latent space where we can perform linear operations and they (mostly) work. The points of evidence in the post include discussion of exactly this kind of latent-space editing for Stable Diffusion. A nice example is this paper. Interestingly, this also works for fine-tuned weight diffs, e.g., for style transfer.
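For anyone who wants to try the latent-arithmetic version concretely, here's a rough sketch using the diffusers library: encode two prompts with the frozen CLIP text encoder, average the embeddings, and decode the mixture. The prompts and checkpoint name are illustrative assumptions, not anything from the post.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def embed(prompt):
    # Tokenize and run through the frozen CLIP text encoder.
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(pipe.device)
    with torch.no_grad():
        return pipe.text_encoder(tokens)[0]

# Linear operation in the text-embedding space: average two concepts.
mixed = 0.5 * (embed("a photo of a dog") + embed("a photo of a cat"))

# Decode the mixed embedding; if the latent space is linear-ish,
# the output should show features of both concepts.
image = pipe(prompt_embeds=mixed).images[0]
image.save("dog_cat_mix.png")
```

Averaging the text embeddings (rather than the pixel outputs) is exactly the “linear operations in a latent space” claim: the nonlinear decoder is applied after the linear mix, so a coherent two-concept image would be evidence for the linearity story.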