This post reminds me of a self-supervised machine learning model I wanted to build a while ago but never got around to.
Essentially the idea was to split the data up into a “small connected region” vs “everything else”. For instance, if the data were images, I’d pick out a small patch of the image vs the rest of the image. And then I’d predict the distribution of the small region from everything else. The logic was that one could then invert the process, by taking a patch and predicting the distribution that this patch came from; this would tell you the features of the patch that correlate with far-away information.
More formally, my idea was to have two neural networks, which I called F and G, where F takes an image and outputs a feature vector, and G takes a feature vector and somehow functions as a generative model (in my original proposal I chose G to be Image-GPT, but you could also imagine training G as a GAN or something). Let x be an image, let C(x) be a small region from that image, let M(x) be most of x with C(x) masked away, and let P(y|G(f)) be the probability of y according to G conditioned on f.
If we then consider the expression P(C(x)|G(F(M(x)))), this essentially measures how well a region of an image is predicted by features “far away”. So you can optimize this to get a model that does the sort of thing that your post talks about.
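A minimal sketch of that objective in NumPy, to make the notation concrete. Everything specific here is an assumption for illustration and not the original proposal (which used Image-GPT for G): the patch is masked by zero-filling, F is a single tanh layer, G is a unit-variance Gaussian whose mean is linear in the features, and the names `split_patch`, `F`, `G_mean`, `W_f`, and `W_g` are hypothetical.

```python
import numpy as np

def split_patch(x, top, left, size):
    """Return (C(x), M(x)): a square patch, and the image with that patch zeroed out.
    Zero-filling is an assumed masking scheme."""
    patch = x[top:top + size, left:left + size].copy()
    masked = x.copy()
    masked[top:top + size, left:left + size] = 0.0
    return patch, masked

def F(masked, W_f):
    """Hypothetical feature extractor: one linear layer followed by tanh."""
    return np.tanh(W_f @ masked.ravel())

def G_mean(f, W_g):
    """Hypothetical generative model: mean of a unit-variance Gaussian over the patch."""
    return W_g @ f

def neg_log_likelihood(patch, f, W_g):
    """-log P(C(x) | G(f)) under the unit-variance Gaussian assumed above."""
    diff = patch.ravel() - G_mean(f, W_g)
    return 0.5 * np.sum(diff ** 2) + 0.5 * diff.size * np.log(2 * np.pi)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))                      # stand-in for an image
patch, masked = split_patch(x, top=2, left=3, size=2)
W_f = rng.normal(scale=0.1, size=(4, 64))        # 64 pixels -> 4 features
W_g = rng.normal(scale=0.1, size=(4, 4))         # 4 features -> 4 patch pixels
f = F(masked, W_f)
loss = neg_log_likelihood(patch, f, W_g)         # quantity to minimize over W_f and W_g
print(loss)
```

Training would minimize this loss over many images and patch locations; with real networks in place of the linear maps, the same structure applies.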
My idea was then that F might be kinda messy, because if there’s a lot of redundant information, F doesn’t need to capture all copies of it, and so we probably shouldn’t use F. But since G is a generative model, it is incentivized to put the redundant information into each of its output variables where the information would occur. So we would expect G to be quite robust.
If we wanted to abstract the important faraway features of an image while leaving behind the unimportant noise, we could thus just invert G, essentially asking “what faraway-relevant features might account for this patch?”.
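As a toy illustration of what “inverting G” could look like: given a patch, search for the feature vector that makes the patch most likely under G. The sketch below assumes the same hypothetical linear, unit-variance Gaussian G (mean `W_g @ f`) and does gradient ascent on the log-likelihood; with a real neural G you would do the same thing with automatic differentiation, or train a separate encoder. The function name `invert_G` and the example matrix are made up for illustration.

```python
import numpy as np

def invert_G(y, W_g, steps=200, lr=0.1):
    """Find features f maximizing log P(y | G(f)) for a unit-variance Gaussian
    with mean W_g @ f, by gradient ascent starting from f = 0."""
    f = np.zeros(W_g.shape[1])
    for _ in range(steps):
        residual = y - W_g @ f           # how far the patch is from G's predicted mean
        f = f + lr * (W_g.T @ residual)  # gradient of the log-likelihood w.r.t. f
    return f

# Toy G: 2 features -> 4 patch pixels (matrix chosen by hand for the example).
W_g = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [1.0, -1.0]])
f_true = np.array([0.5, -1.0])
y = W_g @ f_true                         # a patch generated from known features
f_hat = invert_G(y, W_g)
print(f_hat)
```

In this noiseless linear case the recovered features match the ones that generated the patch; for a nonlinear G the inversion would generally be approximate and might have multiple solutions.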
I wrote more discussion about it in the #general channel of the EleutherAI Discord back on the 12th-13th of March, 2021.
Looks like all the followups to BEiT masked autoencoder image modeling are doing similar things: https://arxiv.org/abs/2202.03382 https://arxiv.org/abs/2202.03026 https://arxiv.org/abs/2202.04200