Let’s see if I get this right...
Let’s interpret the set $X$ as the set of all possible visual sensory experiences $x = (x_1, \dots, x_n)$, where $x_i$ defines the color of the $i$-th pixel.
Different distributions over elements of this set correspond to observing different objects; for example, we can have $P_{\text{car}}(X)$ and $P_{\text{apple}}(X)$, corresponding to us predicting different sensory experiences when looking at cars vs. apples.
Let’s take some specific set of observations $X_O \subset X$, from which we’d be trying to derive a latent.
We assume uncertainty regarding what objects generated the training-set observations, getting a mixture of distributions $Q_\alpha(X_O) = \alpha P_{\text{car}}(X_O) + (1 - \alpha) P_{\text{apple}}(X_O)$.
We derive a natural latent $\Lambda$ for $Q_\alpha(X_O)$, such that $Q_\alpha(X_O \mid \Lambda) = \prod_{x \in X_O} Q_\alpha(x \mid \Lambda)$ for all allowed $\alpha$.
This necessarily implies that $\Lambda$ also induces independence between different sensory experiences for each individual distribution in the mixture: $P_{\text{car}}(X_O \mid \Lambda) = \prod_{x \in X_O} P_{\text{car}}(x \mid \Lambda)$ and $P_{\text{apple}}(X_O \mid \Lambda) = \prod_{x \in X_O} P_{\text{apple}}(x \mid \Lambda)$. (This follows directly from the $\alpha = 1$ and $\alpha = 0$ endpoints, at which $Q_\alpha$ just is $P_{\text{car}}$ or $P_{\text{apple}}$.)
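A minimal sketch of what I have in mind, in case it helps (a toy discrete model, all numbers hypothetical): three binary “pixels” are i.i.d. given a shared color latent, and $P_{\text{car}}$ and $P_{\text{apple}}$ differ only in their priors over that latent, so the factorization holds at every mixture weight, endpoints included:

```python
import itertools

# Toy model (all numbers hypothetical): Lambda = shared color in {red, green};
# three binary "pixels" X_1..X_3 are i.i.d. given Lambda. P_car and P_apple
# differ only in their prior over Lambda.
P_LAMBDA = {"car": {"red": 0.8, "green": 0.2},
            "apple": {"red": 0.95, "green": 0.05}}
P_X = {"red": 0.9, "green": 0.1}  # P(X_i = 1 | Lambda), shared by both objects

def joint(alpha, p_x_by_obj=None):
    """Joint Q_alpha(Lambda, X) for the mixture alpha*P_car + (1-alpha)*P_apple."""
    p_x_by_obj = p_x_by_obj or {"car": P_X, "apple": P_X}
    q = {}
    for obj, w in (("car", alpha), ("apple", 1.0 - alpha)):
        for lam in ("red", "green"):
            for x in itertools.product((0, 1), repeat=3):
                p = w * P_LAMBDA[obj][lam]
                for xi in x:
                    p *= p_x_by_obj[obj][lam] if xi else 1.0 - p_x_by_obj[obj][lam]
                q[(lam, x)] = q.get((lam, x), 0.0) + p
    return q

def factorization_gap(q):
    """Max |Q(X|Lam) - prod_i Q(X_i|Lam)| over (Lam, X): ~0 iff Lambda mediates."""
    gap = 0.0
    for lam in ("red", "green"):
        z = sum(p for (l, _), p in q.items() if l == lam)  # Q(Lambda = lam)
        for x in itertools.product((0, 1), repeat=3):
            cond = q[(lam, x)] / z
            prod = 1.0
            for i in range(3):
                prod *= sum(p for (l, xs), p in q.items()
                            if l == lam and xs[i] == x[i]) / z
            gap = max(gap, abs(cond - prod))
    return gap

# ~0 at every alpha, including the alpha = 0 and alpha = 1 endpoints:
for a in (0.0, 0.3, 1.0):
    print(a, factorization_gap(joint(a)))
```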
If the set $X_O$ contains some observations generated by cars and some observations generated by apples, yet a nontrivial latent over the entire set nonetheless exists, then this latent must summarize information about some feature shared by both objects.
For example, perhaps it transpired that all cars depicted in this dataset are red, and all apples in this dataset are red, so $\Lambda = \Lambda_{\text{redness}}$ ends up as “the concept of redness”.
This latent could then, prospectively, be applied to new objects. If we later learn of the existence of $P_{\text{ink}}(X)$ – an object whose observation predicts yet another distribution over visual experiences – then $\Lambda_{\text{redness}}$ would “know” how to handle it “out of the box”. For example, if we have a set of observations $X_{O'}$ containing some red cars and some red ink, then $\Lambda_{\text{redness}}$ would be natural over this set under both distributions, without us needing to recompute it.
This trick could be applied to learning new “features” of objects. Suppose we have some established observation-sets $X_{\text{cars}}$ and $X_{\text{apples}}$, which have nontrivial natural latents $\Lambda_{\text{car}}$ and $\Lambda_{\text{apple}}$. To find new “object-agnostic” latents, we can try to form new observation-sets from subsets of those observations, define the corresponding distributions, and check whether mixtures of distributions over those subsets have nontrivial latents.
Formally: $X_{\text{test}} = X_{\text{specific-cars}} \cup X_{\text{specific-apples}}$, where $X_{\text{specific-cars}} \subset X_{\text{cars}}$ and $X_{\text{specific-apples}} \subset X_{\text{apples}}$; then $Q_\alpha(X_{\text{test}}) = \alpha P_{\text{car}}(X_{\text{test}}) + (1 - \alpha) P_{\text{apple}}(X_{\text{test}})$, and we want to see whether there’s a new $\Lambda$ that induces (approximate) independence between all $x \in X_{\text{test}}$ under both the “car” and the “apple” distributions.
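The acceptance test could then look something like this sketch (reusing `joint` and `factorization_gap` from above; the grid of $\alpha$ values and the tolerance are arbitrary choices of mine):

```python
def passes_mixture_test(gap_at_alpha, alphas=(0.0, 0.25, 0.5, 0.75, 1.0), tol=1e-6):
    """Accept a candidate latent iff it (approximately) mediates at every mixture
    weight; alpha = 0 and alpha = 1 cover the pure apple/car distributions."""
    return all(gap_at_alpha(a) < tol for a in alphas)

print(passes_mixture_test(lambda a: factorization_gap(joint(a))))  # True in the toy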
Though note that it could also be done the other way around: we could first learn the latents of “redness” and, e.g., “greenness” by grouping all observations containing red and all observations containing green, then try to find some subsets of those sets which also have nontrivial natural latents, and end up deriving the latent of “car” by grouping all red and green objects that happen to be cars.
(Which is to say, I’m not sure there’s a sharp divide between “adjectives” and “nouns” in this formulation. “The property of car-ness” is interpretable as an adjective here, and “greenery” is interpretable as a noun.)
I’d also expect that the latent over $X_{\text{red-cars}}$, i.e. $\Lambda_{\text{red-car}}$, could be constructed out of $\Lambda_{\text{car}}$ and $\Lambda_{\text{redness}}$ (derived, respectively, from a pure-cars dataset and an all-red dataset)? In other words, if we simultaneously condition a dataset of red cars on a latent derived from a dataset of any-colored cars and a latent derived from a dataset of red-colored objects, then this combined latent $\Lambda_{\text{redness}} \cdot \Lambda_{\text{car}}$ would induce independence across $X_{\text{red-cars}}$ (which $\Lambda_{\text{car}}$ wouldn’t be able to do on its own, due to the instances sharing color-related information in addition to car-ness)?
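Here’s how I’d check that numerically, in the same toy style (two independent latents, each “pixel” now a (color bit, shape bit) pair; all numbers hypothetical): conditioning on the shape latent alone leaves color correlations between pixels, while conditioning on the pair removes them:

```python
import itertools
from collections import defaultdict

# Two latents (hypothetical numbers): color in {red, green}, shape in {car, apple}.
# Each of two "pixels" is a (color bit, shape bit) pair, i.i.d. given both latents.
P_COL = {"red": 0.6, "green": 0.4}
P_SHP = {"car": 0.7, "apple": 0.3}
PC = {"red": 0.9, "green": 0.1}   # P(color bit = 1 | color latent)
PS = {"car": 0.8, "apple": 0.2}   # P(shape bit = 1 | shape latent)

def joint2():
    q = {}
    pixel_vals = list(itertools.product((0, 1), (0, 1)))
    for lc in P_COL:
        for ls in P_SHP:
            for x in itertools.product(pixel_vals, repeat=2):
                p = P_COL[lc] * P_SHP[ls]
                for c, s in x:
                    p *= (PC[lc] if c else 1 - PC[lc]) * (PS[ls] if s else 1 - PS[ls])
                q[(lc, ls, x)] = p
    return q

def mediation_gap(q, project):
    """Gap from pairwise independence of the two pixels given project(lc, ls)."""
    gap = 0.0
    for v in {project(lc, ls) for (lc, ls, _) in q}:
        sub = defaultdict(float)
        for (lc, ls, x), p in q.items():
            if project(lc, ls) == v:
                sub[x] += p  # marginalize out whatever project() discards
        z = sum(sub.values())
        for x, p in sub.items():
            m0 = sum(pp for xs, pp in sub.items() if xs[0] == x[0]) / z
            m1 = sum(pp for xs, pp in sub.items() if xs[1] == x[1]) / z
            gap = max(gap, abs(p / z - m0 * m1))
    return gap

q2 = joint2()
print(mediation_gap(q2, lambda lc, ls: ls))        # > 0: shape latent alone fails
print(mediation_gap(q2, lambda lc, ls: (lc, ls)))  # ~0: combined latent mediates
```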
All of this is interesting mostly in the approximate-latent regime (which lets us avoid the not-robust-to-tiny-mixtures trap), and in situations in which we already have some established latents that we want to break down into interoperable features.
In principle, if we have, e.g., two sets of observations that we already know correspond to nontrivial latents, such as $X_{\text{cars}}$ and $X_{\text{apples}}$, we could directly try to find subsets of their union that correspond to new nontrivial latents, in the hopes of recovering some features that’d correspond to grouping observations along some other dimension.
But if we already have established “object-typed” probability distributions $P_{\text{car}}(X)$ and $P_{\text{apple}}(X)$, then hypothesizing that the observations are generated by an arbitrary mixture of these distributions allows us to “wash out” any information that doesn’t actually correspond to some robustly shared features of cars-or-apples.
That is: consider if $X_{\text{test}}$ is 99% cars, 1% apples. Then an approximately correct natural latent over it is basically just $\Lambda_{\text{car}}$, maybe with some additional noise from apples thrown in. This is what we’d get if we used the “naive” procedure in (1) above. But if we’re allowed to vary the mixture, then ramping up the “apple” distribution (defining $Q_{\alpha=0.01}(X)$, say) would assign low probabilities to all observations corresponding to cars, and now the approximately correct natural latent over this dataset would have more apple-like qualities. The demand for the latent to be valid for arbitrary $\alpha \in [0, 1]$ then “washes out” all traces of car-ness and apple-ness, leaving only redness.
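One way to see the washing-out numerically, still in the first toy model (hypothetical numbers): if the redness latent secretly leaks object-specific information – here, by letting the pixel conditionals differ slightly between cars and apples – it still mediates exactly at the $\alpha = 0$ and $\alpha = 1$ endpoints, and nearly so on a 99%-cars mixture, but the gap peaks at intermediate $\alpha$, so the demand for validity at every $\alpha$ is the binding constraint:

```python
# Same toy as above, but the pixel conditionals now differ by object
# (hypothetical numbers), i.e. the color latent carries an object-specific leak:
leaky = {"car":   {"red": 0.9, "green": 0.1},
         "apple": {"red": 0.8, "green": 0.2}}
for a in (0.0, 0.01, 0.5, 0.99, 1.0):
    print(a, factorization_gap(joint(a, p_x_by_obj=leaky)))
# ~0 at the endpoints, largest near alpha = 0.5: only latents whose conditionals
# are genuinely shared across objects survive the all-alpha requirement.
```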
Is this about right? I’m getting a vague sense of some disconnect between this formulation and the OP...