Let’s say every day at the office, we get three boxes of donuts, numbered 1, 2, and 3. I grab a donut from each box, plunk them down on napkins helpfully labeled X1, X2, and X3. The donuts vary in two aspects: size (big or small) and flavor (vanilla or chocolate). Across all boxes, the ratio of big to small donuts remains consistent. However, Boxes 1 and 2 share the same vanilla-to-chocolate ratio, which is different from that of Box 3.
Does the correlation between X1 and X2 imply that there is no natural latent? Is this the desired behavior of natural latents, despite the presence of the common size ratio? (and the commonality that I’ve only ever pulled out donuts; there has never been a tennis ball in any of the boxes!)
If so, why is this what we want from natural latents? If not, how does a natural latent arise despite the internal correlation?
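To make the setup concrete, here is a quick simulation. A flagged assumption on my part: the ratios themselves are treated as latent parameters that vary from day to day, since that is what makes the draws correlated with each other at all.

import numpy as np

rng = np.random.default_rng(0)

def sample_day(rng):
    # One day's draw (X1, X2, X3) under the assumed generative story.
    p_big = rng.uniform(0.2, 0.8)         # size ratio shared by all three boxes (assumption)
    p_vanilla_12 = rng.uniform(0.2, 0.8)  # flavor ratio shared by boxes 1 and 2 (assumption)
    p_vanilla_3 = rng.uniform(0.2, 0.8)   # box 3's own flavor ratio (assumption)
    donuts = []
    for p_vanilla in (p_vanilla_12, p_vanilla_12, p_vanilla_3):
        size = "big" if rng.random() < p_big else "small"
        flavor = "vanilla" if rng.random() < p_vanilla else "chocolate"
        donuts.append((size, flavor))
    return tuple(donuts)

days = [sample_day(rng) for _ in range(10_000)]
# In these samples, X1 and X2 agree on flavor more often than X1 and X3 do,
# while all three agree on size equally often: the extra X1–X2 correlation
# comes entirely from the shared flavor ratio.

Under this story, X1 and X2 share both the size parameter and a flavor parameter, while X3 shares only the size parameter with them.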
My take would be to split each “donut” variable Xi into “donut size” Si and “donut flavor” Fi. Then there’s a natural latent for the whole {Si} set of variables, and no natural latent for the whole {Fi} set. {Fi} basically becomes the “other stuff in the world” Z variable relative to {Si}.
Granted, there’s an issue: we can basically do that for any set of variables Xi, even entirely unrelated ones, by deliberately searching for some decomposition of Xi into an Si and an Fi such that there’s a natural latent for Si. I think some more practical measures could be taken into account here, though, to ensure that the abstractions we find are useful. For example, we can check the relative information contents/entropies of {Xi} and {Si}, thereby measuring “how much” of the initial variable-set we’re abstracting over. If it’s too little, that’s not a useful abstraction.[1]
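Roughly the kind of check I have in mind, sketched on toy samples (the helper and the data here are made up purely for illustration):

import numpy as np
from collections import Counter

def empirical_entropy(samples):
    # Shannon entropy, in bits, of the empirical distribution over the samples.
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
# Toy samples of (X1, X2, X3), where each Xi is a (size, flavor) pair.
xs = [
    tuple((rng.choice(["big", "small"]), rng.choice(["vanilla", "chocolate"])) for _ in range(3))
    for _ in range(5_000)
]
# Project each sample down to its "size part" {Si}.
ss = [tuple(size for size, _ in x) for x in xs]

# Fraction of the variables' joint information content retained by the abstracted part.
coverage = empirical_entropy(ss) / empirical_entropy(xs)
# If coverage is tiny, {Si} abstracts over hardly any of {Xi}, so it's not a useful abstraction.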
That passes my common-sense check, at least. It’s essentially how we’re able to decompose and group objects along many different dimensions. We can focus on objects’ geometry (and therefore group all sphere-like objects, from billiard balls to planets to weather balloons) or their material (grouping all objects made out of rock) or their origin (grouping all man-made objects), etc.
Each grouping then corresponds to an abstraction, with its own generally-applicable properties. E.g., deriving a “sphere” abstraction lets us discover properties like “volume as a function of radius”, which we can then usefully apply to any spherical object we encounter. Similarly, man-made objects tend to have a purpose/function (unlike natural ones), which likewise lets us usefully reason about that whole category in the abstract.
(Edit: On second thoughts, I think the obvious naive way of doing that just results in {Si} containing all mutual information between Xi, with the “abstraction” then just being said mutual information. Which doesn’t seem very useful. I still think there’s something in that direction, but probably not exactly this.)
Relevant: Finite Factored Sets, which IIRC offer some machinery for these sorts of decompositions of variables.
This branch of research is aimed at finding a (nearly) objective way of thinking about the universe. When I imagine the end result, I picture something that receives a distribution over a bunch of data and finds useful patterns within it. At the moment, that looks like finding patterns in data via
find_natural_latent(get_chunks_of_data(data_distribution))
or perhaps showing that
find_top_n(
    n,
    ((chunks, natural_latent(chunks)) for chunks in all_chunked_subsets_of_data(data_distribution)),
    key=lambda pair: usefulness_metric(pair[1]),
)
is a (convergent sub)goal of agents. As such, the notion that the donuts’ data is simply poorly chunked—which needs to be solved anyway—makes a lot of sense to me.
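To make the shape of that pipeline explicit, here is a minimal runnable skeleton; every function body is a placeholder of my own, not an actual natural-latent finder:

from itertools import combinations

def get_chunks_of_data(data):
    # Placeholder chunking: treat each variable in the dataset as its own chunk.
    return list(data.keys())

def all_chunked_subsets_of_data(data):
    chunks = get_chunks_of_data(data)
    for size in range(2, len(chunks) + 1):
        yield from combinations(chunks, size)

def natural_latent(chunks):
    # Placeholder: stand-in for whatever actually computes/checks a natural latent.
    return tuple(sorted(chunks))

def usefulness_metric(latent):
    # Placeholder: prefer latents that abstract over more chunks.
    return len(latent)

def find_top_n(n, scored_chunkings, key):
    return sorted(scored_chunkings, key=key, reverse=True)[:n]

data_distribution = {"X1": None, "X2": None, "X3": None}
top_abstractions = find_top_n(
    3,
    ((chunks, natural_latent(chunks)) for chunks in all_chunked_subsets_of_data(data_distribution)),
    key=lambda pair: usefulness_metric(pair[1]),
)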
I don’t know how to think about the possibilities when it comes to decomposing Xi. Why would it always be possible to decompose random variables to allow for a natural latent? Do you have an easy example of this?
Also, what do you mean by mutual information between Xi, given that there are at least 3 of them? And why would just extracting said mutual information be useless?
If you get the chance to point me towards good resources about any of these questions, that would be great.
Regarding chunking: a background assumption for me is that the causal structure of the world yields a natural chunking, with each chunk taking up a little local “voxel” of spacetime.
Some amount of spacetime-induced chunking is “forced upon” an embedded agent, in some sense, since their sensors and actuators are localized in spacetime.
Now, there are still degrees of freedom in taking more or less coarse-grained chunkings, and in coarse-graining differently along different spacetime directions or in different places. But I expect that spacetime locality mostly nails down what we need as a starting point for convergent chunking.
You can generalize mutual information to N variables: interaction information.
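For three variables, one common sign convention is I(X;Y;Z) = I(X;Y) - I(X;Y|Z); a minimal sketch estimating it from samples (conventions in the literature differ by a sign):

import numpy as np
from collections import Counter

def entropy(samples):
    # Shannon entropy, in bits, of the empirical distribution over the samples.
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def interaction_information(xs, ys, zs):
    # I(X;Y;Z) = I(X;Y) - I(X;Y|Z), expanded into joint entropies:
    # H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z).
    hx, hy, hz = entropy(xs), entropy(ys), entropy(zs)
    hxy = entropy(list(zip(xs, ys)))
    hxz = entropy(list(zip(xs, zs)))
    hyz = entropy(list(zip(ys, zs)))
    hxyz = entropy(list(zip(xs, ys, zs)))
    return hx + hy + hz - hxy - hxz - hyz + hxyz

Note that, unlike pairwise mutual information, this quantity can be negative.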
Why would it always be possible to decompose random variables to allow for a natural latent?
Well, I suppose I overstated it a bit by saying “always”; you can certainly imagine artificial setups where the mutual information between a bunch of variables is zero. In practice, however, everything in the world is correlated with everything else, so in a real-world setting you’ll likely find such a decomposition always, or almost always.
And why would just extracting said mutual information be useless?
Well, not useless as such – it’s a useful formalism – but it would basically skip everything John and David’s post is describing. Crucially, it won’t uniquely determine whether a specific set of objects represents a well-abstracting category.
The abstraction-finding algorithm should be able to successfully abstract over data if and only if the underlying data actually correspond to some abstraction. If it can abstract over anything, however – any arbitrary bunch of objects – then whatever it is doing, it’s not finding “abstractions”. It may still be useful, but it’s not what we’re looking for here.
Concrete example: if we feed our algorithm 1000 examples of trees, it should output the “tree” abstraction. If we feed our algorithm 200 examples each of car tires, trees, hydrogen atoms, wallpapers, and continental-philosophy papers, it shouldn’t actually find some abstraction which all of these objects are instances of. But as per the everything-is-correlated argument above, they likely have non-zero mutual information, so the naive “find a decomposition for which there’s a natural latent” algorithm would output something for them anyway, rather than correctly outputting nothing.
More broadly: We’re looking for a “true name” of abstractions, and mutual information is sort-of related, but also clearly not precisely it.