Coming from another direction: a 50-bit update can turn Q into P, or vice-versa. So one thing this example shows is that natural latents, as they’re currently formulated, are not necessarily robust to even relatively small updates, since 50 bits can quite dramatically change a distribution.
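(A minimal numeric sketch of the “50-bit update” framing, using toy stand-in distributions rather than the post’s actual P and Q: the expected number of bits of evidence needed to move a Bayesian agent from prior Q to posterior P is $D_{\mathrm{KL}}(P \| Q)$, so “a 50-bit update can turn Q into P” cashes out as that divergence being roughly 50 bits.)

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits, for discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Toy stand-ins (NOT the post's actual construction): P and Q are mirror-image
# mixtures, each putting ~2^-50 weight on the world the other is confident in.
eps = 2.0 ** -50
P = np.array([1 - eps, eps])
Q = np.array([eps, 1 - eps])

print(kl_bits(P, Q))  # ~50 bits: expected evidence needed to turn prior Q into posterior P
print(kl_bits(Q, P))  # ~50 bits in the other direction as well
```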
Are you sure this is undesired behavior? Intuitively, small updates (relative to the information content of the system we’re updating on) can drastically change how we model a particular system and which abstractions we decompose it into. E. g., suppose we have two competing theories of how to predict neural activity in the human brain, and a new paper comes out with some clever (but informationally compact) experiment that yields decisive evidence in favour of one of them. That’s pretty similar to the setup in the post here, no? And reading this paper would lead to significant ontology shifts in the minds of the researchers who read it.
Which brings to mind How Many Bits Of Optimization Can One Bit Of Observation Unlock?, and the counter-example there...
Indeed, now that I’m thinking about it, I’m not sure the quantity $\frac{\text{bit-size of the update}}{\text{bit-size of the system}}$ is in any way interesting at all? Consider that the researchers’ minds could be updated either by reading the paper and examining the experimental procedure in detail (a “medium” number of bits), or by looking at the raw output data and then doing a replication of the paper (a “large” number of bits), or just by reading the names of the authors and skimming the abstract (a “small” number of bits).
There doesn’t seem to be a direct causal connection between the system’s size and the number of bits needed to drastically update on its structure at all? You seem to expect some sort of proportionality between the two, but I think the size of one is straight-up independent of the size of the other if you let the nature of the communication channel between the system and the agent-doing-the-updating vary freely (i. e., if you’re uncertain regarding whether it’s “direct observation of the system” OR “trust in science” OR “trust in the paper’s authors” OR …).[1]
Indeed, merely describing how you need to update in a high-level symbolic language, rather than throwing raw data about the system at you, already shaves off a ton of bits, decoupling “the size of the system” from “the size of the update”.
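To make that concrete with nothing more than the standard Bayes identity (no claim specific to natural latents here): writing the update in log-odds form,

$$\log_2\frac{P(H_1\mid e)}{P(H_2\mid e)} \;=\; \log_2\frac{P(H_1)}{P(H_2)} \;+\; \log_2\frac{P(e\mid H_1)}{P(e\mid H_2)},$$

the only quantity that moves the agent is the log-likelihood ratio of the evidence $e$ that actually arrives through the channel; the description length of the system itself never appears in the formula.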
Perhaps $D_{\mathrm{KL}}$ really isn’t the right metric to use here? The motivation for having natural abstractions in your world-model is that they make the world easier to predict for the purposes of controlling it. So similar-enough natural abstractions would recommend the same policies for navigating that world. Backtracking further, the distributions that would give rise to similar-enough natural abstractions would be the distributions corresponding to worlds for which the policies for navigating them are similar enough...
I. e., the distance metric would need to take interventions/the do operator into account. Something like SID comes to mind (but not literally SID, I expect).
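A minimal sketch of why an interventional distance can separate models that $D_{\mathrm{KL}}$ on observational distributions cannot (toy models made up for illustration, not anything from the OP): two causal structures that induce literally identical joint distributions, hence zero observational KL divergence, still respond to do(X) in completely different ways.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Two hypothetical causal models with IDENTICAL observational distributions:
#   Model A: X ~ Bernoulli(0.5), Y := X   (X causes Y)
#   Model B: Y ~ Bernoulli(0.5), X := Y   (Y causes X)

def sample_A(do_x=None):
    x = rng.integers(0, 2, N) if do_x is None else np.full(N, do_x)
    y = x.copy()                                          # Y copies X
    return x, y

def sample_B(do_x=None):
    y = rng.integers(0, 2, N)                             # Y is generated first
    x = y.copy() if do_x is None else np.full(N, do_x)    # do(X) cuts the Y->X edge
    return x, y

# Observationally indistinguishable: X == Y always, with Y ~ Bernoulli(0.5) in both...
print(np.mean(sample_A()[1]), np.mean(sample_B()[1]))              # both ~0.5
# ...but do(X=1) separates them sharply:
print(np.mean(sample_A(do_x=1)[1]), np.mean(sample_B(do_x=1)[1]))  # ~1.0 vs ~0.5
```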
Though there may be some more interesting claim regarding that entire channel? E. g., that if the agent can update drastically just based on a few bits output by this channel, we have to assume that the channel contains “information funnels” which compress/summarize the raw state of the system down? That these updates have to be entangled with at least however-many-bits describing the ground-truth state of the system, for them to be valid?
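For what it’s worth, the standard data-processing inequality looks like the formal core of that “information funnel” intuition. Writing $S$ for the ground-truth state of the system, $C$ for the channel’s output, and $U$ for the agent’s resulting update, and assuming $S \to C \to U$ forms a Markov chain (the agent only sees the channel):

$$I(S;U) \;\le\; I(S;C) \;\le\; H(C),$$

so an update can only be entangled with the system’s state to the extent that the channel has already funneled that information into its (possibly very compressed) output.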
We actually started from that counterexample, and the tiny mixtures example grew out of it.
In the context of alignment, we want to be able to pin down which concepts we are referring to, and natural latents were (as I understand it) partly meant to be a solution to that. However, if there are multiple different concepts that fit the same natural latent but function very differently, then that doesn’t seem to solve the alignment aspect.
I do see the intuitive angle of “two agents exposed to mostly-similar training sets should be expected to develop the same natural abstractions, which would allow us to translate between the ontologies of different ML models and between ML models and humans”, and that this post illustrated how one operationalization of this idea failed.
That’s not quite what this post shows, I think? It’s not that there are multiple concepts that fit the same natural latent; it’s that if we have two distributions that are judged very close by the KL divergence, and we derive the natural latents for each, those latents may turn out drastically different. The P agent and the Q agent legitimately live in epistemically very different worlds!
Which is likely not actually the case for slightly different training sets, or for LLMs’ training sets vs. humans’ life experiences. Those are very close on some metric X, and it now seems that X isn’t (just) $D_{\mathrm{KL}}$.
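One way to see how “very close by KL” and “epistemically very different” can coexist (an illustrative toy, not necessarily the OP’s exact construction): mixing an arbitrarily alien component into P with weight $\epsilon$ moves $D_{\mathrm{KL}}(P \| Q)$ by at most $\log_2\frac{1}{1-\epsilon} \approx \epsilon/\ln 2$ bits, no matter how different that component is.

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits, for discrete distributions given as arrays."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

rng = np.random.default_rng(0)

# Illustrative toy: P is an arbitrary distribution over 1000 outcomes, R is a
# wildly different one (all mass on the outcome P considers least likely).
P = rng.dirichlet(np.ones(1000))
R = np.zeros(1000)
R[np.argmin(P)] = 1.0

eps = 2.0 ** -20      # a "tiny mixture" weight (2^-50 would be tinier still,
                      # but floating point gets noisy at that scale)
Q = (1 - eps) * P + eps * R

print(kl_bits(P, Q))            # about a millionth of a bit
print(np.log2(1 / (1 - eps)))   # the general upper bound: ~eps / ln 2 bits
```

So a metric that only looks at the observational distribution barely registers a component that, for the agent who takes it seriously, changes the picture entirely.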
Maybe one way to phrase it is that the X’s represent the “type signature” of the latent, and the type signature is the thing we can most easily hope is shared between the agents, since it’s “out there in the world”: it represents the outward interactions with things. We’d hope to be able to share the latent simply by sharing the type signature, because the other thing that determines the latent is the agents’ distribution, and that distribution is more of an “internal” thing that might be too complicated to work with. But the proof in the OP shows that the type signature is not enough to pin the latent down, even for agents whose models are highly compatible with each other as-measured-by-KL-in-type-signature.
Sure, but what I question is whether the OP shows that the type signature wouldn’t be enough in realistic scenarios where we have two agents trained on somewhat different datasets. It’s not clear that their datasets would differ in the same way P and Q differ here.