First thread: treat each bit in the representation of quantities as a distinct random variable, so that e.g. the higher-order and lower-order bits are separate. Then presumably there will often be good approximate natural latents (and higher-level abstract structures) over the higher-order bits, more so than over the lower-order bits. I would say this is the most obvious starting point, but it also has a major drawback: “bits” of a binary number representation are an extremely artificial ontological choice for purposes of this problem. I’d strongly prefer an approach in which magnitudes drop out more naturally.
Is the idea here to try to find a way to bias the representation towards higher-order bits than lower-order bits in a variable? I don’t think this is necessary, because it seems like you would get it “for free” due to the fact that lower-order bits usually aren’t predictable without the higher-order bits.
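A quick way to see that “for free” effect is to simulate it. The sketch below assumes a shared latent quantity observed twice with independent noise; the 8-bit range, the ±4 noise scale, and the empirical_mi helper are all illustrative choices rather than anything fixed by the discussion above. Almost all of the information shared between the two observations sits in the higher-order bits.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_mi(a, b):
    """Empirical mutual information (in bits) between two binary arrays."""
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            pxy = np.mean((a == x) & (b == y))
            px, py = np.mean(a == x), np.mean(b == y)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

# Shared latent quantity, observed twice with independent measurement noise.
n = 200_000
latent = rng.integers(0, 256, size=n)                          # 8-bit "true" value
obs1 = np.clip(latent + rng.integers(-4, 5, size=n), 0, 255)
obs2 = np.clip(latent + rng.integers(-4, 5, size=n), 0, 255)

# Shared information per bit position: high for the high-order bits,
# near zero for the low-order bits.
for k in reversed(range(8)):
    bit1, bit2 = (obs1 >> k) & 1, (obs2 >> k) & 1
    print(f"bit {k}: I(bit of obs1; bit of obs2) ≈ {empirical_mi(bit1, bit2):.3f} bits")
```

On this toy setup, any approximate natural latent over the two observations is essentially a latent over the higher-order bits; no extra bias towards them needs to be built in.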
The issue I’m talking about is that we don’t want a bias towards higher-order bits, we want a bias towards magnitude. As in: if there are hundreds of dynamics that can be predicted about something going on at the scale of 1 kJ or 1 gram or 1 cm, that’s about 1/10th as important as a single dynamic that can be predicted about something going on at the scale of 1000 MJ or 1 kg or 10 m.
(Obviously on the agency side of things, we have a lot of concepts that allow us to make sense of this, but they all require a representation of the magnitudes, so if the epistemics don’t contain some bias towards magnitudes, the agents might mostly “miss” this.)
Thus the second thread: maxent. It continues to seem like there’s probably a natural way to view natural latents in a maxent form, which would involve numerically-valued natural “features” that get added together. That would provide a much less artificial notion of magnitude. However, it requires figuring out the maxent thing for natural latents, which I’ve tried and failed at several times now (though with progress each time).
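For reference, the generic maxent form such a picture would presumably land on (this is just the standard exponential-family form; the features f_i and multipliers λ_i are placeholders, nothing here is specific to natural latents):

```latex
P[X = x] \;\propto\; \exp\!\Big(\textstyle\sum_i \lambda_i f_i(x)\Big),
\qquad \lambda_i \text{ chosen so that } \mathbb{E}[f_i(X)] = k_i .
```

The f_i(x) are exactly the kind of numerically-valued features that get added together, which is where a less artificial notion of magnitude could come from.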
Hmmm maybe.
You mean something like: the macrostate k is defined as the highest-entropy distribution over microstates that satisfies E[f(microstate)] = k? My immediate skepticism would be that this is still defining the magnitudes epistemically (based on the probabilities in the expectation), whereas I suspect they would have to be based on something like a conservation law or diffusion process. But let’s think it through more carefully:
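Writing that definition out explicitly (standard maxent with a single constraint; β is just the Lagrange multiplier dual to the constraint):

```latex
P[X = x \mid k] \;=\; \frac{1}{Z(\beta)}\, e^{\beta f(x)},
\qquad Z(\beta) = \sum_x e^{\beta f(x)},
\qquad \beta \text{ chosen so that } \mathbb{E}[f(X)] = k .
```

With f the energy, this is the usual Boltzmann distribution (β playing the role of −1/k_BT).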
It seems like we’d generally not expect to be able to directly observe a microstate. So we’d really use something like E[f(adjacent macrostate) | condition] = k; for instance, the classic cases are putting a thermometer against an object, or putting an object on a scale.
But for most natural features g(microstate), your uncertainty about the macrostates would (according to the central thesis of my sequence) be ~lognormally distributed. Since a lognormal distribution is the maxent distribution given the expectations of log(g) and log(g)^2, this means a maximally-informative f would be something like f(adjacent macrostate) ≈ E[(log(g(microstate)), log(g(microstate))^2) | adjacent macrostate].
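Spelling out the maxent fact being used here: fixing the expectations of log g and (log g)^2 and maximizing entropy gives exactly the lognormal family.

```latex
p(g) \;\propto\; \exp\!\big(\lambda_1 \log g + \lambda_2 (\log g)^2\big)
     \;=\; g^{\lambda_1}\, e^{\lambda_2 (\log g)^2},
\qquad \lambda_2 < 0 ,
```

and matching λ₂ = −1/(2σ²), λ₁ = μ/σ² − 1 recovers the usual lognormal density (1/(gσ√(2π))) exp(−(log g − μ)²/(2σ²)).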
And, because the scaling of f is decided by the probabilities, its scaling is only defined up to a multiplicative factor p, which means g is only defined up to a power, such that g(...)^p would be as natural as g(...).
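Making that rescaling explicit: swapping g for g^p just rescales the constrained quantities, so it picks out the same maxent family.

```latex
\log(g^p) = p \log g,
\qquad
\big(\log(g^p)\big)^2 = p^2 (\log g)^2 ,
```

so constraints on the expectations of log(g^p) and (log(g^p))^2 are rescaled versions of the constraints on log g and (log g)^2, with the factors of p absorbed into λ₁ and λ₂; the resulting family of distributions is identical, hence g^p is exactly as natural as g.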
Which undermines the possibility of addition, because ∑_i x_i^p ≠ (∑_i x_i)^p (e.g. 1^2 + 1^2 = 2, while (1 + 1)^2 = 4).
As a side-note, a slogan I’ve found which communicates the relevant intuition is “information is logarithmic”. I like to imagine that the “ideal” information-theoretic encoder is a function h such that h(v⊗w)=h(v)⊕h(w) (converting tensor products to direct sums). Of course, this is kind of underdefined, doesn’t even typecheck, and can easily be used to derive contradictions; but I find it gives the right intuition in a lot of places, so I expect to eventually find a cleaner way to express it.
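One concrete sense in which the slogan does hold up is at the level of dimensions (reading h, loosely, as a map between spaces):

```latex
\dim(V \otimes W) \;=\; \dim(V)\cdot\dim(W),
\qquad
\dim(V \oplus W) \;=\; \dim(V) + \dim(W),
```

so an h with h(v⊗w) = h(v)⊕h(w) would turn multiplication of state-space sizes into addition of them, the same logarithm-like behavior as information = log(number of states).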