How sure are we that models will keep tracking Bayesian belief states, and so allow this inverse reasoning to be used, when they don’t have enough space and compute to actually track a distribution over latent states?
One obvious guess there would be that the factorization structure is exploited, e.g. independence and especially conditional independence/DAG structure. And then a big question is how distributions of conditionally independent latents in particular end up embedded.
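To make the "exploit the factorization" idea concrete, here is a minimal sketch (my own illustrative example, not from the thread): if each observation only informs one of n independent latent components, the belief state factorizes into n marginals of size s, and a Bayesian update only ever touches one of those marginals. The numbers and the `likelihoods` array below are made up.

```python
import numpy as np

# Toy setup: n independent latent components, each with s states.
# Each observation is tagged with which component it informs, so the
# posterior over the joint latent factorizes into n marginals of size s.
n, s = 4, 3
rng = np.random.default_rng(0)

# P(obs_symbol | component i in state k); purely illustrative, 5 obs symbols.
likelihoods = rng.dirichlet(np.ones(5), size=(n, s))  # shape (n, s, 5)

# Factorized belief: n separate marginals instead of one table of size s**n.
beliefs = np.full((n, s), 1.0 / s)

def update(component, obs_symbol):
    """Bayes update for one component's marginal; the other n-1 marginals
    are untouched because the latents are independent."""
    beliefs[component] *= likelihoods[component, :, obs_symbol]
    beliefs[component] /= beliefs[component].sum()

update(component=2, obs_symbol=4)
print(beliefs[2])    # updated marginal for component 2
print(beliefs.size)  # n*s numbers tracked, vs s**n entries for the full joint
```

The point is just that the stored object scales like n×s rather than s^n once the independence structure is used.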
Right. If I have n fully independent latent variables that suffice to describe the state of the system, each of which can be in one of s different states, then even tracking the probability of every state for every latent with a p-bit float will only take me about n×s×p bits. That’s actually not that bad compared to the n×log₂(s) bits for just tracking some max-likelihood guess.
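Spelling that arithmetic out with made-up numbers (n, s, p below are purely illustrative, not from the comment):

```python
import math

# Illustrative sizes: n latents, s states each, p-bit floats for probabilities.
n, s, p = 100, 16, 16

factored_belief_bits = n * s * p                    # one p-bit probability per state per latent
max_likelihood_bits = n * math.ceil(math.log2(s))   # just the argmax state per latent
full_joint_bits = (s ** n) * p                      # naive table over the whole joint state space

print(factored_belief_bits)       # 25600 bits
print(max_likelihood_bits)        # 400 bits
print(f"{full_joint_bits:.3g}")   # astronomically larger (~4e121 bits)
```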
There are some theoretical reasons to expect linear representations for variables which are causally separable / independent. See recent work from Victor Veitch’s group, e.g. Concept Algebra for (Score-Based) Text-Controlled Generative Models, The Linear Representation Hypothesis and the Geometry of Large Language Models, On the Origins of Linear Representations in Large Language Models.
Separately, there are theoretical reasons to expect convergence to approximate causal models of the data generating process, e.g. Robust agents learn causal world models.
Linearity might also make it (provably) easier to find the concepts, see Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models.
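As a toy illustration of the "linearity makes the concepts easier to find" point (my own sketch, not taken from the cited papers): if two independent binary concepts are written into a representation along fixed directions plus noise, a simple difference-of-class-means estimator recovers each direction, and the two recovered directions come out nearly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 64, 5000

# Two independent binary concepts, each (by assumption) written into the
# representation along its own fixed direction plus Gaussian noise.
true_dirs = np.linalg.qr(rng.normal(size=(d, 2)))[0].T   # two orthonormal directions, shape (2, d)
concepts = rng.integers(0, 2, size=(n_samples, 2))       # independent 0/1 concept labels
reps = concepts @ true_dirs + 0.3 * rng.normal(size=(n_samples, d))

def concept_direction(reps, labels):
    """Difference-of-means estimate of a concept's linear direction."""
    v = reps[labels == 1].mean(axis=0) - reps[labels == 0].mean(axis=0)
    return v / np.linalg.norm(v)

est = np.stack([concept_direction(reps, concepts[:, i]) for i in range(2)])
print(np.abs(est @ true_dirs.T).round(3))  # ~identity: each estimate matches its own direction
```

Difference-of-means here is just a stand-in for "a cheap linear method suffices"; the cited papers give the actual conditions under which such linear structure is expected.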