How We Picture Bayesian Agents
I think that when most people picture a Bayesian agent, they imagine a system which:
- Enumerates every possible state/trajectory of “the world”, and assigns a probability to each.
- When new observations come in, loops over every state/trajectory, checks the probability of the observations conditional on each, and then updates via Bayes’ rule.
- To select actions, computes the utility which each action will yield under each state/trajectory, then averages over states/trajectories weighted by probability, and picks the action with the largest weighted-average utility.
Typically, we define Bayesian agents as agents which behaviorally match that picture.
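To make that brute-force picture concrete, here is a minimal Python sketch of it. Everything specific in it (the two states, the observation names, and all the numbers) is made up for illustration; it is a toy under those assumptions, not a claim about how any real agent is built.

```python
# Toy brute-force Bayesian agent. All states, probabilities, and utilities
# below are hypothetical numbers chosen only for illustration.

STATES = ["rain", "sun"]
ACTIONS = ["umbrella", "no_umbrella"]

# Step 1: a probability for every enumerated state.
prior = {"rain": 0.3, "sun": 0.7}

# P(observation | state)
likelihood = {
    "rain": {"dark_clouds": 0.8, "clear_sky": 0.2},
    "sun": {"dark_clouds": 0.1, "clear_sky": 0.9},
}

# Utility of each action under each state.
utility = {
    ("umbrella", "rain"): 1.0,
    ("umbrella", "sun"): -0.5,
    ("no_umbrella", "rain"): -1.0,
    ("no_umbrella", "sun"): 0.5,
}


def update(belief, observation):
    """Step 2: loop over every state and apply Bayes' rule."""
    unnormalized = {s: belief[s] * likelihood[s][observation] for s in STATES}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


def expected_utility(belief, action):
    """Probability-weighted average utility of one action."""
    return sum(belief[s] * utility[(action, s)] for s in STATES)


def choose_action(belief):
    """Step 3: pick the action with the largest expected utility."""
    return max(ACTIONS, key=lambda a: expected_utility(belief, a))


posterior = update(prior, "dark_clouds")
best = choose_action(posterior)
print(posterior, best)
```

With these numbers the agent prefers `no_umbrella` under the prior and switches to `umbrella` after seeing dark clouds. Note that every function loops over all of `STATES`, which is exactly the part that stops scaling once "the world" is large.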
But that’s not really the picture David and I typically have in mind, when we picture Bayesian agents. Yes, behaviorally they act that way. But I think people get overly-anchored imagining the internals of the agent that way, and then mistakenly imagine that a Bayesian model of agency is incompatible with various features of real-world agents (e.g. humans) which a Bayesian framework can in fact handle quite well.
So this post is about our prototypical mental picture of a “Bayesian agent”, and how it diverges from the basic behavioral picture.
Causal Models and Submodels
Probably you’ve heard of causal diagrams or Bayes nets by now.
If our Bayesian agent’s world model is represented via a big causal diagram, then that already looks quite different from the original “enumerate all states/trajectories” picture. Assuming reasonable sparsity, the data structures representing the causal model (i.e. graph + conditional probabilities on each node) take up an amount of space which grows linearly with the size of the world, rather than exponentially. It’s still too big for an agent embedded in the world to store in its head directly, but much smaller than the brute-force version.
(Also, a realistic agent would want to explicitly represent more than just one causal diagram, in order to have uncertainty over causal structure. But that will largely be subsumed by our next point anyway.)
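A quick back-of-the-envelope comparison of the two representations, as a Python sketch. It assumes binary variables and a hypothetical sparsity bound `max_parents` on how many parents any node has; both are simplifications chosen for illustration.

```python
# Compare storage for a full joint table vs. a sparse Bayes net over
# n binary variables. `max_parents` is a hypothetical sparsity bound.


def joint_table_size(n):
    """Independent probabilities in an explicit joint over n binary variables."""
    return 2 ** n - 1


def sparse_net_size(n, max_parents):
    """Upper bound on conditional-probability entries when each node has
    at most `max_parents` binary parents: one number per parent setting."""
    return n * 2 ** max_parents


for n in (10, 20, 40):
    print(n, joint_table_size(n), sparse_net_size(n, max_parents=3))
```

Doubling `n` doubles the sparse-net bound but squares (roughly) the joint-table size, which is the linear-versus-exponential gap described above.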
Much more efficiency can be achieved by representing causal models like we represent programs. For instance, this little “program”:
factorial = Model {
    n = 4
    base_result = 1
    recurse_result = do(factorial, n=n-1).result
    result = (n == 0) ? base_result : n * recurse_result
}
… is in fact a recursively-defined causal model. It compactly represents an infinite causal diagram, corresponding to the unrolled computation. (See the linked post for more details on how this works.)
Conceptually, this sort of representation involves lots of causal “submodels” which “call” each other—or, to put it differently, lots of little diagram-pieces which can be wired together and reused in the full world-model. Reuse means that such models can represent worlds which are “bigger than” the memory available to the agent itself, so long as those worlds have lots of compressible structure—e.g. the factorial example above, which represents an infinite causal diagram using a finite representation.
(Aside: those familiar with probabilistic programming could view this world-model representation as simply a probabilistic program.)
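One way to see how a finite definition can stand in for an infinite diagram is to evaluate it lazily. The Python sketch below is a rough analogue of the factorial model; the `Model` class, its `do` method, and its `value` method are hypothetical stand-ins written for this example, not an existing library or the notation's actual semantics.

```python
# Sketch of a lazily-unrolled, recursively-defined causal model. The `Model`
# class and its methods are hypothetical, written only for this example.


class Model:
    def __init__(self, nodes, overrides=None):
        self.nodes = nodes                     # name -> function(model) -> value
        self.overrides = dict(overrides or {})  # interventions: name -> value
        self.cache = {}

    def do(self, **interventions):
        """Return a fresh copy of the model with some nodes held fixed."""
        return Model(self.nodes, {**self.overrides, **interventions})

    def value(self, name):
        """Evaluate a node on demand, honoring interventions first."""
        if name in self.overrides:
            return self.overrides[name]
        if name not in self.cache:
            self.cache[name] = self.nodes[name](self)
        return self.cache[name]


factorial = Model({
    "n": lambda m: 4,
    "base_result": lambda m: 1,
    # The "call": the same finite model, intervened on so that n is n - 1.
    "recurse_result": lambda m: factorial.do(n=m.value("n") - 1).value("result"),
    # Python's conditional expression only evaluates the branch it takes,
    # so the infinite diagram is unrolled only as deep as needed.
    "result": lambda m: (
        m.value("base_result")
        if m.value("n") == 0
        else m.value("n") * m.value("recurse_result")
    ),
})

print(factorial.value("result"))          # 4! = 24
print(factorial.do(n=6).value("result"))  # 6! = 720
```

The stored object is four node definitions; the "infinite" part only exists as copies created on demand by `do`, and each copy is discarded once its `result` has been read.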
Updates
So we have a style of model which can compactly represent quite large worlds, so long as those worlds have lots of compressible structure. But there’s still the problem of updates on that structure.
Here, we typically imagine some kind of message-passing, though it’s an open problem exactly what such an algorithm looks like for big/complex models.
The key idea here is that most observations are not directly relevant to our submodels of most of the world. I see a bird flying by my office, and that tells me nothing at all about the price of gasoline[1]. So we expect that, the vast majority of the time, message-passing updates of a similar flavor to those used on Bayes nets (though not exactly the same) will quickly converge, without having to explicitly propagate to most of the submodel-nodes.
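As a toy illustration of why updates can stay local, consider forward message passing along a chain of noisy copies, stopping once an update is too small to matter. The chain structure, the copy probability `KEEP`, and the tolerance `TOL` are all assumptions made for this sketch; real world-models are not chains, and the general algorithm is the open problem mentioned above.

```python
# Forward message passing along a chain X0 -> X1 -> ... of binary variables,
# stopping as soon as an update is too small to matter. The chain, the copy
# probability KEEP, and the tolerance TOL are illustrative assumptions.

N_NODES = 1000
KEEP = 0.8   # P(X[i+1] == X[i]); each link is a noisy copy
TOL = 1e-3   # stop propagating once a marginal moves by less than this


def propagate_observation(marginals, observed_value):
    """Update P(X_i = 1) after observing X0, visiting nodes only until the
    change drops below TOL. Returns the number of nodes actually touched."""
    marginals[0] = float(observed_value)
    touched = 1
    for i in range(1, len(marginals)):
        parent = marginals[i - 1]
        new = KEEP * parent + (1 - KEEP) * (1 - parent)
        if abs(new - marginals[i]) < TOL:
            break  # on this chain, downstream changes are smaller still
        marginals[i] = new
        touched += 1
    return touched


# Before any observation every marginal is 0.5 (uniform prior on X0).
marginals = [0.5] * N_NODES
touched = propagate_observation(marginals, observed_value=1)
print(f"updated {touched} of {N_NODES} nodes")
```

Each noisy link shrinks the effect of the observation by a constant factor, so only a handful of nodes near the observation change noticeably and the rest of the chain is never visited.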
Latents
Message-passing on large models does still have some efficiency issues, however. To make things more efficient, we expect that realistic agents typically structure their model around “latent variables” which mediate most interactions. For instance, early 20th century biologists would observe that some species of animals had very similar anatomy, physiology, or behavior—i.e. if one wrote out a giant list of traits, some species would end up with very highly correlated lists. From this, they inferred some latent (i.e. not directly observed) relationship between those species—in this case, shared evolutionary ancestry. The extent to which this inference was correct varied—inferences are sometimes wrong, even when the reasoning is basically right—but either way, that “mediation by latent shared ancestry” pattern sure was how biologists structured their models.
Humans in general seem to do a very similar thing when modeling the world as containing “kinds of things”—i.e. we notice that there’s a cluster of things which have bark, leaves, wood, roots, etc, all connected in a shape with a central trunk recursively branching out both above and below ground… Then we intuitively model all these things as stemming from some latent variable (e.g. “tree-ness”). That latent variable, in our internal models, explains the correlations: a child might ask “why do things which have bark also have roots?”, and we might reply “because they’re trees”. Again, there’s room to argue about how well that answers the child’s question, but the answer does seem to reflect the internal structure of our models either way.
One key issue: different agents could, in principle, model the same environment using different latents; the latents are not necessarily fully determined by the prior + environment. For instance, I could model a bunch of rolls of a biased die as mediated by an unknown “bias”, or I could model them as just a bunch of rolls with some complicated correlations between them. The predictions will be the same. In practice minds mostly seem to converge on quite similar latents, and the general project of natural abstraction is largely aimed at understanding when and why that happens.
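Here is a small Python sketch of the two modeling choices, using coin flips as a two-sided stand-in for the die. The candidate biases and their prior are made-up numbers, and the latent-free table is filled in from the latent model purely so that the two are guaranteed to agree.

```python
from itertools import product

# Two hypothetical models of three flips of a coin with unknown bias (a
# two-sided stand-in for the die). The candidate biases and their prior
# are made-up numbers.

BIAS_PRIOR = {0.2: 0.5, 0.8: 0.5}   # P(bias)
N_FLIPS = 3


def latent_model_prob(flips):
    """Model A: flips are independent given a latent bias."""
    total = 0.0
    for bias, p_bias in BIAS_PRIOR.items():
        p_flips = 1.0
        for f in flips:
            p_flips *= bias if f == 1 else 1 - bias
        total += p_bias * p_flips
    return total


# Model B: no latent at all, just one joint table over flip sequences,
# with whatever correlations between flips the numbers happen to encode.
joint_table = {
    flips: latent_model_prob(flips)
    for flips in product((0, 1), repeat=N_FLIPS)
}


def table_conditional(next_flip, history):
    """P(next flip | earlier flips), read off the joint table alone."""
    k = len(history)
    match = sum(p for f, p in joint_table.items() if f[:k] == history)
    hit = sum(
        p for f, p in joint_table.items()
        if f[:k] == history and f[k] == next_flip
    )
    return hit / match


p_heads = table_conditional(1, ())
p_heads_after_heads = table_conditional(1, (1,))
print(p_heads, p_heads_after_heads)
```

Model B never mentions a bias, yet seeing one head raises its probability of the next head, just as updating on the latent would in Model A. The difference between the two is in internal structure (and in how the table grows with the number of flips), not in predictions.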
Aside: Map-Territory Correspondence
There is no rule saying that the variables in a Bayesian agent’s world-model have anything to do with “things” in their environment. I could totally write a Bayesian agent which models itself as living in Conway’s Game of Life and tries to maximize a utility function defined over things in Conway’s Game of Life (like e.g. number of gliders), but then I could wire up the inputs and outputs of that agent to a photosensor and motor in my office. The agent will mostly be very confused (i.e. its predictions will be wrong a lot), and won’t do anything interesting, but it would be a valid Bayesian agent.
In particular, it’s the latents in the model which don’t need to correspond to anything in the environment. The variables which the agent maps to its observations and actions (as opposed to latents, which are everything else), do have some rigid “correspondence”, because when the agent receives inputs it will map them to its observations, and when the agent yields outputs it will map them to its actions.
A more realistic example: some humans believe in e.g. spirits or the like. Much like the Conway’s Game of Life bot, they are just very confused, and those parts of their world model involving spirits don’t necessarily “correspond to” any actual structure in the world.
… Nonetheless, in practice it seems like most latents in most humans’ models do “correspond to” stuff in the world in some important sense, and understanding that correspondence is another big part of the general project of natural abstraction.
Utility Over Latents
One big reason that latent variables are important is that, insofar as it makes sense to view real-world agents as Bayesians at all, the inputs to those agents’ utility functions are typically latent variables—not observations or actions directly. This follows from common sentiments like “I want my spouse to actually be happy, not just to look-to-me like they’re happy”. “Look-to-me like they’re happy” would be a utility function whose inputs are my own observations directly; “actually be happy” is a utility function whose inputs are latent variables representing my spouse.
For more on this topic, see The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables.
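The distinction can be made concrete with a toy decision in Python. The two actions, the outcome probabilities, and both utility functions are hypothetical; the only point is that the same world model recommends different actions depending on whether utility reads the latent or the observation.

```python
# Hypothetical two-action example. `happy` is a latent variable in the
# agent's model; `looks_happy` is what the agent would observe.

# P(happy, looks_happy | action), as {(happy, looks_happy): probability}
outcome_model = {
    "plan_real_surprise": {(True, True): 0.7, (True, False): 0.1,
                           (False, True): 0.0, (False, False): 0.2},
    "demand_a_smile":     {(True, True): 0.2, (True, False): 0.0,
                           (False, True): 0.8, (False, False): 0.0},
}


def utility_over_latent(happy, looks_happy):
    """'Actually be happy': depends only on the latent."""
    return 1.0 if happy else 0.0


def utility_over_observation(happy, looks_happy):
    """'Look-to-me like they're happy': depends only on the observation."""
    return 1.0 if looks_happy else 0.0


def best_action(utility):
    def expected(action):
        return sum(p * utility(h, l)
                   for (h, l), p in outcome_model[action].items())
    return max(outcome_model, key=expected)


print(best_action(utility_over_latent))       # plan_real_surprise
print(best_action(utility_over_observation))  # demand_a_smile
```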
Lazy Utility Maximization
Even if causal models structured like programs and message-passing and latents allow for efficient updates of models of large worlds (and, to be clear, we don’t think we currently have the whole story here), there’s still the question of how to efficiently maximize expected utility over the model.
A key idea here is that we never actually need to calculate expected utility, in order to maximize it.
For example, suppose I’m deciding what to order for lunch. I expect this decision to be basically irrelevant to the vast majority of things I care about in the world and in life. If I wanted to calculate my full expected utility, I would need to account for all those things, from Dad’s collection of old milk bottles to future tiny genetically engineered dragons. But I don’t need to calculate all that in order to make an expected-utility-maximizing lunch order. I just need to calculate the difference between the utility which I expect if I order lamb Karahi vs a sisig burrito.
… and since my expectations for most of the world are the same under those two options, I should be able to calculate the difference lazily, without having to query most of my world model. Much like the message-passing update, I expect deltas to quickly fall off to zero as things propagate through the model.
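A minimal Python sketch of the lunch example follows. It assumes, purely for illustration, that utility is a sum of per-variable terms and that each option changes expectations for only a couple of variables; the variable names and numbers are invented.

```python
# Sketch of lazy expected-utility comparison. Assumes (for illustration
# only) that utility is a sum of per-variable terms and that each action
# changes expectations for just a few variables.

# Expected utility contribution of every variable under a default world.
baseline = {f"thing_{i}": float(i % 7) for i in range(100_000)}
baseline["lunch_enjoyment"] = 0.0
baseline["afternoon_energy"] = 0.0

# Each option overrides expectations only for the variables it affects.
effects = {
    "lamb_karahi":   {"lunch_enjoyment": 8.0, "afternoon_energy": -1.0},
    "sisig_burrito": {"lunch_enjoyment": 7.0, "afternoon_energy": 0.5},
}


def full_expected_utility(option):
    """Brute force: sum over the entire world model."""
    return sum({**baseline, **effects[option]}.values())


def lazy_difference(option_a, option_b):
    """EU(a) - EU(b), touching only variables either option affects."""
    touched = effects[option_a].keys() | effects[option_b].keys()

    def term(option, var):
        return effects[option].get(var, baseline[var])

    return sum(term(option_a, v) - term(option_b, v) for v in touched)


delta = lazy_difference("lamb_karahi", "sisig_burrito")
print(delta)
```

`lazy_difference` looks at two variables, while `full_expected_utility` sums over a hundred thousand, and both give the same ranking of the two options. In a non-additive model the bookkeeping would be harder, but the hope expressed above is the same: terms that do not differ between options cancel and never need to be computed.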
Caching and Inconsistency
Here we’ll diverge somewhat from a strictly behaviorally Bayesian agent, but in a way which plays particularly well with an otherwise-Bayesian agent.
Richard Bellman popularized the idea of dynamic programming: in this context, making utility maximization calculations more efficient by precomputing and caching the instrumental values of intermediates. Insofar as we imagine our supposedly-Bayesian agent maintaining some instrumental value cache, we open the door to a certain kind of “incoherence”: the values in the cache may, for some reason, be inconsistent with either each other or the agent’s utility function. This sort of incoherence could be locally detected and fixed, by checking whether the cached values locally satisfy the Bellman equation (with the exact flavor of Bellman equation depending on what style of model we’re using for the Bayesian agent).
Similarly, we could imagine caching being useful epistemically, for efficient updates. There again, failures of cache maintenance could result in “inconsistent beliefs”.
If and when cache inconsistency is detected, the agent might require quite a bit of propagation—i.e. thinking and reflection—to sort it out.
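A local Bellman check on a value cache might look like the following Python sketch. It assumes a small deterministic, undiscounted, acyclic problem; the states, rewards, and cached numbers are hypothetical, and a real agent's "flavor of Bellman equation" would depend on its model.

```python
# Local consistency check on a cache of instrumental values, for a small
# deterministic, undiscounted, acyclic problem. States, rewards, and cached
# numbers are all hypothetical.

# transitions[state] = {action: (reward, next_state)}; "done" is terminal.
transitions = {
    "home": {"walk": (-2.0, "shop"), "drive": (-1.0, "shop")},
    "shop": {"buy": (5.0, "done")},
}

# Cached values; the entry for "home" is deliberately stale.
value_cache = {"home": 1.0, "shop": 5.0, "done": 0.0}


def bellman_backup(state, cache):
    """Best one-step reward plus cached value of where that step leads."""
    return max(r + cache[nxt] for r, nxt in transitions[state].values())


def inconsistent_states(cache, tol=1e-9):
    """States whose cached value disagrees with a one-step Bellman backup."""
    return [s for s in transitions
            if abs(cache[s] - bellman_backup(s, cache)) > tol]


def repair(cache):
    """Re-run local backups until every cached value is consistent.
    Terminates here because the transition graph has no cycles."""
    while True:
        bad = inconsistent_states(cache)
        if not bad:
            break
        for s in bad:
            cache[s] = bellman_backup(s, cache)


stale = inconsistent_states(value_cache)
repair(value_cache)
print(stale, value_cache)
```

Each check only reads a state's own entry and its immediate successors, which is what makes the detection local. The repair step is where the "quite a bit of propagation" can come in: fixing one entry may make its predecessors inconsistent in turn.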
Putting It All Together
When we picture a “Bayesian agent”, we’re typically picturing an agent with a world-model which looks basically like a moderately-sized program with a lot of recursion. That “program” represents a big causal model as a bunch of smaller submodels, which get reused and “call” each other.
Updates are performed via some sort of message-passing; we expect that the messages don’t typically need to propagate very far. Similarly, to maximize expected utility, the agent only needs to compute the difference in expected utility between options available in its current decision. As with updates, such differences are expected to typically not propagate very far.
Most of the variables in the model are latents, as opposed to variables directly representing observations or actions. Such latents don’t have to correspond to anything in the world; the fact that they usually seem to correspond to stuff in the world in some sense is an interesting empirical fact, and characterizing that “correspondence” is one big piece of the general project of natural abstraction. One reason such latents are important (even without bringing e.g. language into the picture) is that the inputs to the agent’s utility function are typically latents rather than observations/actions, as in “I want my spouse to actually be happy, not just to look-to-me like they’re happy”.
Finally, if we want to make the model capture certain non-Bayesian human behaviors while still keeping most of the picture, we can assume that instrumental values and/or epistemic updates are cached. This creates the possibility of cache inconsistency/incoherence.
[1] John is clearly a complete amateur at augury, but the meaning here is hopefully still clear.
Yeah to some extent, although it’s stacking the deck when the minds speak the same language and grew up in the same culture. If you instead go to remote tribes, you find plenty of untranslatable words—or more accurately, words that translate to some complicated phrase that you’ve probably never thought about before. (I dug up an example for §4.3 here, in reference to Lisa Feldman Barrett’s extensive chronicling of exotic emotion words from around the world.)
(That’s not necessarily relevant to alignment because we could likewise put AGIs in a training environment with lots of English-language content, and then the AGIs would presumably get English-language concepts.)
You were talking about values and preferences in the previous paragraph, then suddenly switched to “beliefs”. Was that deliberate?
Yes.
This is an exciting observation. I wonder if you could empirically demonstrate that this works in a model-based RL setup, on a video game or something?
I’m not sure I understand this—very far from where? Where do we start with updating? Which beliefs/latents are updated first?
Very far through the graph representing the causal model, where we start from one or a few nodes representing the immediate observations.
Do you have an example?
Say I have the visual impression of a rose, presumably caused by a rose in front of me. Do I then update beliefs involving this rose? And afterwards beliefs about things which caused the rose to exist? E.g. about the gardener? Or perhaps one could say my observation of a rose was caused by my own behavior? Head movements, plans etc.
In my mind, there is an amount of internal confusion which feels much stronger than what I would expect for an agent as in the OP. Or is the idea possibly that everything in the architecture uses caching and instrumental values? From reading, I imagined a memory+cache structure instead of being closer to “cache all the way down”.
Apart from this, I would bet that something interesting will happen for a somewhat human-comparable agent with regard to self-modelling and identity. Would anything similar to human identity emerge, or would this require additional structure? Some representation of the agent itself and its capabilities should be present, at least.
“Cached” might be an unhelpful term here, compared to “amortized”. ‘Cache’ makes one think of databases or memories, as something you ‘know’ (in a database or long-term memory somewhere), whereas in practice it tends to be more something you do—fusing inference with action. (They are ‘cached’ in the same way that you might loosely talk about a neural net ‘caching’ a complicated-to-compute function, like a value function in RL/decision theory.)
So ‘amortized’ tends to be used more in the Bayesian RL literature, and gives you an idea of what Bayesian RL agents (like LLMs) are doing: when they engage in meta-learning like in-context learning, they are not (usually) implementing the Bayes-optimal backwards induction over the full decision-tree solving the POMDP (which leads you to infeasibilities like AIXI); they are doing amortized optimization. Depending on available time & compute, an agent might, at any given moment, be doing something anywhere on the spectrum from hardwired reflex to cogitating for hours explicitly on a tree of possibilities. (Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining, which is then amortized over all runtimes. Or in expert iteration like AlphaZero, you have the CNN executing an amortized version of all previous MCTS searches, as distilled into the CNN, and then executing some more explicit tree search to improve its current estimates, and then amortizing that back into the CNN again to improve the policy some more.)
They gradually learn, applying some optimization one at a time, to implement a computation increasingly equivalent to the Bayes-optimal actions, which may boil down to an extremely simple algorithm like tracking a single sufficient-statistic summarizing the entire history and implementing an if-then-else on a boundary value of it (eg. drift-diffusion); Duff 2002 suggests thinking of it as “compiling” the full Bayes-optimal program interpreted flexibly but slowly at runtime down into a fast optimized but inflexible executable specialized for particular cases. A beautiful example (https://arxiv.org/pdf/1905.03030.pdf#page=6&org=deepmind) shows the simple heads/tails counting algorithm implemented by a meta-learning RNN being read off directly. EDIT: I go through a lot of this for my Kelly coin-flip page, but there is also some recent research doing the same thing with different, non-Bayesian terminology: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their
(I have more links on this topic; does anyone have a better review of the topic than “Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016? I feel like a major problem with discussion of LLM scaling is that the Bayesian RL perspective is just not getting through to people, and part of the problem is I’m not sure what ‘the’ best introduction or summary writeup is. People can hardly be expected to just go and read 30 years of Schmidhuber papers...)
Do you have a reference for this? I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks. I think you’re referencing this paper, which trains a shallow attention-only transformer, with the nonlinearity in the attention removed, to perform linear regression. There are too many dissimilarities between the setting in this work and LLMs to convince me that this is true of LLaMA or GPT-4.
Well, obviously not just that one (“Transformers learn in-context by gradient descent”, von Oswald et al 2022). There’s lots of related work examining it in various ways. (I haven’t read a lot of those myself, unfortunately; as always, there are too many things to read, especially if I ever want to write my own stuff.)
I don’t know why you have a hard time believing it, so I couldn’t say what of those you might find relevant—it makes plenty of sense to me, for the reasons I outlined here, and is what I expect from increasingly capable models. And you didn’t seem to disagree with these sorts of claims last time: “I think that these papers do provide sufficient behavioral evidence that transformers are implementing something close to gradient descent in their weights.”
Broadly, I was also thinking of: “How Well Can Transformers Emulate In-context Newton’s Method?”, Giannou et al 2024, “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models”, Fu et al 2023, “CausalLM is not optimal for in-context learning”, Ding et al 2023, “One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention”, Mahankali et al 2023, “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers”, Dai et al 2023, “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022 / “What learning algorithm is in-context learning? Investigations with linear models”, Akyürek et al 2022, & “An Explanation of In-context Learning as Implicit Bayesian Inference”, Xie et al 2021.
What a reply, thank you!
Note that the things being cached are not things stored in memory elsewhere. Rather, they’re (supposedly) outputs of costly-to-compute functions—e.g. the instrumental value of something would be costly to compute directly from our terminal goals and world model. And most of the values in cache are computed from other cached values, rather than “from scratch”—e.g. the instrumental value of X might be computed (and then cached) from the already-cached instrumental values of some stuff which X costs/provides.
Coherence of Caches and Agents goes into more detail on that part of the picture, if you’re interested.
Thanks for the guidance! Together with Gwern’s reply my understanding now is that caching can indeed be very fluidly integrated into the architecture (and that there is a whole fascinating field that I could try to learn about).
After letting the ideas settle for a bit, I think that one aspect that might have led me to think “there is an amount of internal confusion which feels much stronger than what I would expect for an agent as in the OP” is that a Bayesian agent as described still is (or at least could be) very “monolithic” in its world model. I struggle with putting this into words, but my thinking feels a lot more disjointed/local/modular. It would make sense if there is a spectrum from “basically global/serial computation” to “fully distributed/parallel computation” where going more to the right adds sources of internal confusion.
Yeah, that’s one of the main things which the “causal models as programs” thing is meant to capture, especially in conjunction with message passing and caching. The whole thing is still behaviorally one big model insofar as the cache is coherent, but the implementation is a bunch of little sparsely-interacting submodel-instances.