Finally, if we want to make the model capture certain non-Bayesian human behaviors while still keeping most of the picture, we can assume that instrumental values and/or epistemic updates are cached. This creates the possibility of cache inconsistency/incoherence.
In my mind, there is an amount of internal confusion which feels much stronger than what I would expect for an agent as described in the OP. Or is the idea perhaps that everything in the architecture uses caching and instrumental values? From reading, I had imagined a memory+cache structure rather than something closer to “cache all the way down”.
Apart from this, I would bet that something interesting will happen for a somewhat human-comparable agent with regard to self-modelling and identity. Would anything similar to human identity emerge, or would this require additional structure? At least some representation of the agent itself and its capabilities should be present.
“Cached” might be an unhelpful term here, compared to “amortized”. ‘Cache’ makes one think of databases or memories, as something you ‘know’ (in a database or long-term memory somewhere), whereas in practice it tends to be more something you do—fusing inference with action. (They are ‘cached’ in the same way that you might loosely talk about a neural net ‘caching’ a complicated-to-compute function, like a value function in RL/decision theory.)
So ‘amortized’ tends to be the term used in the Bayesian RL literature, and it gives you an idea of what Bayesian RL agents (like LLMs) are doing: when they engage in meta-learning like in-context learning, they are not (usually) implementing the Bayes-optimal backwards induction over the full decision tree solving the POMDP (which leads you to infeasibilities like AIXI); they are doing amortized optimization. Depending on available time & compute, an agent might, at any given moment, be doing something anywhere on the spectrum from hardwired reflex to cogitating for hours explicitly on a tree of possibilities. (Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining, which is then amortized over all runtimes. Or in expert iteration like AlphaZero, you have the CNN executing an amortized version of all previous MCTS searches, as distilled into the CNN, and then executing some more explicit tree search to improve its current estimates, and then amortizing that back into the CNN again to improve the policy some more.)
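To make the amortized-vs-explicit spectrum concrete, here is a minimal Python sketch (my own toy, not anything from the comment above; the names `amortized_value`, `act`, and the reward function are invented for illustration): a cheap cached mapping stands in for the "reflex" end, simulated rollouts stand in for deliberation, and an expert-iteration-style step distills the search results back into the cache.

```python
import random

# Toy illustration of the amortized <-> explicit spectrum (assumed example, not
# from the comment): an "amortized" policy is a cheap cached mapping from state
# to value estimates; "explicit" inference spends compute on simulated rollouts;
# the expert-iteration-style step distills search results back into the cache.

ACTIONS = ["left", "right"]

def rollout_reward(state, action):
    """Noisy simulator standing in for the environment / world model."""
    base = 1.0 if (state + (1 if action == "right" else -1)) % 3 == 0 else 0.0
    return base + random.gauss(0, 0.1)

amortized_value = {}  # (state, action) -> cached value estimate

def act(state, search_budget=0, lr=0.5):
    """Choose an action somewhere on the reflex <-> deliberation spectrum."""
    if search_budget == 0:
        # Pure amortized inference: just read off the cached estimates (a "reflex").
        return max(ACTIONS, key=lambda a: amortized_value.get((state, a), 0.0))
    # Explicit inference: spend compute on rollouts to get fresher estimates...
    estimates = {
        a: sum(rollout_reward(state, a) for _ in range(search_budget)) / search_budget
        for a in ACTIONS
    }
    # ...then amortize (distill) those estimates back into the cache.
    for a, v in estimates.items():
        old = amortized_value.get((state, a), 0.0)
        amortized_value[(state, a)] = old + lr * (v - old)
    return max(ACTIONS, key=estimates.get)

# Early on we pay for explicit search; afterwards the amortized policy alone suffices.
for _ in range(5):
    act(state=2, search_budget=4)
print(act(state=2, search_budget=0))  # answered straight from the cache: "right"
```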
They gradually learn, applying some optimization one step at a time, to implement a computation increasingly equivalent to the Bayes-optimal actions, which may boil down to an extremely simple algorithm like tracking a single sufficient statistic summarizing the entire history and implementing an if-then-else on a boundary value of it (eg. drift-diffusion); Duff 2002 suggests thinking of it as “compiling” the full Bayes-optimal program, interpreted flexibly but slowly at runtime, down into a fast, optimized, but inflexible executable specialized for particular cases. A beautiful example of reading off the simple heads/tails counting algorithm implemented by a meta-learning RNN can be seen in https://arxiv.org/pdf/1905.03030.pdf#page=6&org=deepmind. EDIT: I go through a lot of this for my Kelly coin-flip page, but here is also some recent research doing the same thing in different, non-Bayesian terminology: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their
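As a worked toy version of the "compiled" predictor described above (my own illustration, not the RNN readout from the linked paper): for a coin with a uniform prior on its bias, the full Bayesian computation collapses to one sufficient statistic, and the Bayes-optimal next-flip prediction is an if-then-else on it.

```python
# Minimal toy (my own, not the linked paper's RNN): for a coin with unknown bias
# p ~ Uniform(0,1), predicting the next flip Bayes-optimally only needs the
# sufficient statistic (heads, tails), and the decision reduces to an if-then-else
# on whether heads outnumber tails.

def bayes_predict(flips):
    """Full Bayesian route: posterior predictive under a uniform (Beta(1,1)) prior."""
    heads = sum(flips)
    tails = len(flips) - heads
    p_next_heads = (heads + 1) / (heads + tails + 2)  # Laplace's rule of succession
    return 1 if p_next_heads > 0.5 else 0

def compiled_predict(flips):
    """'Compiled' route: track one running statistic and threshold it."""
    stat = sum(1 if f else -1 for f in flips)  # heads minus tails
    return 1 if stat > 0 else 0

# The two agree on every history (ties broken the same way):
history = [1, 1, 0, 1, 0, 0, 1]
assert bayes_predict(history) == compiled_predict(history)
```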
(I have more links on this topic; does anyone have a better review of the topic than “Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016? I feel like a major problem with discussion of LLM scaling is that the Bayesian RL perspective is just not getting through to people, and part of the problem is I’m not sure what ‘the’ best introduction or summary writeup is. People can hardly be expected to just go and read 30 years of Schmidhuber papers...)
Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining, which is then amortized over all runtimes

Do you have a reference for this? I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks. I think you’re referencing this paper, which trains a shallow attention-only transformer, with the nonlinearity removed from the attention, to perform linear regression. There are too many dissimilarities between the setting in this work and LLMs to convince me that this is true of Llama or GPT-4.
Well, obviously not just that one (“Transformers learn in-context by gradient descent”, von Oswald et al 2022). There’s lots of related work examining it in various ways. (I haven’t read a lot of those myself, unfortunately—as always, too many things to read, especially if I ever want to write my own stuff.)
I don’t know why you have a hard time believing it, so I couldn’t say what of those you might find relevant—it makes plenty of sense to me, for the reasons I outlined here, and is what I expect from increasingly capable models. And you didn’t seem to disagree with these sorts of claims last time: “I think that these papers do provide sufficient behavioral evidence that transformers are implementing something close to gradient descent in their weights.”
Broadly, I was also thinking of: “How Well Can Transformers Emulate In-context Newton’s Method?”, Giannou et al 2024; “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models”, Fu et al 2023; “CausalLM is not optimal for in-context learning”, Ding et al 2023; “One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention”, Mahankali et al 2023; “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers”, Dai et al 2023; “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022; “What learning algorithm is in-context learning? Investigations with linear models”, Akyürek et al 2022; & “An Explanation of In-context Learning as Implicit Bayesian Inference”, Xie et al 2021.
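For concreteness, here is a small numpy sketch (my own, following the construction studied in this line of work rather than any of these papers' code) of the basic identity: one gradient-descent step on the in-context examples, starting from zero weights, gives exactly the prediction of a softmax-free (linear) attention readout.

```python
import numpy as np

# Sketch (my construction after von Oswald et al 2022, not their code): one
# gradient-descent step on in-context linear-regression examples, starting from
# W = 0, produces the same prediction as a linear (softmax-free) attention
# readout with keys = x_i, values = y_i, query = x_query.

rng = np.random.default_rng(0)
d, n = 4, 16                      # feature dim, number of in-context examples
W_true = rng.normal(size=d)
X = rng.normal(size=(n, d))       # in-context inputs
y = X @ W_true                    # in-context targets
x_query = rng.normal(size=d)
lr = 0.1                          # shared learning rate / attention scale

# Route 1: one explicit GD step on the loss 0.5 * sum_i (y_i - W @ x_i)^2,
# starting from W = 0, then predict on the query.
grad = -(y - X @ np.zeros(d)) @ X          # gradient at W = 0, i.e. -sum_i y_i * x_i
W_one_step = np.zeros(d) - lr * grad
pred_gd = W_one_step @ x_query

# Route 2: linear attention: score each key x_i against the query, weight its value y_i.
pred_attn = lr * np.sum((X @ x_query) * y)

print(np.allclose(pred_gd, pred_attn))     # True: the two routes coincide
```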
What a reply, thank you!
From reading, I had imagined a memory+cache structure rather than something closer to “cache all the way down”.

Note that the things being cached are not things stored in memory elsewhere. Rather, they’re (supposedly) outputs of costly-to-compute functions—e.g. the instrumental value of something would be costly to compute directly from our terminal goals and world model. And most of the values in the cache are computed from other cached values, rather than “from scratch”—e.g. the instrumental value of X might be computed (and then cached) from the already-cached instrumental values of some stuff which X costs/provides.
Coherence of Caches and Agents goes into more detail on that part of the picture, if you’re interested.
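A minimal sketch of the picture described above (my own toy numbers and item names, not code from the post): instrumental values are memoized outputs, mostly computed from other cached values rather than re-derived from the terminal goals, so a stale entry can quietly survive a change in the world model.

```python
# Toy sketch (my own, illustrating the comment above, not code from the post):
# instrumental values are memoized outputs of a costly computation, and most cache
# entries are computed from other cache entries rather than re-derived from the
# terminal goals, so a stale entry can silently persist (cache incoherence).

TERMINAL_VALUE = {"survive": 100.0}

# What each item is instrumentally useful for, and how strongly (hypothetical numbers).
PROVIDES = {
    "money":   [("food", 0.5), ("shelter", 0.3)],
    "food":    [("survive", 0.4)],
    "shelter": [("survive", 0.3)],
}

cache = {}  # item -> cached instrumental value

def value(item):
    """Instrumental value of `item`, computed from *cached* values of what it provides."""
    if item in TERMINAL_VALUE:
        return TERMINAL_VALUE[item]
    if item not in cache:
        cache[item] = sum(weight * value(target) for target, weight in PROVIDES[item])
    return cache[item]

print(value("money"))        # 29.0: built from the cached values of food and shelter

# Incoherence: if the world model changes (food no longer aids survival) but only
# some entries are recomputed, the cache disagrees with what a from-scratch
# computation would say -- the cache inconsistency mentioned in the OP excerpt.
PROVIDES["food"] = [("survive", 0.0)]
del cache["food"]            # food gets recomputed...
print(value("food"))         # 0.0
print(value("money"))        # still 29.0: stale, because money's entry was never invalidated
```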
Thanks for the guidance! Together with Gwern’s reply, my understanding now is that caching can indeed be integrated very fluidly into the architecture (and that there is a whole fascinating field that I could try to learn about).
After letting the ideas settle for a bit, I think that one aspect that might have led me to think
In my mind, there is an amount of internal confusion which feels much stronger than what I would expect for an agent as described in the OP
is that a Bayesian agent as described still is (or at least could be) very “monolithic” in its world model. I struggle to put this into words, but my own thinking feels a lot more disjointed/local/modular. It would make sense if there is a spectrum from “basically global/serial computation” to “fully distributed/parallel computation”, where moving toward the distributed end adds sources of internal confusion.
Yeah, that’s one of the main things which the “causal models as programs” thing is meant to capture, especially in conjunction with message passing and caching. The whole thing is still behaviorally one big model insofar as the cache is coherent, but the implementation is a bunch of little sparsely-interacting submodel-instances.
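A rough gloss in code (my own, with hypothetical mechanisms, not the OP's formalism): each submodel is a small program, queries between them act as messages, and memoization makes the sparsely-interacting pieces behave like one joint model so long as the memo table stays coherent.

```python
# Rough sketch (my own gloss on "causal models as programs", not the OP's code):
# each submodel is a little program; queries between them are "messages", and a
# shared memo table means the collection behaves like one big model as long as
# that table stays coherent.

from functools import lru_cache

@lru_cache(maxsize=None)
def weather(day: int) -> str:
    # Tiny submodel: a stand-in causal mechanism.
    return "rain" if day % 3 == 0 else "sun"

@lru_cache(maxsize=None)
def traffic(day: int) -> str:
    # This submodel only ever queries (sends a message to) the submodels it depends on.
    return "heavy" if weather(day) == "rain" else "light"

@lru_cache(maxsize=None)
def commute_minutes(day: int) -> int:
    return 55 if traffic(day) == "heavy" else 25

# Behaviorally this is one joint model of (weather, traffic, commute), but the
# implementation is sparsely-interacting pieces, each instantiated only when queried.
print([commute_minutes(d) for d in range(4)])  # [55, 25, 25, 55]
```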