I’m happy you wrote this! Lots of random comments, feel free to ignore any or all of them:
basal ganglia operates as a model-free RL system
I always say that the whole brain (including not only the basal ganglia but also the thalamocortical system, medulla, etc.) operates as a model-based RL system. You’re saying that the BG by itself operates as a model-free RL system. So I don’t think we’re disagreeing, because “the cortex is the model”?? (Well, we definitely have some disagreements about the BG, but we don’t have to get into them, I don’t think they’re very important for present purposes.)
Dopamine is produced and transmitted at a few highly specific subcortical nuclei (VTA and SNc) which, from a computational standpoint, function as reward models.
Sorry if it’s explained somewhere, but I’m not following why you describe these as “reward models” and not “[ground-truth] rewards”.
Moving from the neuroscience and into the machine learning, it is clear that the PFC is the seat of the cortex’s learnt meta-reinforcement learning algorithm.
I don’t really find “meta-RL” a great way to think about dlPFC (or whatever the exact region-in-question is). See Rohin’s critique of that DeepMind paper here. I might instead say that “dlPFC can learn good ideas / habits that are defined at a higher level of abstraction” or something like that. For example, if I learn through experience (or hearsay) that it’s a good idea to use Anki flashcards, you can call that Meta-RL (“I am learning how to learn”). But you can equally well describe it as “I am learning to take good actions that will eventually lead to good consequences”. Likewise, I’d say “learning through experience that I should suck up to vain powerful people” is probably in the same category as “learning through experience that I should use Anki flashcards” (I suspect they’re learned in the same way by the same part of PFC), but “learning to suck up” really isn’t the kind of thing that one would call “meta-RL”, I think. There’s no “meta”; it’s just a good (abstract) type of action that I have learned by RL.
when asked, many humans want to try to reduce the influence of their ‘instinctual’ and habitual behaviours and instead subordinate more of their behaviours to explicit planning. Humans, at least, appear to want to be more coherent than they actually are.
I endorse a description by Scott Alexander here: “Thinking about studying Swahili is positively reinforced, actually studying Swahili is negatively reinforced. The natural and obvious result is that I intend to study Swahili, but don’t.”
So in that context, we can ask: “why are meta-desires [desires to have or not have certain desires] simpler and more coherent than object-level desires?” And I think the answer is: Object-level desires flow from hundreds of things like hunger, sex drive, laziness, etc., whereas meta-desires flow way-out-of-proportion from just one single source: the drive for social status. (Why yes I have been reading Robin Hanson, how did you know?) So the latter winds up being comparatively simple / coherent.
we (usually) know to be fearful at a real snake and not a photograph of a snake
This is minor but just for fun: I would have said “movie” not “photograph”. My hunch is that there’s a snake-detector in the superior colliculus, but that it’s mainly detecting how the snake moves / slithers, not what it looks like in a static image. I can’t prove this—the neuroscience papers on fear-of-snakes almost always use still photographs, to my chagrin.
the next key factor is that human primary reward functions are extremely underspecified.
I agree in the sense of “it’s hard to look at the brainstem and figure out what a developed-world adult is trying to do at any given moment, or more generally in life”. I kinda disagree in the sense of “a person who is not hungry or cold will still be motivated by social status and so on”. I don’t think it’s right to put “eating when hungry” in the category of “primary reward” but say that “impressing one’s friends” is in a different, lesser category (if that’s what you’re saying). I think they’re both in the same category.
…linguistic…
I don’t particularly buy the importance of words-in-particular here. For example, some words have two or more definitions, but we have no trouble at all valuing one of those definitions but not the other. And some people sometimes have difficulty articulating their values. From what I understand, internal monologue plays a bigger or smaller role in the mental life of different people. So anyway, I don’t see any particular reason to privilege words per se over non-linguistic concepts, at least if the goal is a descriptive theory of humans. If we’re talking about aligning LLMs, I’m open to the idea that linguistic concepts are sufficient to point at the right things.
…latent space…
I’ve been thinking about something vaguely like attractor dynamics, or a Bayes net, such that if concept A is very active, then that makes related concept B slightly active. And then slightly-active-concept-B is connected to striatum etc. which affects the valence / value / dopamine calculation. I wonder whether my mental picture here is mathematically equivalent to the thing you’re saying about high-dimensional latent space embeddings. Eh, probably ¯\_(ツ)_/¯
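To make the “concept A activates concept B, which feeds the valence calculation” picture concrete, here is a minimal sketch. Everything in it is an assumption made up for illustration: the concept names, the association strength of 0.3, and the valence weights. It is not a claim about how striatum actually computes anything. Note that the activation dictionary can be read as a vector and the valence as a linear readout of it, which is one way this picture might line up with a latent-space embedding.

```python
# Hypothetical association strengths: how strongly one active concept
# activates another (made-up numbers, purely illustrative).
ASSOCIATIONS = {
    "concept_a": {"concept_b": 0.3},
    "concept_b": {},
}

# Hypothetical learned valence attached to each concept.
VALENCE_WEIGHTS = {"concept_a": 0.0, "concept_b": -1.0}


def spread(activation, associations, steps=1):
    """Propagate activation along association links for a number of steps."""
    current = dict(activation)
    for _ in range(steps):
        updated = dict(current)
        for source, level in current.items():
            for target, strength in associations.get(source, {}).items():
                # Activation is capped at 1.0 so it cannot grow without bound.
                updated[target] = min(1.0, updated.get(target, 0.0) + strength * level)
        current = updated
    return current


def valence(activation, weights):
    """Linear readout: sum of concept activations weighted by their valence."""
    return sum(weights.get(name, 0.0) * level for name, level in activation.items())


# Concept A is fully active; concept B becomes slightly active as a result.
state = spread({"concept_a": 1.0}, ASSOCIATIONS)
total_valence = valence(state, VALENCE_WEIGHTS)
```

Here concept A carries no valence of its own, yet activating it still shifts the overall valence, because the slightly active concept B does carry some.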
This hope is intrinsically doomed because there is no coherent moral system or set of values to be discovered.
I think I would have made the weaker statement “There is no particular reason to expect this project to be possible at all.” I don’t see a positive case that the project will definitely fail. Maybe the philosophers will get very lucky, or whatever. I’m just nitpicking here, feel free to ignore.
This is where I perhaps have my strongest disagreement with Steven Byrnes
I think (?) you’re imagining a different AGI development model than me, one based on LLMs, in which more layers + RLHF scales to AGI. Whereas I’m assuming (or at least, “taking actions conditional on the assumption”) that LLM+RLHF will plateau at some point before x-risk, and then future AI researchers will pivot to architectures more obviously & deeply centered around RL, e.g. AIs for which TD learning is happening not only throughout training but also online during deployment (as it is in humans).
If I condition on your (presumed) beliefs, then I would agree with what you said in that footnote, I think, and I would probably stop trying to learn about the hypothalamus etc. and find something else to do.
If it helps, I have a short summary of what I’m working on and the corresponding theory-of-change here.
I always say that the whole brain (including not only the basal ganglia but also the thalamocortical system, medulla, etc.) operates as a model-based RL system. You’re saying that the BG by itself operates as a model-free RL system. So I don’t think we’re disagreeing, because “the cortex is the model”?? (Well, we definitely have some disagreements about the BG, but we don’t have to get into them, I don’t think they’re very important for present purposes.)
I think there is some disagreement here, at least in the way I am using model-based / model-free RL (not sure exactly how you are using it). Model-based RL, at least to me, is not just about explicitly having some kind of model, which I think we both agree exists in cortex, but rather the actual action selection system using that model to do some kind of explicit rollouts for planning. I do not think the basal ganglia does this, while I think the PFC has some meta-learned ability to do this. In this sense, the BG is ‘model-free’ while the cortex is ‘model-based’.
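As a rough illustration of the distinction being drawn here, the sketch below puts the two styles side by side on a toy problem. The five-state chain environment, the function names, and all hyperparameters are assumptions invented for this example, and the “model” used for planning is simply the true dynamics. The first function selects actions from cached values learned purely from experience; the second selects actions by explicitly simulating futures with a model.

```python
import random

# Hypothetical toy environment: a chain of states 0..4, reward 1.0 on reaching state 4.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (-1, 1)  # step left / step right


def step(state, action):
    """Environment dynamics; also reused below as a (perfect) world model."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def greedy(q, state, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def model_free_q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Model-free: cached action values are updated from experienced transitions only."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            explore = rng.random() < epsilon
            action = rng.choice(ACTIONS) if explore else greedy(q, state, rng)
            next_state, reward = step(state, action)
            best_next = 0.0 if next_state == GOAL else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


def plan_with_rollouts(state, model, depth=4, gamma=0.9):
    """Model-based in the stricter sense: choose an action by simulating futures."""
    def value(s, d):
        if d == 0 or s == GOAL:
            return 0.0
        return max(action_value(s, a, d) for a in ACTIONS)

    def action_value(s, a, d):
        next_state, reward = model(s, a)
        return reward + gamma * value(next_state, d - 1)

    return max(ACTIONS, key=lambda a: action_value(state, a, depth))


q_values = model_free_q_learning()
model_free_policy = {s: greedy(q_values, s, random.Random(0)) for s in range(GOAL)}
rollout_policy = {s: plan_with_rollouts(s, step) for s in range(GOAL)}
```

Both policies end up stepping right in every state. The difference is in how the choice is made: the first never consults a model at decision time, while the second does nothing but consult one.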
I don’t really find “meta-RL” a great way to think about dlPFC (or whatever the exact region-in-question is). See Rohin’s critique of that DeepMind paper here. I might instead say that “dlPFC can learn good ideas / habits that are defined at a higher level of abstraction” or something like that. For example, if I learn through experience (or hearsay) that it’s a good idea to use Anki flashcards, you can call that Meta-RL (“I am learning how to learn”). But you can equally well describe it as “I am learning to take good actions that will eventually lead to good consequences”. Likewise, I’d say “learning through experience that I should suck up to vain powerful people” is probably in the same category as “learning through experience that I should use Anki flashcards” (I suspect they’re learned in the same way by the same part of PFC), but “learning to suck up” really isn’t the kind of thing that one would call “meta-RL”, I think. There’s no “meta”; it’s just a good (abstract) type of action that I have learned by RL.
This is an interesting point. At some level of abstraction, I don’t think there is a huge amount of difference between meta-RL and ‘learning highly abstract actions/habits’. What I am mostly pointing towards is that the PFC learns high-level actions, including how to optimise and perform RL over long horizons effectively. That includes high-level cognitive habits such as how to do planning, which is not an intrinsic ability but rather has to be learned. My understanding of what exactly the dlPFC does and how exactly it works is the place where I am most uncertain at present.
I agree in the sense of “it’s hard to look at the brainstem and figure out what a developed-world adult is trying to do at any given moment, or more generally in life”. I kinda disagree in the sense of “a person who is not hungry or cold will still be motivated by social status and so on”. I don’t think it’s right to put “eating when hungry” in the category of “primary reward” but say that “impressing one’s friends” is in a different, lesser category (if that’s what you’re saying). I think they’re both in the same category.
I agree that even when not immediately hungry or cold etc we still get primary rewards from increasing social status etc. I don’t completely agree with Robin Hanson that almost all human behaviour can be explained by this drive directly though. I think we act on more complex linguistic values, or at least our behaviour to fulfil these primary rewards of social status is mediated through these.
I don’t particularly buy the importance of words-in-particular here. For example, some words have two or more definitions, but we have no trouble at all valuing one of those definitions but not the other. And some people sometimes have difficulty articulating their values. From what I understand, internal monologue plays a bigger or smaller role in the mental life of different people. So anyway, I don’t see any particular reason to privilege words per se over non-linguistic concepts, at least if the goal is a descriptive theory of humans. If we’re talking about aligning LLMs, I’m open to the idea that linguistic concepts are sufficient to point at the right things.
So for words literally, I agree with this. By ‘linguistic’ I am more pointing at abstract high-level cortical representations. I think that for the most part these line up pretty well with and are shaped by our linguistic representations and that the ability of language to compress and communicate complex latent states is one of the big reasons for humanity’s success.
I think I would have made the weaker statement “There is no particular reason to expect this project to be possible at all.” I don’t see a positive case that the project will definitely fail. Maybe the philosophers will get very lucky, or whatever. I’m just nitpicking here, feel free to ignore.
This is fair. I personally have very low odds on success but it is not a logical impossibility.
I think (?) you’re imagining a different AGI development model than me, one based on LLMs, in which more layers + RLHF scales to AGI. Whereas I’m assuming (or at least, “taking actions conditional on the assumption”) that LLM+RLHF will plateau at some point before x-risk, and then future AI researchers will pivot to architectures more obviously & deeply centered around RL, e.g. AIs for which TD learning is happening not only throughout training but also online during deployment (as it is in humans).
I am not sure we actually imagine that different AGI designs. Specifically, my near-term AGI model is essentially a multi-modal DL-trained world model, likely with an LLM as a centrepiece but also potentially with vision and other modalities included, which is then trained with RL, either end to end or as some kind of wrapper, on a very large range of tasks. I think, given that we already have extremely powerful LLMs in existence, almost any future AGI design will use them at least as part of the general world model. In that case, there will be a very general and highly accessible linguistic latent space which will serve as the basis of policy and reward model inputs.
Model-based RL, at least to me, is … using that model to do some kind of explicit rollouts for planning
Seems like just terminology then. I’m using the term “model-based RL” more broadly than you.
I agree with you that (1) explicit one-timestep-at-a-time rollouts are very common (maybe even universal) in self-described “model-based RL” papers that you find on arxiv/cs today, and that (2) these kinds of rollouts are not part of the brain “source code” (although they might show up sometimes as a learned metacognitive strategy).
I think you’re taking (1) to be evidence that “the term ‘model-based RL’ implies one-timestep-at-a-time rollouts”, whereas I’m taking (1) to be evidence that “AI/CS people have some groupthink about how to construct effective model-based RL algorithms”.
I don’t think there is a huge amount of difference between meta-RL and ‘learning highly abstract actions/habits’
Hmm, I think the former is a strict subset of the latter. E.g. I think “learning through experience that I should suck up to vain powerful people” is the latter but not the former.
I don’t completely agree with Robin Hanson that almost all human behaviour can be explained by this drive directly though.
Yeah I agree with the “directly” part. For example, I think some kind of social drives + the particular situations I’ve been in, led to me thinking that it’s good to act with integrity. But now that desire / value is installed inside me, not just a means to an end, so I feel some nonzero motivation to “act with integrity” even when I know for sure that I won’t get caught etc. Not that it’s always a sufficient motivation …
I think there is some disagreement here, at least in the way I am using model-based / model-free RL (not sure exactly how you are using it). Model-based RL, at least to me, is not just about explicitly having some kind of model, which I think we both agree exists in cortex, but rather the actual action selection system using that model to do some kind of explicit rollouts for planning. I do not think the basal ganglia does this, while I think the PFC has some meta-learned ability to do this. In this sense, the BG is ‘model-free’ while the cortex is ‘model-based’.
Huh. I’d agree that’s an important distinction, but having a model also can be leveraged for learning; the way I’d normally use it, actor-critic architectures can fall on a spectrum of “modeliness” depending on how “modely” the critic is, even if the actor is a non-recursive, non-modely architecture. I think this is relevant to shard theory because I think the best arguments about shards involve inner alignment failure in model-free-in-my-stricter-sense models.
So, I agree and I think we are getting at the same thing (though not completely sure what you are pointing at). The way to have a model-y critic and actor is to have the actor and critic perform model-free RL over the latent space of your unsupervised world model. This is the key point of my post and why humans can have ‘values’ and desires for highly abstract linguistic concepts such as ‘justice’ as opposed to pure sensory states or primary rewards.
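A minimal sketch of what “model-free RL over the latent space of a world model” could look like, under heavy assumptions. The `encode` function is a hypothetical hand-written stand-in for a frozen, unsupervised world model; the two-concept environment, the one-step episodes, and all learning rates are invented for the example. The point is only that the actor and critic never see raw observations, so what they learn to value is defined over latent concepts rather than sensory states.

```python
import math
import random


def encode(observation):
    """Hypothetical frozen encoder standing in for an unsupervised world model:
    compresses a 6-number 'sensory' observation into a 2-d latent."""
    return [sum(observation[:3]) / 3.0, sum(observation[3:]) / 3.0]


def observe(concept, rng):
    """Noisy raw observation; which half is 'bright' depends on the hidden concept."""
    clean = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0] if concept == 0 else [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
    return [x + rng.gauss(0.0, 0.1) for x in clean]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def policy(theta, z):
    """Softmax over linear action preferences computed from the latent z."""
    prefs = [dot(row, z) for row in theta]
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, actor_lr=0.5, critic_lr=0.1, seed=0):
    """Actor-critic with no planning: both heads are linear in the latent."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # actor weights, one row per action
    w = [0.0, 0.0]                    # critic weights, V(z) = w . z
    for _ in range(steps):
        concept = rng.randrange(2)
        z = encode(observe(concept, rng))
        probs = policy(theta, z)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == concept else 0.0
        # One-step episodes, so the TD error has no bootstrapped next-state term.
        td_error = reward - dot(w, z)
        for i in range(2):
            w[i] += critic_lr * td_error * z[i]
            for b in range(2):
                grad = (1.0 if b == action else 0.0) - probs[b]
                theta[b][i] += actor_lr * td_error * grad * z[i]
    return theta, w


theta, w = train()
prob_right_action_concept_0 = policy(theta, [1.0, 0.0])[0]
prob_right_action_concept_1 = policy(theta, [0.0, 1.0])[1]
```

Nothing here does rollouts, so the learner is model-free in the stricter sense, yet its preferences are expressed entirely in terms of the encoder’s latent dimensions.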
This is fair. I personally have very low odds on success but it is not a logical impossibility.
I’d say that the probability of success depends on
(1) Conservatism: how much of the prior structure (i.e., what our behavior actually looks like at the moment, how it’s driven by particular shards, etc.) you want to preserve. The more conservative you are, the harder it is.
(2) Parametrization: how many moving parts (e.g., values in value consequentialism or virtues in virtue ethics) you allow for in your desired model. The more, the easier.
If you want to explain all of human behavior and reduce it to one metric only, the project is doomed.[1]
For some values of (1) and (2) you can find one or more coherent extrapolations of human values/value concepts. The thing is, often there’s not one extrapolation that is clearly better for one particular person and the greater the number of people whose values you want to extrapolate, the harder it gets. People differ in what extrapolation they would prefer (or even if they would like to extrapolate away from their status quo common sense ethics) due to different genetics, experiences, cultural influences, pragmatic reasons etc.
There may also be some misunderstanding if one side assumes that the project is descriptive (adequately describe all of human behavior with a small set of latent value concepts) while the other assumes it is prescriptive (provide a unified, coherent framework that retains some part of our current value system but makes it more principled, robust against moving out of distribution, etc.).