I always say that the whole brain (including not only the basal ganglia but also the thalamocortical system, medulla, etc.) operates as a model-based RL system. You’re saying that the BG by itself operates as a model-free RL system. So I don’t think we’re disagreeing, because “the cortex is the model”?? (Well, we definitely have some disagreements about the BG, but we don’t have to get into them, I don’t think they’re very important for present purposes.)
I think there is some disagreement here, at least in the way I am using model-based / model-free RL (not sure exactly how you are using it). Model-based RL, at least to me, is not just about explicitly having some kind of model, which I think we both agree exists in cortex, but rather about the action selection system actually using that model to do some kind of explicit rollouts for planning. I do not think the basal ganglia does this, while I think the PFC has some meta-learned ability to do this. In this sense, the BG is ‘model-free’ while the cortex is ‘model-based’.
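To make the distinction concrete, here is a minimal toy sketch (the tabular setup and every name in it are illustrative assumptions, not a claim about the brain or any real system): a "model-free" agent acts from cached values, while a "model-based" agent runs explicit one-step-at-a-time rollouts through a learned model at decision time.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 5, 3, 4

# Placeholder learned quantities: a value table for the model-free path,
# and a learned transition/reward model for the model-based path.
Q = rng.normal(size=(N_STATES, N_ACTIONS))
transition_model = rng.integers(0, N_STATES, size=(N_STATES, N_ACTIONS))
reward_model = rng.normal(size=(N_STATES, N_ACTIONS))

def act_model_free(state: int) -> int:
    # Act directly from cached values; no simulation of the future at decision time.
    return int(np.argmax(Q[state]))

def act_model_based(state: int) -> int:
    # Explicitly roll the learned model forward one step at a time and take
    # the first action of the best simulated trajectory.
    best_action, best_return = 0, -np.inf
    for first_action in range(N_ACTIONS):
        s, a, total = state, first_action, 0.0
        for _ in range(HORIZON):
            total += reward_model[s, a]
            s = int(transition_model[s, a])
            a = int(np.argmax(reward_model[s]))  # greedy continuation inside the rollout
        if total > best_return:
            best_action, best_return = first_action, total
    return best_action
```

On this usage, only `act_model_based` counts as model-based: both paths may rely on learned components, but only the second consults a model via explicit rollouts at action-selection time.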
I don’t really find “meta-RL” as a great way to think about dlPFC (or whatever the exact region-in-question is). See Rohin’s critique of that DeepMind paper here. I might instead say that “dlPFC can learn good ideas / habits that are defined at a higher level of abstraction” or something like that. For example, if I learn through experience (or hearsay) that it’s a good idea to use Anki flashcards, you can call that Meta-RL (“I am learning how to learn”). But you can equally well describe it as “I am learning to take good actions that will eventually lead to good consequences”. Likewise, I’d say “learning through experience that I should suck up to vain powerful people” is probably in the same category as “learning through experience that I should use Anki flashcards”—I suspect they’re learned in the same way by the same part of PFC—but “learning to suck up” really isn’t the kind of thing that one would call “meta-RL”, I think. There’s no “meta”—it’s just a good (abstract) type of action that I have learned by RL.
This is an interesting point. At some level of abstraction, I don’t think there is a huge amount of difference between meta-RL and ‘learning highly abstract actions/habits’. What I am mostly pointing towards is that the PFC learns high-level actions, including how to optimise and perform RL effectively over long horizons, and high-level cognitive habits like how to do planning, which is not an intrinsic ability but rather has to be learned. My understanding of what exactly the dlPFC does and how exactly it works is the place where I am most uncertain at present.
I agree in the sense of “it’s hard to look at the brainstem and figure out what a developed-world adult is trying to do at any given moment, or more generally in life”. I kinda disagree in the sense of “a person who is not hungry or cold will still be motivated by social status and so on”. I don’t think it’s right to put “eating when hungry” in the category of “primary reward” but say that “impressing one’s friends” is in a different, lesser category (if that’s what you’re saying). I think they’re both in the same category.
I agree that even when not immediately hungry or cold etc. we still get primary rewards from increasing social status etc. I don’t completely agree with Robin Hanson that almost all human behaviour can be explained by this drive directly though. I think we act on more complex linguistic values, or at least our behaviour in pursuit of these primary rewards of social status is mediated through them.
I don’t particularly buy the importance of words-in-particular here. For example, some words have two or more definitions, but we have no trouble at all valuing one of those definitions but not the other. And some people sometimes have difficulty articulating their values. From what I understand, internal monologue plays a bigger or smaller role in the mental life of different people. So anyway, I don’t see any particular reason to privilege words per se over non-linguistic concepts, at least if the goal is a descriptive theory of humans. If we’re talking about aligning LLMs, I’m open to the idea that linguistic concepts are sufficient to point at the right things.
So for words literally, I agree with this. By ‘linguistic’ I am more pointing at abstract high-level cortical representations. I think that for the most part these line up pretty well with and are shaped by our linguistic representations and that the ability of language to compress and communicate complex latent states is one of the big reasons for humanity’s success.
I think I would have made the weaker statement “There is no particular reason to expect this project to be possible at all.” I don’t see a positive case that the project will definitely fail. Maybe the philosophers will get very lucky, or whatever. I’m just nitpicking here, feel free to ignore.
This is fair. I personally have very low odds on success but it is not a logical impossibility.
I think (?) you’re imagining a different AGI development model than me, one based on LLMs, in which more layers + RLHF scales to AGI. Whereas I’m assuming (or at least, “taking actions conditional on the assumption”) that LLM+RLHF will plateau at some point before x-risk, and then future AI researchers will pivot to architectures more obviously & deeply centered around RL, e.g. AIs for which TD learning is happening not only throughout training but also online during deployment (as it is in humans).
I am not sure we actually imagine AGI designs that are all that different. Specifically, my near-term AGI model is essentially a multi-modal DL-trained world model, likely with an LLM as a centrepiece but potentially also including vision and other modalities, which is then trained with RL either end to end or via some kind of wrapper on a very large range of tasks. I think, given that we already have extremely powerful LLMs in existence, almost any future AGI design will use them at least as part of the general world model. In that case, there will be a very general and highly accessible linguistic latent space which will serve as the basis of policy and reward model inputs.
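As a rough structural sketch of that kind of design (all module names and sizes below are placeholders, not any real system), the policy and the reward model both read from the world model's shared latent space:

```python
import torch
import torch.nn as nn

class SketchAGI(nn.Module):
    """Placeholder sketch: a pretrained (multi-modal / LLM) world model supplies
    a shared latent space, and both the policy and the reward model read from
    that latent space rather than from raw observations."""

    def __init__(self, world_model: nn.Module, d_latent: int, n_actions: int):
        super().__init__()
        self.world_model = world_model                 # e.g. an LLM-based encoder
        self.policy_head = nn.Linear(d_latent, n_actions)
        self.reward_model = nn.Linear(d_latent, 1)

    def forward(self, observation_tokens: torch.Tensor):
        z = self.world_model(observation_tokens)       # shared "linguistic" latent state
        return self.policy_head(z), self.reward_model(z)

# Toy usage with a stand-in world model (a batch of 2 sequences of 8 tokens each).
toy_world_model = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(1), nn.Linear(64 * 8, 64))
agent = SketchAGI(toy_world_model, d_latent=64, n_actions=10)
action_logits, predicted_reward = agent(torch.randint(0, 1000, (2, 8)))
```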
Model-based RL, at least to me, is … using that model to do some kind of explicit rollouts for planning
Seems like just terminology then. I’m using the term “model-based RL” more broadly than you.
I agree with you that (1) explicit one-timestep-at-a-time rollouts is very common (maybe even universal) in self-described “model-based RL” papers that you find on arxiv/cs today, and that (2) these kinds of rollouts are not part of the brain “source code” (although they might show up sometimes as a learned metacognitive strategy).
I think you’re taking (1) to be evidence that “the term ‘model-based RL’ implies one-timestep-at-a-time rollouts”, whereas I’m taking (1) to be evidence that “AI/CS people have some groupthink about how to construct effective model-based RL algorithms”.
I don’t think there is a huge amount of difference between meta-RL and ‘learning highly abstract actions/habits’
Hmm, I think the former is a strict subset of the latter. E.g. I think “learning through experience that I should suck up to vain powerful people” is the latter but not the former.
I don’t completely agree with Robin Hanson that almost all human behaviour can be explained by this drive directly though.
Yeah I agree with the “directly” part. For example, I think some kind of social drives + the particular situations I’ve been in, led to me thinking that it’s good to act with integrity. But now that desire / value is installed inside me, not just a means to an end, so I feel some nonzero motivation to “act with integrity” even when I know for sure that I won’t get caught etc. Not that it’s always a sufficient motivation …
I think there is some disagreement here, at least in the way I am using model-based / model-free RL (not sure exactly how you are using it). Model-based RL, at least to me, is not just about explicitly having some kind of model, which I think we both agree exists in cortex, but rather about the action selection system actually using that model to do some kind of explicit rollouts for planning. I do not think the basal ganglia does this, while I think the PFC has some meta-learned ability to do this. In this sense, the BG is ‘model-free’ while the cortex is ‘model-based’.
Huh. I’d agree that’s an important distinction, but a model can also be leveraged for learning; the way I’d normally use it, actor-critic architectures can fall on a spectrum of “modeliness” depending on how “modely” the critic is, even if the actor is a non-recursive, non-modely architecture. I think this is relevant to shard theory because I think the best arguments about shards involve inner alignment failure in model-free-in-my-stricter-sense models.
So, I agree and I think we are getting at the same thing (though I’m not completely sure what you are pointing at). The way to have a model-y critic and actor is to have the actor and critic perform model-free RL over the latent space of your unsupervised world model. This is the key point of my post and why humans can have ‘values’ and desires for highly abstract linguistic concepts such as ‘justice’ as opposed to pure sensory states or primary rewards.
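Concretely, a minimal sketch of what "model-free RL over the latent space of a world model" could look like (PyTorch; all names, shapes, and the linear heads are illustrative assumptions): the actor and critic only ever see the world model's latent state z and learn by ordinary TD updates, so value can attach directly to abstract latent features rather than to raw sensory states.

```python
import torch
import torch.nn as nn

d_latent, n_actions, gamma = 64, 4, 0.99
actor = nn.Linear(d_latent, n_actions)   # policy defined over world-model latents
critic = nn.Linear(d_latent, 1)          # value defined over world-model latents
optimiser = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def td_update(z, action, reward, z_next, done):
    """One model-free TD(0) actor-critic step; z and z_next are latent states
    produced by the (separately trained) unsupervised world model."""
    with torch.no_grad():
        target = reward + gamma * critic(z_next).squeeze(-1) * (1.0 - done)
    value = critic(z).squeeze(-1)
    advantage = (target - value).detach()
    log_prob = torch.distributions.Categorical(logits=actor(z)).log_prob(action)
    loss = (value - target).pow(2).mean() - (advantage * log_prob).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

# Toy call: a batch of 2 latent states from the world model.
td_update(torch.randn(2, d_latent), torch.tensor([0, 3]), torch.ones(2), torch.randn(2, d_latent), torch.zeros(2))
```

Nothing here rolls a model forward; the "modeliness" lives entirely in where the inputs come from.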
This is fair. I personally have very low odds on success but it is not a logical impossibility.
I’d say that the probability of success depends on
(1) Conservatism—how much of the prior structure (i.e., what our behavior actually looks like at the moment, how it’s driven by particular shards, etc.) you want to preserve. The more conservative you are, the harder it is.
(2) Parametrization—how many moving parts (e.g., values in value consequentialism or virtues in virtue ethics) you allow for in your desired model—the more, the easier.
If you want to explain all of human behavior and reduce it to one metric only, the project is doomed.[1]
For some values of (1) and (2) you can find one or more coherent extrapolations of human values/value concepts. The thing is, often there’s not one extrapolation that is clearly better for one particular person and the greater the number of people whose values you want to extrapolate, the harder it gets. People differ in what extrapolation they would prefer (or even if they would like to extrapolate away from their status quo common sense ethics) due to different genetics, experiences, cultural influences, pragmatic reasons etc.
There may also be some misunderstanding if one side assumes that the project is descriptive (adequately describe all of human behavior with a small set of latent value concepts) while the other assumes it is prescriptive (provide a unified, coherent framework that retains some part of our current value system but makes it more principled, robust against moving out of distribution, etc.).