I think this is a little wrong when it comes to humans, and this reflects an alignment difficulty.
Consider heroin. Heroin would really excite my reward system, and yet saying I prefer heroin is wrong. The activity of the critic governs the learning of the actor, but just because the critic would get excited by something if it happened doesn’t mean that the combined actor-critic system currently acts or plans with any preference for that thing. Point being that identifying my preferences with the activity of the critic isn’t quite right.
This means that making an AI prefer something is more complicated than just using an actor-critic model where that thing gets a high reward. This is also a problem faced by evolution (albeit tangled up with the small information budget evolution has to work with); if it’s evolutionarily advantageous for humans’ “actor” to have some behavior that it would not otherwise have, the evolved solution won’t look like a focused critic that detects that behavior and rewards it, or an omniscient critic that knows how the world works from the start and steers the human straight towards the goal. It has to look like a curriculum that makes progressively-closer-to-desired behaviors progressively more likely to actually be encountered, and that slowly doles out reward as it notices better things happening.
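To make the actor/critic distinction concrete, here’s a minimal tabular actor-critic sketch (standard textbook RL, not a claim about actual brain circuitry; the states, reward numbers, and learning rates are invented for illustration). The thing to notice is that the critic only shapes the actor through TD errors on transitions that actually get experienced, so an option the reward circuitry would rate very highly, but which never comes up, leaves no trace in the current policy:

```python
import math
import random
from collections import defaultdict

# Invented toy world: the innate reward circuitry would respond strongly to
# "heroin", but that state is never actually encountered during learning.
REWARD = {"baseline": 0.0, "work": 1.0, "heroin": 10.0}

V = defaultdict(float)       # critic: learned value estimate per state
prefs = defaultdict(float)   # actor: learned preference per (state, action)
ALPHA_V, ALPHA_PI, GAMMA = 0.1, 0.1, 0.9

def sample_action(state, actions):
    """Softmax over the actor's current preferences."""
    weights = [math.exp(prefs[(state, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

for _ in range(5000):
    state = "baseline"
    actions = ["stay", "go_to_work"]        # "seek_heroin" is never on the menu
    action = sample_action(state, actions)
    next_state = "work" if action == "go_to_work" else "baseline"
    reward = REWARD[next_state]

    # TD error: the only channel through which the critic shapes the actor.
    delta = reward + GAMMA * V[next_state] - V[state]
    V[state] += ALPHA_V * delta
    prefs[(state, action)] += ALPHA_PI * delta

print(dict(V))       # reflects what was actually experienced
print(dict(prefs))   # the actor has learned nothing at all about "heroin"
```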
Okay, maybe it’s pretty fair to call the critic’s reward estimate an estimate of how good things are according to the human, at that moment in time. Even ignoring the edge cases, my claim is that this is surprisingly not helpful for aligning AI to human values.
I’m not sure I’m following you. I definitely agree that human behavior is not completely determined by the critic system, and that this complicates the alignment of brain-like AGI. For instance, when we act out of habit, the critic isn’t even invoked until at least after the action is completed, and maybe not at all.
But I think you’re addressing instinctive behavior. If you throw something at my eye, I’ll blink—and this might not take any learning. If an electrical transformer box blows up nearby, I might adopt a stereotyped defensive posture with one arm out and one leg up, even if I’ve never studied martial arts (this is a personal anecdote from a neuroscience instructor on instincts). If you put sugar in my mouth, I’ll probably salivate even as a newborn.
However, those are the best examples I can come up with. I think that evolution has worked by making its preferred outcomes (or rather, simple markers of them) rewarding. The critic system is thought to derive reward from more than the four Fs; curiosity and social approval are often theorized to innately produce reward (although, after looking a good bit, I don’t know of any hard evidence that these are primary rewards rather than learned rewards).
Providing an expectation-discounted reward signal is one way to produce progressively-closer-to-desired behaviors. In the mammalian system, I think evolution has good reasons to prefer this route rather than trying to hardwire behaviors in an extremely complex world, while also competing with the whole forebrain system for control of behavior.
But again, I might be misunderstanding you. In any case, thanks for the thoughts!
Edits, to address a couple more points:
I think the critic system, and your conscious predictions and preferences, are very much in charge in your decision not to find some heroin even though it’s reputedly the most rewarding thing you can do with a single chunk of time once you have some. You are factoring in your huge preference to not spend your life like the characters in Trainspotting, stealing and scrounging through filth for another fix. Or at least it seems that’s why I’m not doing it.
The low information budget of evolution is exactly why I think it relies on hardwired reward inputs to the critic for governing behavior in mammals that navigate and learn relatively complex behaviors in relatively complex perceived environments.
It seems you’re saying that a good deal of our behavior isn’t governed by the critic system. My estimate is that even though it’s all ultimately guided by evolution, the vast majority of mammalian behavior is governed by the critic. Which would make it a good target of alignment in a brainlike AGI system.
I’ll look at your posts to see if you discuss this elsewhere. Or pointers would be appreciated.
I don’t think I’m cribbing from one of my posts. This might be related to some of Alex Turner’s recent posts though.
It seems you’re saying that a good deal of our behavior isn’t governed by the critic system. My estimate is that even though it’s all ultimately guided by evolution, the vast majority of mammalian behavior is governed by the critic. Which would make it a good target of alignment in a brainlike AGI system.
I’d like to think I’m being a little more subtle. Me avoiding heroin isn’t “not governed by the critic”; instead, what’s going on is that it’s learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.
Point is, if you somehow managed to separate my reward circuitry from the rest of my brain, you would be missing information needed to learn my values. My reward circuitry would think heroin was highly rewarding, and the fact that I don’t value it is stored in the actor, a consequence of the history of my life. If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.
Providing an expectation-discounted reward signal is one way to produce progressively-closer-to-desired behaviors. In the mammalian system, I think evolution has good reasons to prefer this route rather than trying to hardwire behaviors in an extremely complex world, while also competing with the whole forebrain system for control of behavior.
Yeah, I may have edited in something relevant to this after commenting. The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn’t start out omniscient, or even particularly clever—it doesn’t actually know what the expectation-discounted reward is. Given the constraints, it’s stuck trying to nudge the actor to explore in maybe-good directions, so that it can make better guesses about where to nudge towards next—basically clever curriculum learning.
I bring this up because this curriculum is information that’s in the critic, but that isn’t identical to our values. It has a sort of planned obsolescence: the nudges aren’t there because evolution expected us to literally value the nudges; they’re there to serve as a breadcrumb trail that would have led us to learning evolutionarily favorable habits of mind in the ancestral environment.
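A standard machine-learning analogue of that breadcrumb-trail idea is potential-based reward shaping; here’s a tiny sketch (my analogy, with an invented 1-D world and made-up numbers). The shaping bonuses steer early exploration toward the goal, but shaping of this potential-based form provably doesn’t change which policies end up optimal (Ng et al. 1999), which is the planned-obsolescence flavor: the nudges are scaffolding, not the thing being valued.

```python
# Invented example: a goal on a number line, with a shaping "breadcrumb" bonus
# for moving closer to it, layered on top of the real reward for reaching it.
GOAL = 10       # hypothetical goal state
GAMMA = 0.99    # discount factor

def phi(state: int) -> float:
    """Potential function: negative distance to the goal."""
    return -abs(GOAL - state)

def shaped_reward(state: int, next_state: int) -> float:
    base = 1.0 if next_state == GOAL else 0.0            # the "real" reward
    breadcrumb = GAMMA * phi(next_state) - phi(state)    # the nudge
    return base + breadcrumb

print(shaped_reward(3, 4))    # moved closer: small positive nudge (~1.06)
print(shaped_reward(4, 3))    # moved away: small negative nudge (~-0.93)
print(shaped_reward(9, 10))   # reached the goal: real reward plus final nudge (2.0)
```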
Me avoiding heroin isn’t “not governed by the critic”; instead, what’s going on is that it’s learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.
I think we’re largely in agreement on this. The actor system is controlling a lot of our behavior. But it’s doing so as the critic system trained it to do. So the critic is in charge, minus generalization errors.
However, I also want to claim that the critic system is directly in charge when we’re using model-based thinking: when we come up with a predicted outcome before acting, the critic is supplying the estimate of how good that outcome is. But I’m not even sure this is a crux. The critic is still in charge in a pretty important way.
If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.
I think that information would be found in both the actor and the critic. But not to exactly the same degree. I think the critic probably updates faster. And the end result of the process can be a complex interaction between the actor, a world model (which I didn’t even bring into the picture in the article), and the critic. For instance, if it doesn’t occur to you to think about the likely consequences of doing heroin, the decision is based on the critic’s prediction that the heroin will be awesome. If the process, governed probably by the actor, does make a prediction of withdrawals and degradation as a result, then the decision is based on a rough sum that includes the critic’s very negative assignment of value to that part of the outcome.
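To make the “rough sum” picture concrete, here’s a toy sketch of that heroin decision; the feature names and numbers are of course invented, and the real computation is presumably far messier:

```python
# Critic's (learned) value assignments to predicted outcome features.
# Purely illustrative numbers.
CRITIC_VALUE = {
    "intense_euphoria": +8.0,
    "withdrawal": -9.0,
    "social_degradation": -7.0,
}

def evaluate_plan(predicted_features):
    """Score a candidate plan as a rough sum of the critic's value over
    whatever outcome features the world model happens to predict."""
    return sum(CRITIC_VALUE.get(f, 0.0) for f in predicted_features)

# If the process never surfaces the downstream consequences, the plan looks great:
print(evaluate_plan(["intense_euphoria"]))                                      # 8.0
# If the actor/world model does predict withdrawals and degradation, the sum flips:
print(evaluate_plan(["intense_euphoria", "withdrawal", "social_degradation"]))  # -8.0
```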
The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn’t start out omniscient, or even particularly clever—it doesn’t actually know what the expectation-discounted reward is.
I totally agree. That’s why the key question here is whether the critic can be reprogrammed after there’s enough knowledge in the actor and the world model.
As for the idea that the critic nudges, I agree. I think the early nudges are provided by a small variety of innate reward signals, and the critic then expands those with theories of the next thing we should explore, as it learns to connect those innate rewards to other sensory representations.
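As a cartoon of that expansion, plain TD learning already gives the basic mechanism: a cue that reliably precedes an innate primary reward acquires value itself, even though nothing innate responds to the cue. (Invented numbers; not meant as a circuit-level claim.)

```python
# Minimal TD sketch: "cue" reliably precedes "food"; only food carries an
# innate (primary) reward.
V = {"cue": 0.0, "food": 0.0}
ALPHA, GAMMA = 0.1, 0.9

for _ in range(500):
    # one episode: see the cue, then get the food (primary reward 1.0), then end
    delta_cue = 0.0 + GAMMA * V["food"] - V["cue"]
    V["cue"] += ALPHA * delta_cue
    delta_food = 1.0 + GAMMA * 0.0 - V["food"]
    V["food"] += ALPHA * delta_food

print(V)  # the cue ends up valued (~0.9 x food's value) despite no innate reward of its own
```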
The critic is only representing adult human “values” as the result of tons of iterative learning between the systems. That’s the theory, anyway.
It’s also worth noting that, even if this isn’t how the human system works, it might be a workable scheme to make more alignable AGI systems.