I don’t think I’m cribbing from one of my posts. This might be related to some of Alex Turner’s recent posts though.
It seems you’re saying that a good deal of our behavior isn’t governed by the critic system. My estimate is that even though it’s all ultimately guided by evolution, the vast majority of mammalian behavior is governed by the critic, which would make it a good target for alignment in a brainlike AGI system.
I’d like to think I’m being a little more subtle. Me avoiding heroin isn’t “not governed by the critic”; instead, what’s going on is that it’s learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.
Point is, if you somehow managed to separate my reward circuitry from the rest of my brain, you would be missing information needed to learn my values. My reward circuitry would think heroin was highly rewarding, and the fact that I don’t value it is stored in the actor, a consequence of the history of my life. If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.
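To make that concrete, here’s a toy sketch (made-up states and numbers, just to illustrate where the information lives): the reward circuitry alone only tells you what it would fire for, while the preferences the actor has actually learned reflect the history of feedback it received plus however the policy happens to generalize.

```python
# Toy sketch with made-up numbers: raw reward circuitry vs. the policy the
# actor actually learned from its history of critic feedback.

# What the reward circuitry *would* emit for each state, if it were reached.
innate_reward = {"food": 1.0, "social": 0.8, "heroin": 5.0}

# States the agent has actually visited; the critic only ever trained the
# actor on these.
experienced_states = ["food", "social"]

# The actor's learned preferences come from that history plus how the policy
# generalizes ("avoid things that look ruinous"), not from reading the reward
# table directly.
actor_preference = {"food": 0.9, "social": 0.7, "heroin": -0.5}

chosen = max(actor_preference, key=actor_preference.get)
print(chosen)  # "food": the big innate reward for "heroin" never shaped the policy
```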
Providing an expectation-discounted reward signal is one way to produce progressively-closer-to-desired behaviors. In the mammalian system, I think evolution has good reasons to prefer this route over trying to hardwire behaviors, given an extremely complex world and competition with the whole forebrain system for control of behavior.
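One way to cash this out in standard actor-critic terms (my gloss, with an invented two-action toy environment): the critic keeps a value estimate, and the actor is nudged by the temporal-difference error, i.e. by how much better or worse things went than the critic expected.

```python
import math
import random

# Minimal tabular actor-critic sketch on an invented two-action problem.
# The policy update is deliberately simplified; this is a gloss, not a claim
# about how the brain implements it.

actions = ["left", "right"]
gamma = 0.9      # discount factor
alpha_v = 0.1    # critic learning rate
alpha_pi = 0.1   # actor learning rate

V = {"start": 0.0, "good": 0.0, "bad": 0.0}   # critic: state-value estimates
logits = {a: 0.0 for a in actions}            # actor: action preferences at "start"

def sample_action():
    z = sum(math.exp(v) for v in logits.values())
    probs = [math.exp(logits[a]) / z for a in actions]
    return random.choices(actions, weights=probs)[0]

for _ in range(2000):
    a = sample_action()
    s_next = "good" if a == "right" else "bad"
    reward = 1.0 if s_next == "good" else -1.0

    # TD error: how much better the outcome was than the critic expected.
    delta = reward + gamma * V[s_next] - V["start"]

    V["start"] += alpha_v * delta      # critic moves its expectation
    logits[a] += alpha_pi * delta      # actor is nudged toward better-than-expected actions

print(logits)  # "right" ends up strongly preferred over "left"
```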
Yeah, I may have edited in something relevant to this after commenting. The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn’t start out omniscient, or even particularly clever—it doesn’t actually know what the expectation-discounted reward is. Given the constraints, it’s stuck trying to nudge the actor to explore in maybe-good directions, so that it can make better guesses about where to nudge towards next—basically clever curriculum learning.
I bring this up because this curriculum is information that’s in the critic, but that isn’t identical to our values. It has a sort of planned obsolescence; the nudges aren’t there because evolution expected us to literally value the nudges. They’re there to serve as a breadcrumb trail that would have led us to learning evolutionarily favorable habits of mind in the ancestral environment.
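A standard RL analogue of that breadcrumb trail (my analogy, not something from the post) is a shaping bonus that decays over training: early on the extra signal pulls the learner toward regions the designer guessed were promising, and later it fades so that only the underlying reward is left.

```python
# Sketch of a nudge with planned obsolescence: a shaping bonus that decays
# over training and eventually leaves only the base reward behind.
# `near_breadcrumb` is a hypothetical flag meaning "exploring in a direction
# the designer (or evolution) wanted to encourage early on."

def shaped_reward(base_reward, near_breadcrumb, step, decay_steps=10_000):
    nudge_scale = max(0.0, 1.0 - step / decay_steps)   # planned obsolescence
    bonus = 0.5 * nudge_scale if near_breadcrumb else 0.0
    return base_reward + bonus

print(shaped_reward(0.0, True, step=0))        # 0.5: early on, the nudge dominates
print(shaped_reward(0.0, True, step=10_000))   # 0.0: later, only the base reward matters
```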
Me avoiding heroin isn’t “not governed by the critic”; instead, what’s going on is that it’s learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.
I think we’re largely in agreement on this. The actor system is controlling a lot of our behavior. But it’s doing so as the critic system trained it to do. So the critic is in charge, minus generalization errors.
However, I also want to claim that the critic system is directly in charge when we’re using model-based thinking: when we come up with a predicted outcome before acting, the critic supplies the estimate of how good that outcome is. But I’m not even sure this is a crux; either way, the critic is still in charge in a pretty important way.
If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.
I think that information would be found in both the actor and the critic, though not to exactly the same degree; the critic probably updates faster. And the end result of the process can be a complex interaction between the actor, a world model (which I didn’t even bring into it in the article), and the critic. For instance, if it doesn’t occur to you to think about the likely consequences of doing heroin, the decision is based on the critic’s prediction that the heroin will be awesome. If the process, probably governed by the actor, does make a prediction of withdrawals and degradation as a result, then the decision is based on a rough sum that includes the critic’s very negative assignment of value to that part of the outcome.
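Here’s a stick-figure version of that interaction (invented states and values, just to show the shape of the computation): a world model predicts what follows from an imagined state, the critic scores whatever gets predicted, and whether the withdrawal term enters the sum depends on whether the rollout happens at all.

```python
# Stick-figure sketch with invented values: the critic scores imagined states,
# and the world model decides what the rollout predicts next.

gamma = 0.9

critic_value = {                      # the critic's evaluations of imagined states
    "high": 8.0,
    "withdrawal_and_degradation": -20.0,
    "ordinary_evening": 1.0,
}

world_model = {                       # what each imagined state is predicted to lead to
    "high": "withdrawal_and_degradation",
    "withdrawal_and_degradation": "withdrawal_and_degradation",
    "ordinary_evening": "ordinary_evening",
}

def evaluate(state, rollout_steps):
    """Rough discounted sum of critic values along the model's predicted trajectory."""
    total, discount = 0.0, 1.0
    for _ in range(rollout_steps + 1):
        total += discount * critic_value[state]
        state = world_model[state]
        discount *= gamma
    return total

# If it never occurs to you to simulate consequences, only the immediate
# prediction ("the heroin will be awesome") enters the decision:
print(evaluate("high", rollout_steps=0))              # 8.0
# If the (probably actor-driven) process does predict withdrawal, the sum flips:
print(evaluate("high", rollout_steps=1))              # 8.0 + 0.9 * (-20.0) = -10.0
print(evaluate("ordinary_evening", rollout_steps=1))  # 1.0 + 0.9 * 1.0 = 1.9
```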
The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn’t start out omniscient, or even particularly clever—it doesn’t actually know what the expectation-discounted reward is.
I totally agree. That’s why the key question here is whether the critic can be reprogrammed after there’s enough knowledge in the actor and the world model.
As for the idea that the critic nudges, I agree. I think the early nudges are provided by a small variety of innate reward signals, and the critic then expands those with theories of the next thing we should explore, as it learns to connect those innate rewards to other sensory representations.
The critic only comes to represent adult human “values” as the result of tons of iterative learning between the systems. That’s the theory, anyway.
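As a cartoon of that bootstrapping (made-up cue names, nothing more): a temporal-difference critic starts out assigning value only where the innate reward actually fires, and over repetitions it spreads value backward onto sensory cues that reliably predict it.

```python
# Cartoon of value spreading from an innate reward onto a predictive cue,
# via plain temporal-difference learning. Cues and numbers are made up.

gamma = 0.9
alpha = 0.1

V = {"cue": 0.0, "innate_reward_event": 0.0, "done": 0.0}

# Each episode: see the cue, then the innately rewarding event, then it's over.
episode = [("cue", "innate_reward_event", 0.0),
           ("innate_reward_event", "done", 1.0)]   # innate reward fires on the second step

for _ in range(500):
    for state, next_state, reward in episode:
        delta = reward + gamma * V[next_state] - V[state]
        V[state] += alpha * delta

print(round(V["innate_reward_event"], 2))  # ~1.0: the innate reward itself
print(round(V["cue"], 2))                  # ~0.9: the cue has acquired value it didn't start with
```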
It’s also worth noting that, even if this isn’t how the human system works, it might be a workable scheme to make more alignable AGI systems.