I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” is analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
(Emphasis added)
I don’t think this engages with the substance of the analogy to humans. I don’t think any party in this conversation believes that human learning is “just” RL based on a reward circuit, and I don’t believe it either. “Just RL” also isn’t necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument.
I would say “human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL (self-supervised learning) / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and...”
Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values:
Wolfram Schultz and colleagues have found that the signaling behavior of phasic dopamine in the mesocorticolimbic pathway mirrors that of a TD error (or reward prediction error).
In addition to finding correlates of reinforcement learning signals in the brain, artificial manipulation of those signal correlates (through optogenetic stimulation, for example) produces the behavioral adjustments that would be predicted from their putative role in reinforcement learning.
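For reference, the temporal-difference (TD) error that the quoted passage identifies with phasic dopamine signaling is the standard reward-prediction-error quantity from reinforcement learning:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

i.e., the difference between what actually happened (immediate reward plus the discounted value of the next state) and the value previously predicted for the current state. Schultz’s finding, as summarized above, is that phasic dopamine firing tracks this quantity: elevated for better-than-predicted outcomes, near baseline for fully predicted ones, and depressed for worse-than-predicted ones.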
This is incredibly weak evidence.
Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
In particular, for this to be evidence for Richard’s claim, you need to say: “If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition.” There’s some update there but it’s just not big. It’s easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way—humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
Curious what systems you have in mind here.
I don’t understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do. My reasoning here goes:
Lots of animals do reinforcement learning.
In particular, humans prominently do reinforcement learning.
Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
“RL → high chance of caring about reality” predicts this observation more strongly than “RL → low chance of caring about reality” (spelled out as a likelihood ratio below).
This seems pretty straightforward to me, but I bet there are also pieces of your perspective I’m just not seeing.
But in particular, it doesn’t seem relevant to consider selection pressures from evolution, except insofar as we’re postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards “RL → high chance of caring about reality.”
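Spelling that claimed update out as a likelihood ratio (a sketch of the form of the argument only; no particular numbers are asserted here):

$$\frac{P(\text{humans care about reality} \mid \text{RL} \to \text{high chance of caring about reality})}{P(\text{humans care about reality} \mid \text{RL} \to \text{low chance of caring about reality})} > 1$$

Observing reality-caring humans therefore shifts the odds toward the first hypothesis; the disagreement in the rest of the thread is about how far above 1 this ratio actually is.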
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
I don’t see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can’t make further updates by reasoning about how people do it?
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
I’m saying that there was a missed update towards that conclusion, so it doesn’t matter if we already knew that humans do within-lifetime learning?
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly care about other stuff). So if e.g. P(humans care about the real world | RL agents usually care about reward) was 80%, then your update could be at most 1.25. In fact I think it’s even smaller than that.
And then if you try to turn that into evidence about “reward is a very hard concept to learn,” or a prediction about how neural nets trained with RL will behave, it’s moving my odds ratios by less than 10% (since we are using “RL” quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update).
You seem to be saying “yes but it’s evidence,” which I’m not objecting to; I’m just saying it’s an extremely small amount of evidence. I’m not clear on whether you agree with my calculation.
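For concreteness, a minimal sketch of that calculation in Python, using the comment’s illustrative 80% figure; the 1:1 prior odds and the variable names are assumptions added for illustration, not anyone’s actual credences:

```python
# Hypotheses and evidence (shorthand labels for the two positions above):
#   H_reward: "RL agents usually end up caring about reward"
#   H_other:  "RL agents usually end up caring about other stuff"
#   E:        humans, who are trained in part by RL, care about the real world

p_E_given_H_reward = 0.8  # the comment's illustrative hypothetical: fairly high
p_E_given_H_other = 1.0   # can be at most 1; taken at its maximum to bound the update

# Maximum Bayes factor favoring H_other, i.e. the most the observation can
# count as evidence that RL tends to produce caring about reality:
max_bayes_factor = p_E_given_H_other / p_E_given_H_reward  # = 1.25

# Effect on illustrative 1:1 prior odds of H_reward : H_other
prior_odds = 1.0
posterior_odds = prior_odds / max_bayes_factor  # = 0.8

print(max_bayes_factor, posterior_odds)
```

The comment then argues the realistic shift is smaller still (moving the odds ratio by less than 10%), since “RL” is being used loosely here and other differences between the human case and the AI case dilute the update.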
(Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren’t making this argument, so you should ignore all of that; sorry to be confusing.)
Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments—no offense to you in particular. Apologies if it’s confusing or unclear.)
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low.
Yes, in large part.
I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that, conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), reward-caring humans could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things?
Here are some ways I can think of:
Caring directly about reward makes reward hacking a problem evolution has to solve, and if it doesn’t solve it properly, the person ends up masturbating and not taking (re)productive actions.
Counter-counterpoint: But many people do in fact enjoy masturbating, even though masturbation was present ancestrally and seems (to my naive view) like an obvious thing to select away.
People seem to be able to tell when you don’t really care about them, and just want to use them to get things.
But if I valued the rewarding feeling of having friends and spending time with them and watching them succeed (I, personally, feel good and happy when I am hanging out with my friends), then I would still be instrumentally aligned with my friends. If so, there would be no selection pressure for reward-motivation detectors, in the way that people were selected for noticing deception (even if not hardcoded to do so).
Overall, I feel like people being selected by evolution is an important qualifier which ends up changing the inferences we make via e.g. P(care about the real world | …), and I think I’ve at least somewhat neglected this qualifier. But I think I’d estimate P(humans care about the real world | RL agents usually care about reward) as… 0.2? I’d feel slightly more surprised by that than by the specific outcome of two coin flips both coming up heads (probability 0.25).
I think that there are serious path-dependent constraints in evolution, and not a super clear/strong/realizable fitness gradient away from caring about reward (if that’s how RL agents usually work), so I expect humans to be relatively typical along this dimension, with some known unknowns around whatever extra circuitry evolution wrote into us.
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
I think this highlights a good counterpoint. I think this alternate theory predicts “probably not”, although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status → reward, and it’s high-status to sacrifice yourself for your kid), or because keeping your kid safe → high reward as another learned drive.
Overall this feels like a contortion, but I think it’s possible. Maybe this is a… 1-bit update against the “no selection for caring about reality” point (i.e., roughly halving my odds on it)?