Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments—no offense to you in particular. Apologies if it’s confusing or unclear.)
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low.
Yes, in large part.
I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that, conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), reward-motivated humans could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
In what way is my fitness lower than that of someone who really cares about these things, given that the best way to get rewards may well be to actually do the things?
Here are some ways I can think of:
Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn’t solve it properly, the person ends up masturbating and not taking (re)productive actions.
Counter-counterpoint: But many people do in fact enjoy masturbating, even though it was present ancestrally and seems (to my naive view) like an obvious thing to select away.
People seem to be able to tell when you don’t really care about them, and just want to use them to get things.
But if I valued the rewarding feeling of having friends and spending time with them and watching them succeed—I, personally, feel good and happy when I am hanging out with my friends—then I would still be instrumentally aligned with my friends. If so, there would be no selection pressure for reward-motivation detectors, in the way that people were selected for noticing deception (even if not hardcoded to do so).
Overall, I feel like people being selected by evolution is an important qualifier which ends up changing the inferences we make via e.g. P(care about the real world | ), and I think I’ve at least somewhat neglected this qualifier. But I think I’d estimate P(humans care about the real world | RL agents usually care about reward) as… 0.2? I’d feel slightly more surprised by that than by the specific outcome of two coinflips both coming up heads.
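To check the coinflip comparison in bits of surprisal (this is just my own illustrative arithmetic, not anything from the parent comment):

```python
import math

# Surprisal (in bits) of an outcome with probability p: -log2(p).
def surprisal_bits(p: float) -> float:
    return -math.log2(p)

print(surprisal_bits(0.2))   # my estimate above: ~2.32 bits
print(surprisal_bits(0.25))  # a specific pair of coinflip outcomes: exactly 2 bits
```

So 0.2 does carry slightly more surprisal (about 2.32 bits) than a fully specified pair of coinflips (exactly 2 bits), which matches the comparison I intended.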
I think that there are serious path-dependent constraints in evolution, not a super clear/strong/realizable fitness gradient away from caring about reward if that’s how RL agents usually work, and so I expect humans to be relatively typical along this dimension, with some known unknowns around whatever extra circuitry evolution wrote into us.
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
I think this highlights a good counterpoint. This alternate theory predicts “probably not”, although I can contrive hypotheses for why people would sacrifice themselves anyway (e.g. they have learned that high status → reward, and it’s high-status to sacrifice yourself for your kid; or keeping your kid safe → high reward as another learned drive).
Overall this feels like a contortion, but I think it’s possible. Maybe overall this is a… 1-bit update against the “no selection for caring about reality” point?
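By a “1-bit update” I mean halving the odds on the hypothesis. A quick sketch of that bookkeeping in odds form (illustrative only; the function name is mine):

```python
# Update probability p by `bits` bits of evidence (negative bits = evidence against).
# A 1-bit update against a hypothesis multiplies its odds by 1/2.
def apply_bits(p: float, bits: float) -> float:
    odds = p / (1 - p)          # convert probability to odds
    odds *= 2 ** bits           # each bit of evidence doubles (or halves) the odds
    return odds / (1 + odds)    # convert back to probability

# E.g. starting from 0.5 (odds 1:1), a 1-bit update against gives odds 1:2,
# i.e. a posterior probability of 1/3.
print(apply_bits(0.5, -1.0))
```

This is just the odds-ratio form of Bayes’ rule, which is why I find “bits” a convenient unit for small updates like this one.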