I am slightly confused by your hypothetical. The hypothesis is rather that when the predicted reward from seeing my friend eating a cookie (due to self-other overlap) is lower than the obtained reward of me not eating a cookie, the self-other overlap might not be updated against, because the increase in subjective reward from taking the risk to get the obtained reward is higher than the prediction error in this low-stakes scenario. I am fairly uncertain that this is what actually happens, but I put it forward as a potential hypothesis.
“So anyway, you seem to be assuming that the human brain has no special mechanisms to prevent the unlearning of self-other overlap. I would propose instead that the human brain does have such special mechanisms, and that we better go figure out what those mechanisms are. :)”
My intuition is that the brain does have special mechanisms to prevent the unlearning of self-other overlap, so I agree with you that we should be looking into the literature to understand them; one would expect mechanisms like that to evolve, given the incentives to unlearn self-other overlap and the evolutionary benefits of self-other overlap. One such mechanism could be the brain being more risk-tolerant when it comes to empathic responses and not updating against self-other overlap when the predicted reward is lower than the obtained reward, but I don’t have a model of how exactly this would be implemented.
“I’m a bit confused by this. My “apocalypse stories” from the grandparent comment did not assume any competing incentives and mechanisms, right? They were all bad actions that I claim also flowed naturally from self-other-overlap-derived incentives.”
What I meant by “competing incentives” is any incentives that compete with the good incentives I described (other-preservation and sub-agent stability), which could include bad incentives that might also flow naturally from self-other overlap.
“…when the predicted reward from seeing my friend eating a cookie (due to self-other overlap) is lower than the obtained reward of me not eating a cookie, the self-other overlap might not be updated against, because the increase in subjective reward from taking the risk to get the obtained reward is higher than the prediction error in this low-stakes scenario. I am fairly uncertain that this is what actually happens, but I put it forward as a potential hypothesis.”
I think you’re confused, or else I don’t follow. Can we walk through it? Assume traditional ML-style actor-critic RL with TD learning, if that’s OK. There’s a value function V and a reward function R. Let’s assume we start with:
V(I’m about to eat a cookie) = 10
V(Joe is about to eat a cookie) = 2 [it’s nonzero “by default” because of self-other overlap]
R(I’m eating a cookie) = 10
R(Joe is eating a cookie) = 0
So when I see that Joe is about to eat a cookie, this pleases me (V>0). But then Joe eats the cookie, and the reward is zero, so TD learning kicks in and reduces V(Joe is about to eat a cookie) for next time. Repeat a few more times and V(Joe is about to eat a cookie) approaches zero, right? So eventually, when I see that Joe is about to eat a cookie, I don’t care.
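To make that concrete, here is a minimal tabular TD(0) sketch of the update I have in mind (the learning rate, discount factor, and the one-step episode structure are assumed for illustration, not claims about the brain):

```python
# Minimal tabular TD(0) sketch of the story above. The learning rate,
# discount factor, and the assumption that each "about to eat" state leads
# to a terminal "eating" state are illustrative choices.

ALPHA = 0.3   # learning rate (assumed)
GAMMA = 1.0   # discount factor (assumed; each episode ends after one step)

V = {
    "I'm about to eat a cookie": 10.0,
    "Joe is about to eat a cookie": 2.0,  # nonzero "by default" via self-other overlap
}
R = {
    "I'm eating a cookie": 10.0,
    "Joe is eating a cookie": 0.0,
}

def td_update(state, reward, next_value=0.0):
    """One TD(0) step: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    td_error = reward + GAMMA * next_value - V[state]
    V[state] += ALPHA * td_error

# Joe eats the cookie a few times; my obtained reward is 0 each time, so
# V("Joe is about to eat a cookie") is pushed toward zero.
for _ in range(10):
    td_update("Joe is about to eat a cookie", R["Joe is eating a cookie"])
    print(round(V["Joe is about to eat a cookie"], 3))
# 1.4, 0.98, 0.686, ... -> approaches 0, i.e. eventually "I don't care"
```

Under these assumptions the prediction error is always negative, so absent some extra mechanism the overlap-derived value just decays away.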
How does your story differ from that? Can you walk through the mechanism in terms of V and R?
“What I meant by ‘competing incentives’ is…”
This is just terminology, but let’s say the water company tries to get people to reduce their water usage by giving them a gift card when their water usage is lower in Month N+1 than in Month N. Two of the possible behaviors that might result are (A) the intended behavior, where people use less water each month, and (B) an unintended behavior, where people waste tons and tons of water in odd months to ensure that their usage definitely goes down in even months.
I would describe this as ONE INCENTIVE that incentivizes both of these two behaviors (and many other possible behaviors as well). Whereas you would describe it as “an incentive to do (A) and a competing incentive to do (B)”, apparently. (Right?) I’m not an economist, but when I google-searched for “competing incentives” just now, none of the results were using the term in the way that you’re using it here. Typically people used the phrase “competing incentives” to talk about two different incentive programs provided by two different organizations working at cross-purposes, or something like that.