Thanks for watching the talk and for the insightful comments! A couple of thoughts:
I agree that mirror neurons are problematic both theoretically and empirically, so I avoided framing that data in terms of mirror neurons. I interpret the premotor cortex data and most other self-other overlap data under definition (A) described in your post.
Regarding the second point, I think the issue you brought up correctly identifies an incentive for updating away from “self-other overlap”, but this incentive doesn’t seem to be fully realised in humans, and I expect it not to be fully realised in AI agents either, due to stronger competing incentives that favour self-other overlap. One possible explanation is the attitude to risk incorporated into the subjective reward value. In this paper, in the section “Risky rewards, subjective value, and formal economic utility”, it is mentioned that monkeys show nonlinear utility functions that are compatible with risk seeking at small juice amounts and risk avoiding at larger amounts. Suppose the reward predictor predicts “I will be eating chocolate”. Because the expected reward is fairly low, humans may be risk-seeking in that regime, so our subjective reward value is high and we want to take that bet. And because the reward prediction error might be fairly small, it could potentially be overridden by a subjective reward value that is influenced by our attitude to risk. This might explain how empathic responses from self-other overlap are kept in low-stakes scenarios, while there are stronger incentives to update away from self-other overlap in higher-stakes scenarios. Humans seem to have learned to modulate their empathy, which might prevent some reward prediction error. It is also possible that we have evolved to be more risk-seeking when it comes to empathic concern, because the evolutionary benefits of empathy increase the subjective reward value of the empathic response due to self-other overlap in higher-stakes scenarios, but I am uncertain. I am curious what you think about this hypothesis.
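To make the risk-attitude idea a bit more concrete, here is a minimal sketch of a utility function that is risk-seeking at small amounts and risk-averse at larger ones. The S-shaped form `x^2 / (x^2 + k^2)`, the parameter `k`, and the 50/50 gamble are all assumptions chosen for illustration; they are not taken from the paper and are not a claim about how the brain computes subjective value.

```python
def utility(amount, k=1.0):
    """Hypothetical S-shaped utility: convex for small amounts, concave for large ones."""
    return amount**2 / (amount**2 + k**2)


def prefers_gamble(amount, k=1.0):
    """True if a 50/50 gamble between 0 and 2 * amount has higher expected
    utility than receiving `amount` for sure (same expected amount)."""
    expected_utility_gamble = 0.5 * utility(0.0, k) + 0.5 * utility(2 * amount, k)
    return expected_utility_gamble > utility(amount, k)


# Low stakes: the gamble wins, i.e. risk-seeking behaviour.
small_stakes = prefers_gamble(0.2)
# Higher stakes: the sure amount wins, i.e. risk-averse behaviour.
large_stakes = prefers_gamble(2.0)

print(small_stakes, large_stakes)
```

With this particular (assumed) shape, the preference flips from the gamble to the sure amount as the stakes grow, which is the qualitative pattern the hypothesis relies on.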
I agree that the theory of change that I presented is simplistic, and I should have stated the key uncertainties of this proposal more explicitly, although I did mention throughout the talk that I do not think inducing self-other overlap is enough and that we still have to ensure that the set of incentives that shape the agent’s behaviour favours good outcomes. What I was trying to communicate in the theory of change section is that self-other overlap sets incentives that favour AI not killing us (self-preservation as an instrumentally convergent goal given which self-other overlap provides an incentive for other-preservation) and sub-agent stability of self-other overlap (due to the AI expecting its self/other-preservation preferences to be frustrated if the agents that it creates, including improved versions of itself, don’t have self/other-preservation preferences) but I failed to put enough emphasis on the fact that this will only happen if we find ways to ensure that these incentives dominate and are not overridden by competing incentives and mechanisms. I think that the competing incentives problem is tractable, which is one of the main reasons I believe that this research direction is promising.
I’m confused by your second bullet point. Let’s say you really like chocolate chip cookies but you’re strictly neutral on peanut butter chocolate chip cookies. And they look and smell the same until you bite in (maybe you have a bad sense of smell).
Now you see a plate on a buffet with a little sign next to it that says “Peanut butter chocolate chip cookies”. You ask your trusted friend whether the sign is correct, and they say “Yeah, I just ate one, it was very peanut buttery, yum.” Next to that plate is a plate of brownies, which you like a little bit, but much less than you like chocolate chip cookies.
Your model seems to be: “I won’t be completely sure that the sign isn’t wrong and my friend isn’t lying. So being risk-seeking, I’m going to eat the cookie, just in case.”
I don’t think that model is realistic though. Obviously you’re going to believe the sign & believe your trusted friend. The odds that the cookies aren’t peanut butter are essentially zero. You know this very well. So you’ll go for the brownie instead.
Now we switch to the empathy case. Again, I think it’s perfectly obvious to anyone that a plate labeled & described as “peanut butter chocolate chip cookies” is in fact full of peanut butter chocolate chip cookies, and people will act accordingly. Well, it’s even more obvious by far that “my friend eating a chocolate chip cookie” is not in fact “me eating a chocolate chip cookie”! So, if I don’t feel an urge to eat the cookie that I know damn well has peanut butter in it, I likewise won’t feel an urge to take actions that I know damn well will lead to someone else eating a yummy cookie instead of me.
So anyway, you seem to be assuming that the human brain has no special mechanisms to prevent the unlearning of self-other overlap. I would propose instead that the human brain does have such special mechanisms, and that we better go figure out what those mechanisms are. :)
“self-other overlap sets incentives that favour AI not killing us (self-preservation as an instrumentally convergent goal given which self-other overlap provides an incentive for other-preservation) and sub-agent stability of self-other overlap (due to the AI expecting its self/other-preservation preferences to be frustrated if the agents that it creates, including improved versions of itself, don’t have self/other-preservation preferences) but I failed to put enough emphasis on the fact that this will only happen if we find ways to ensure that these incentives dominate and are not overridden by competing incentives and mechanisms”
I’m a bit confused by this. My “apocalypse stories” from the grandparent comment did not assume any competing incentives and mechanisms, right? They were all bad actions that I claim also flowed naturally from self-other-overlap-derived incentives.
I am slightly confused by your hypothetical. The hypothesis is rather that when the predicted reward from seeing my friend eating a cookie due to self-other overlap is lower than the obtained reward of me not eating a cookie, the self-other overlap might not be updated against because the increase in subjective reward from risking to get the obtained reward is higher than the prediction error in this low stakes scenario. I am fairly uncertain about this being what actually happens but I put it forward as a potential hypothesis.
“So anyway, you seem to be assuming that the human brain has no special mechanisms to prevent the unlearning of self-other overlap. I would propose instead that the human brain does have such special mechanisms, and that we better go figure out what those mechanisms are. :)”
My intuition is that the brain does have special mechanisms to prevent the unlearning of self-other overlap, so I agree with you that we should look into the literature to understand them; one would expect such mechanisms to evolve given the incentives to unlearn self-other overlap and the evolutionary benefits of self-other overlap. One such mechanism could be the brain being more risk-tolerant when it comes to empathic responses and not updating against self-other overlap when the predicted reward is lower than the obtained reward, but I don’t have a model of how exactly this would be implemented.
“I’m a bit confused by this. My “apocalypse stories” from the grandparent comment did not assume any competing incentives and mechanisms, right? They were all bad actions that I claim also flowed naturally from self-other-overlap-derived incentives.”
What I meant by “competing incentives” is any incentive that competes with the good incentives I described (other-preservation and sub-agent stability), which could include bad incentives that also flow naturally from self-other overlap.
“when the predicted reward from seeing my friend eating a cookie due to self-other overlap is lower than the obtained reward of me not eating a cookie, the self-other overlap might not be updated against because the increase in subjective reward from risking to get the obtained reward is higher than the prediction error in this low stakes scenario. I am fairly uncertain about this being what actually happens but I put it forward as a potential hypothesis.”
I think you’re confused, or else I don’t follow. Can we walk through it? Assume traditional ML-style actor-critic RL with TD learning, if that’s OK. There’s a value function V and a reward function R. Let’s assume we start with:
V(I’m about to eat a cookie) = 10
V(Joe is about to eat a cookie) = 2 [it’s nonzero “by default” because of self-other overlap]
R(I’m eating a cookie) = 10
R(Joe is eating a cookie) = 0
So when I see that Joe is about to eat a cookie, this pleases me (V>0). But then Joe eats the cookie, and the reward is zero, so TD learning kicks in and reduces V(Joe is about to eat a cookie) for next time. Repeat a few more times and V(Joe is about to eat a cookie) approaches zero, right? So eventually, when I see that Joe is about to eat a cookie, I don’t care.
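Here is a minimal sketch of that walkthrough, in case it helps to see the numbers. It uses a plain TD(0) update and treats “eating the cookie” as a terminal step whose next-state value is 0; the learning rate of 0.5, the discount of 1.0, and the function names are my own assumptions for illustration, not anything specific to the brain or to a particular RL library.

```python
def td_update(value, reward, next_value=0.0, alpha=0.5, gamma=1.0):
    """One TD(0) update: V <- V + alpha * (R + gamma * V_next - V)."""
    td_error = reward + gamma * next_value - value
    return value + alpha * td_error


def simulate(initial_value, reward, episodes, alpha=0.5):
    """Repeat the same 'about to eat -> eats' episode and track V over time."""
    values = [initial_value]
    for _ in range(episodes):
        values.append(td_update(values[-1], reward, alpha=alpha))
    return values


# V(Joe is about to eat a cookie) starts at 2, but R(Joe is eating a cookie) = 0.
joe_values = simulate(initial_value=2.0, reward=0.0, episodes=5)
# V(I'm about to eat a cookie) starts at 10 and R(I'm eating a cookie) = 10.
my_values = simulate(initial_value=10.0, reward=10.0, episodes=5)

print(joe_values)  # halves each episode, heading toward zero
print(my_values)   # stays at 10, since the prediction already matches the reward
```

Under these assumptions, nothing in the update rule protects V(Joe is about to eat a cookie) from decaying, so any story in which it persists needs an extra term or mechanism somewhere in V or R.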
How does your story differ from that? Can you walk through the mechanism in terms of V and R?
“What I meant by “competing incentives” is…”
This is just terminology, but let’s say the water company tries to get people to reduce their water usage by giving them a gift card when their water usage is lower in Month N+1 than in Month N. Two of the possible behaviors that might result are (A) the intended behavior where people use less water each month, (B) an unintended behavior where people waste tons and tons of water in odd months to ensure that it definitely will go down in even months.
I would describe this as ONE INCENTIVE that incentivizes both of these two behaviors (and many other possible behaviors as well). Whereas you would describe it as “an incentive to do (A) and a competing incentive to do (B)”, apparently. (Right?) I’m not an economist, but when I google-searched for “competing incentives” just now, none of the results were using the term in the way that you’re using it here. Typically people used the phrase “competing incentives” to talk about two different incentive programs provided by two different organizations working at cross-purposes, or something like that.
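To illustrate what I mean by one incentive rewarding both behaviors, here is a small sketch of the gift-card rule. The usage numbers and the function name `gift_cards` are made up for the example; the only thing taken from the scenario is the rule itself (a gift card whenever Month N+1 usage is lower than Month N usage).

```python
def gift_cards(monthly_usage):
    """Count the months in which usage dropped relative to the previous month."""
    return sum(
        1
        for previous, current in zip(monthly_usage, monthly_usage[1:])
        if current < previous
    )


# (A) Intended behavior: use a little less water every month.
steady_saver = [100, 95, 90, 85, 80, 75]
# (B) Unintended behavior: waste water in odd months so even months always drop.
odd_month_waster = [300, 100, 300, 100, 300, 100]

print(gift_cards(steady_saver), sum(steady_saver))
print(gift_cards(odd_month_waster), sum(odd_month_waster))
```

The same single rule pays out for both usage patterns, even though the second one uses far more water in total.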