At a high level, I agree that something related to empathy can happen when the same circuits are used for processing thoughts-about-others as for thoughts-about-self. This seems like a design pattern that might be worth copying. My main concerns are:
It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there’s some common currency to the experience (e.g. they’re feeling pain, and I’ve also experienced pain), but probably less so when there’s a greater gap. Since AIs won’t share any of our physiology or evolutionary history, I worry that this common currency will be missing, which would seemingly incentivize the AI to have separate circuits for modeling humans and for modeling itself.
This doesn’t seem like it’d give us a robust enough version of empathy by itself, because the agent isn’t motivated to actively seek out opportunities to empathize. As an analogy: I know that if I were forced to think about, or even look at, the process that produces hamburger meat, I would probably have a visceral reaction and not want to eat the burger. But I like burgers, so I don’t seek out that train of thought, and the empathy and disgust that would have been invoked lie dormant. Maybe something like Anthropic’s Constitutional AI method would help in this direction...
Nitpick about terminology: I think the stuff you’re talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning. A reward model, on the other hand, is just another part of your model of the world, so it might not be connected to visceral “feels” and doesn’t necessarily have any sway over decision-making, in the same way that your “will this number be even or odd” model isn’t connected to any visceral “feels”, so you don’t tend to make decisions based primarily on its predictions.
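To make the distinction concrete, here’s a minimal toy sketch; the tabular chain MDP, the parameter values, and the variable names are all made up for illustration, not taken from anyone’s proposal. The point is just that the two objects can share the same machinery and differ only in their training target: immediate reward versus a bootstrapped TD/Bellman target.

```python
# Toy illustration: same function-approximation machinery, two training targets.
# A reward model regresses toward the *immediate* reward; a value function
# regresses toward a bootstrapped TD target, so its outputs encode forecasts.

import numpy as np

rng = np.random.default_rng(0)

gamma = 0.9            # discount factor
alpha = 0.1            # learning rate
n_states = 5           # chain MDP: 0 -> 1 -> 2 -> 3 -> 4 (terminal)
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])  # reward received on entering each state

reward_model = np.zeros(n_states)  # predicts immediate reward from state s
value_fn = np.zeros(n_states)      # predicts discounted long-run reward from state s

for _ in range(5000):
    s = rng.integers(0, n_states - 1)   # sample a non-terminal state
    s_next = s + 1
    r = rewards[s_next]
    done = (s_next == n_states - 1)

    # Reward model: plain supervised regression toward the immediate reward.
    reward_model[s] += alpha * (r - reward_model[s])

    # Value function: TD(0) update toward the Bellman target r + gamma * V(s').
    td_target = r + (0.0 if done else gamma * value_fn[s_next])
    value_fn[s] += alpha * (td_target - value_fn[s])

print("reward model:", reward_model.round(2))  # ~[0, 0, 0, 1, 0]: only the last step lights up
print("value fn:    ", value_fn.round(2))      # ~[0.73, 0.81, 0.9, 1, 0]: forecasts propagate back
```

In the chain, only the value function assigns nonzero numbers to the early states; that forward-looking signal is the “forecast” doing the work in the visceral-reaction story above.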
Also, if you haven’t read this post, I think it’s a good one and very related.
It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there’s some common currency to the experience (e.g. they’re feeling pain, and I’ve also experienced pain), but probably less so when there’s a greater gap. Since AIs won’t share any of our physiology or evolutionary history, I worry that this common currency will be missing, which would seemingly incentivize the AI to have separate circuits for modeling humans and for modeling itself.
Yes, this depends a lot on the self-model of the AGI. It’s definitely not a silver bullet. The AGI will almost certainly have a very good model of humans, their culture, and how their minds work from various self-supervised losses. Whether the AGI conceptualises itself as close to this or not depends on how AGI is represented in the dataset, as well as, potentially, on our training regime.
Nitpick about terminology: I think the stuff you’re talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning.
I agree it is not necessarily the reward model that generates direct feelings. I think it is hard to connect any part of an RL system directly to gut-level ‘feels’, because we don’t really know what these are. The value function is just an estimate of the long-run reward and is trained with a supervised Bellman-equation objective. It is very possible that the machinery that creates this won’t exist at all in the AGI, or maybe it is just some intrinsic property of RL agents; I don’t know.
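To spell out that training objective (this is just the standard TD/Bellman setup, not a claim about how any particular AGI will be built): the value function approximates the discounted long-run reward, and the “supervised Bellman equation” amounts to regressing toward a bootstrapped target,

$$V^\pi(s) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s\right], \qquad \mathcal{L}(\theta) \;=\; \big(r + \gamma\, V_{\bar{\theta}}(s') - V_\theta(s)\big)^{2},$$

where the target term $r + \gamma V_{\bar{\theta}}(s')$ is held fixed during the update (the bootstrap). Whether an AGI trained end-to-end even contains a cleanly separable module like this is, as you say, an open question.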