It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there’s some common currency to the experience (for example, they’re feeling pain, and I’ve also experienced pain), but probably less so when there’s a greater gap. Since AIs won’t share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI to have separate circuits for modeling humans and for modeling itself.
Yes, this depends a lot on the self-model of the AGI. It’s definitely not a silver bullet. The AGI will almost certainly have a very good model of humans, their culture, and how their minds work from various self-supervised losses. Whether the AGI conceptualises itself as close to this or not depends on the representations of AGI in the dataset, and potentially on our training regime as well.
Nitpick about terminology: I think the stuff you’re talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use “value function” to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning.
I agree it is not necessarily the reward model that generates direct feelings. I think it is hard to connect any part of an RL system directly to gut-level ‘feels’ because we don’t really know what these are. The value function is just the estimate of the long-run reward and is trained with a supervised loss on the Bellman equation. It is very possible that the machinery that creates these feelings won’t exist at all in the AGI, or maybe it is just some intrinsic property of RL agents; I don’t know.
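For concreteness, here is a minimal sketch of what "an estimate of the long-run reward trained on the Bellman equation" can look like, assuming a toy deterministic chain of states and tabular TD(0). The function names, the reward sequence, and the hyperparameters are all illustrative assumptions, not a description of any real system, and nothing in it speaks to whether such an estimate corresponds to a gut-level feeling.

```python
def td0_values(rewards, gamma=0.9, alpha=0.1, episodes=2000):
    """Tabular TD(0) on a hypothetical deterministic chain.

    State i always moves to state i + 1 and yields rewards[i];
    the state after the last one is terminal with value 0.
    """
    n = len(rewards)
    values = [0.0] * (n + 1)  # final entry is the terminal state, never updated
    for _ in range(episodes):
        for s in range(n):
            # Bellman target: immediate reward plus discounted next-state estimate
            target = rewards[s] + gamma * values[s + 1]
            # Regression-style step toward the target (the "supervised" part)
            values[s] += alpha * (target - values[s])
    return values[:n]


def discounted_returns(rewards, gamma=0.9):
    """Exact discounted return from each state, for comparison."""
    out = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0]  # illustrative: reward only at the end of the chain
    print(td0_values(rewards))
    print(discounted_returns(rewards))
```

The learned values approach the exact discounted returns, so earlier states come to "anticipate" a reward that only arrives later. That forecast is all the value function is in this sketch.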