To the extent Steve is right that “[understanding] the algorithms in the human brain that give rise to social instincts and [putting] some modified version of those algorithms into our AGIs” is a worthwhile safety proposal, I think we should be focusing our attention on instantiating the relevant algorithms that underlie affective and cognitive ToM + affective empathy.
It seems to me like you would very likely get both cognitive and affective theory of mind “for free” in the sense that they’re necessary things to understand for predicting humans well. If we expect to gain something from studying how humans implement these processes, it’d have to be something like ensuring that our AIs understand them “in the same way that humans do,” which e.g. might help our AIs generalize in a similar way to humans.
This is notably in contrast to affective empathy, though, which is not something that’s inherently necessary for predictive accuracy—so figuring out how/why humans do that has a more concrete story for how that could be helpful.
If a system can’t do online learning at all, it is unclear how it would end up with Jim-like preferences about its own preferences—presumably, while bitterness aversion is hardcoded into the reward function calculator at “deployment,” his preference to keep a healthy diet is not. So, if this latter preference is to emerge at some point, there has to be some mechanism for incorporating it into the value function in an online manner (condition 1, above).
Why couldn’t the preference for a healthy diet emerge during training? I don’t understand why you think online learning is necessary here. It feels like, rather than “online learning” being the important thing here, what you’re really relying on is just “learning.”
For my part, I strongly agree with the first part, and I said something similar in my comment.
For the second part, if we’re talking about within-lifetime brain learning / thinking, we’re talking about online-learning. For example, if I’m having a conversation with someone, and they tell me their name is Fred, and then 2 minutes later I say “Well Fred, this has been a lovely conversation”, I can thank online-learning for my remembering their name. Another example: the math student trying to solve a homework problem (and learning from the experience) is using the same basic algorithms as the math professor trying to prove a new theorem—even if the first is vaguely analogous to “training” and the second to “deployment”.
So then you can say: “Well fine, but online learning is pretty unfashionable in ML today. Can we talk about what the brain’s within-lifetime learning algorithms would look like without online learning?” And I would say: “Ummmm, I don’t know. I’m not sure that’s a coherent or useful thing to talk about. A brain without online-learning would look like unusually severe anterograde amnesia.”
That’s not a criticism of what you said. Just a warning that “non-online-learning versions of brain algorithms” is maybe an incoherent notion that we shouldn’t think too hard about. :)
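To make the distinction being drawn here concrete, here is a minimal editorial sketch (not either commenter’s proposal; every name in it, such as `ValueFunction`, `td_update`, and `run`, is a hypothetical placeholder). The only difference between “online learning” and “frozen after training” in the sketch is whether the update keeps being called once we label the phase “deployment.”

```python
# Minimal illustrative sketch (assumed names throughout, not from the discussion).
import numpy as np

class ValueFunction:
    """Toy linear value estimate over feature vectors."""
    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.lr = lr

    def value(self, features):
        return float(self.w @ features)

    def td_update(self, features, reward, next_features, gamma=0.9):
        # Standard TD(0) update: move the estimate toward reward + discounted next value.
        target = reward + gamma * self.value(next_features)
        self.w += self.lr * (target - self.value(features)) * features

def run(agent, experience, learn):
    """Step through (features, reward, next_features) tuples; update only if learn=True."""
    for features, reward, next_features in experience:
        if learn:
            agent.td_update(features, reward, next_features)

# learn=True during "training" and learn=False afterward gives the frozen-weights
# picture; keeping learn=True throughout is online learning. With learn=False,
# nothing the system encounters later can install a new learned preference,
# which is the sense of the anterograde-amnesia analogy above.
```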
If we expect to gain something from studying how humans implement these processes, it’d have to be something like ensuring that our AIs understand them “in the same way that humans do,” which e.g. might help our AIs generalize in a similar way to humans.
I take your point that there is probably nothing special about the specific way(s) that humans get good at predicting other humans. I do think that “help[ing] our AIs generalize in a similar way to humans” might be important for safety (e.g., we probably don’t want an AGI that figures out its programmers way faster/more deeply than they can figure it out). I also think it’s the case that we don’t currently have a learning algorithm that can predict humans as well as humans can predict humans. (Some attempts, but not there yet.) So to the degree that current approaches are lacking, it makes sense to me to draw some inspiration from the brain-based algorithms that already implement these processes extremely well—i.e., to first understand these algorithms, and to later develop training goals in accordance with the heuristics/architecture these algorithms seem to instantiate.
This is notably in contrast to affective empathy, though, which is not something that’s inherently necessary for predictive accuracy—so figuring out how/why humans do that has a more concrete story for how that could be helpful.
Agreed! I think it’s worth noting that if you take seriously the ‘hierarchical IRL’ model I proposed in the ToM section, understanding the algorithm(s) underlying affective empathy might actually require understanding cognitive and affective ToM (i.e., if these are the substrate of affective empathy, we’ll probably need a good model of them before we can have a good model of affective empathy).
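To make that dependency concrete, a purely schematic sketch (illustrative only; this is not the ‘hierarchical IRL’ model itself, and every class and function name below is a hypothetical placeholder): affective empathy consumes the outputs of cognitive and affective ToM, so a model of the former presupposes models of the latter.

```python
# Schematic only: the point is the dependency structure, not the contents.
from dataclasses import dataclass

@dataclass
class MentalStateEstimate:   # what cognitive ToM outputs: inferred beliefs/goals
    beliefs: dict
    goals: dict

@dataclass
class AffectEstimate:        # what affective ToM outputs: inferred feelings
    valence: float
    arousal: float

def cognitive_tom(observed_behavior) -> MentalStateEstimate:
    # Placeholder for IRL-style inference of beliefs and goals from behavior.
    return MentalStateEstimate(beliefs={}, goals={})

def affective_tom(observed_behavior, mental_state: MentalStateEstimate) -> AffectEstimate:
    # Placeholder for inferring the other agent's affect, conditioned on the
    # mental-state estimate produced by cognitive ToM.
    return AffectEstimate(valence=0.0, arousal=0.0)

def affective_empathy(affect: AffectEstimate) -> float:
    # The step prediction alone does not require: the inferred affect feeds into
    # the observer's own reward/affect rather than just its world model.
    return affect.valence

# You cannot compute affective_empathy's input without first running both ToM
# modules, which is the sense in which a good model of empathy may presuppose
# good models of cognitive and affective ToM.
```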
And wrt learning vs. online learning, I think I’m largely in agreement with Steve’s reply. I would also add that this might end up just being a terminological dispute depending on how flexible we are with calling particular phases “training” vs. “deployment.” E.g., is a brain “deployed” when the person’s genetic make-up as a zygote is determined? Or is it when they’re born? When their brain stops developing? When they learn the last thing they’ll ever learn? To the degree we think these questions are awkward/their answers are arbitrary, I would think this counts as evidence that the notion of “online learning” is useful to invoke here/gives us more parsimonious answers.