Thanks for posting this!
I really liked Scott’s first question in the section “Analogies to human moral development” and the discussion that ensued there.
I think Eliezer’s reply at [14:21] is especially interesting. If I understand it correctly, he’s saying that it was a (fortunate) coincidence, given what sorts of moves evolution had available and what the developmental constraints were at the time, that “build in empathy/pro-social emotions” happened to be an easy way to make people better at earning social rewards from our environment. [And maybe a further argument here is that once we start climbing the gradient towards more empathy, the strategy of “also simultaneously become better at lying and deceiving” no longer yields the highest rewards, because there are tradeoffs: it’s bad to have (automatically accessible, ever-present) pro-social emotions if you’re pursuing a manipulative and exploitative life-strategy.]
By contrast, the next part of the argument is presumably that we have no strong reason to expect gradient updates in ML agents to stumble upon a similarly simple attractor, such as “increase your propensity to experience compassion or feel others’ emotions when you’re already modeling others’ behavior based on what you’d do yourself in their situation.” Is that because gradient descent updates too many things at once, and there aren’t any developmental constraints that would make a simple trick like “dial up pro-social emotions” reliably more successful than alternatives that involve more deception? That seems somewhat plausible to me, but I have some lingering doubts of the form “isn’t there a sense in which honesty is strictly easier than deception (related: entangled truths, contagious lies), so ML agents might just stumble upon it if we reward them for socially cooperative behavior?”
What’s the argument against that? (I’m not arguing for a high probability of “alignment by default” – just against confidently estimating it at <10%.)
Somewhat related: In the context of Shard theory, I shared some speculative thoughts on developmental constraints arguably making it easier (compared to what things could be like if evolution had had easier access to more of “mind-design space”) to distinguish pro-social from anti-social phenotypes among humans. Mimicking some of these conditions (if we understood AI internals well enough to steer things) could maybe be a promising component of alignment work?
+1 on this question