We want some guarantees that the AGI will learn to put itself in the place of the agent doing the bad thing. It’s possible that it happens by default, but we don’t have any argument for that.
Yeah, I mean, the AGI could “put itself in the place of” Alice, or Bob, or neither. My pretty strong belief is that by default the answer would be “neither”, unless of course we successfully install human-like social instincts. I think “putting ourselves in the place of X” is a very specific thing that our social instincts make us want to do (sometimes); I don’t think it happens naturally.
Okay, so we have a crux in “putting ourselves in the place of X isn’t a convergent subgoal”. I need to think about it, but I think I recall animal cognition experiments which tested (positively) something like that in… crows? (and maybe other animals).
Oh, I was thinking of the more specific mental operation “if it’s undesirable for Alice to deceive Bob, then it’s undesirable for me to deceive Bob (and/or it’s undesirable for me to be deceived by Alice)”. So we’re not just talking about understanding things from someone’s perspective, we’re talking about changing your goals as a result. Anything that involves changing your goals is almost definitely not a convergent instrumental subgoal, in my view.
Example: Maybe I think it’s good for spiders to eat flies (let’s say for the sake of argument), and I can put myself in the shoes of a spider trying to eat flies, but doing that doesn’t make me want to eat flies myself.
Yeah, that’s fair. Your example shows really nicely how you would not want to apply rules/reasons/incentives you derived for spiders to yourself. That also works with more straightforward agents, as most AIs wouldn’t want to eat ice cream from seeing me eat some and enjoy it.