Rephrasing it, you mean that we want some guarantees that the AGI will learn to put itself in the place of the agent doing the bad thing. It’s possible that this happens by default, but we don’t have any argument for that, so let’s try solving the problem by transforming the AGI’s third-person knowledge into first-person knowledge.
Is that right?
Another issue is that we may in fact want the AI to apply different standards to itself versus humans: for example, it’s very bad for the AGI to be deceptive, but we want the AGI to literally not care about people being deceptive to each other, and in particular we want the AGI not to try to intervene when it sees one bystander being deceptive to another bystander.
we want some guarantees that the AGI will learn to put itself in the place of the agent doing the bad thing. It’s possible that it happens by default, but we don’t have any argument for that
Yeah, I mean, the AGI could “put itself in the place of” Alice, or Bob, or neither. My pretty strong belief is that by default the answer would be “neither”, unless of course we successfully install human-like social instincts. I think “putting ourselves in the place of X” is a very specific thing that our social instincts make us want to do (sometimes); I don’t think it happens naturally.
Okay, so we have a crux in “putting ourselves in the place of X isn’t a convergent subgoal”. I need to think about it, but I think I recall animal cognition experiments that found positive evidence for something like that in… crows? (and maybe other animals).
Oh, I was thinking of the more specific mental operation “if it’s undesirable for Alice to deceive Bob, then it’s undesirable for me to deceive Bob (and/or it’s undesirable for me to be deceived by Alice)”. So we’re not just talking about understanding things from someone’s perspective, we’re talking about changing your goals as a result. Anything that involves changing your goals is almost definitely not a convergent instrumental subgoal, in my view.
Example: Maybe I think it’s good for spiders to eat flies (let’s say for the sake of argument), and I can put myself in the shoes of a spider trying to eat flies, but doing that doesn’t make me want to eat flies myself.
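To make that distinction concrete, here is a minimal toy sketch (all names here, like `Judgment`, `first_person_transfer`, and `simulate_perspective`, are made up purely for illustration; neither speaker is proposing this as a mechanism). It separates the contested rewrite, which turns a third-person judgment such as “it’s undesirable for Alice to deceive Bob” into first-person constraints, from perspective-taking as pure prediction, which leaves the modeller’s own goals untouched:

```python
from dataclasses import dataclass

# A third-person normative judgment: "it is undesirable for <agent> to do <action> to <patient>".
@dataclass(frozen=True)
class Judgment:
    agent: str
    action: str
    patient: str
    undesirable: bool

def first_person_transfer(judgment: Judgment, self_name: str) -> list[Judgment]:
    """The 'more specific mental operation' discussed above, as a pure rewrite:
    from 'it's undesirable for Alice to deceive Bob', derive first-person
    analogues by substituting the system itself into the agent slot (and,
    separately, the patient slot). Nothing here says the system *wants* to
    run this rewrite; that is exactly the point under debate."""
    return [
        Judgment(agent=self_name, action=judgment.action,
                 patient=judgment.patient, undesirable=judgment.undesirable),
        Judgment(agent=judgment.agent, action=judgment.action,
                 patient=self_name, undesirable=judgment.undesirable),
    ]

def simulate_perspective(agent: str, goal: str) -> str:
    """Perspective-taking as pure prediction: modelling what the spider is
    trying to do, without importing its goal into the modeller's own goals."""
    return f"predicted behaviour of {agent}: pursue '{goal}'"

if __name__ == "__main__":
    observed = Judgment(agent="Alice", action="deceive", patient="Bob", undesirable=True)
    # The contested step: turning a third-person judgment into first-person constraints.
    for derived in first_person_transfer(observed, self_name="AGI"):
        print(derived)
    # Perspective-taking alone leaves the modeller's goals untouched.
    print(simulate_perspective("spider", "eat flies"))
```

The only point of the sketch is that `first_person_transfer` is an extra, optional step on top of `simulate_perspective`; being able to model the spider does not by itself force that rewrite to run.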
Yeah, that’s fair. Your example shows really nicely how you would not want to apply rules/reasons/incentives you derived for spiders to yourself. That also works with more straightforward agents: most AIs wouldn’t want to eat ice cream just from seeing me eat some and enjoy it.
Fair enough, I hadn’t thought about that.