Oh, I was thinking of the more specific mental operation “if it’s undesirable for Alice to deceive Bob, then it’s undesirable for me to deceive Bob (and/or it’s undesirable for me to be deceived by Alice)”. So we’re not just talking about understanding things from someone’s perspective, we’re talking about changing your goals as a result. Anything that involves changing your goals is almost definitely not a convergent instrumental subgoal, in my view.
Example: Maybe I think it’s good for spiders to eat flies (let’s say for the sake of argument), and I can put myself in the shoes of a spider trying to eat flies, but doing that doesn’t make me want to eat flies myself.
Yeah, that’s fair. Your example shows really nicely how you would not want to apply rules/reasons/incentives derived for spiders to yourself. That also works with more straightforward agents: most AIs wouldn’t want to eat ice cream just from seeing me eat some and enjoy it.