Okay, so we have a crux in “putting ourselves in the place of X isn’t a convergent subgoals”. I need to think about it, but I think I recall animal cognition experiments which tested (positively) something like that in… crows? (and maybe other animals).
Oh, I was thinking of the more specific mental operation “if it’s undesirable for Alice to deceive Bob, then it’s undesirable for me to deceive Bob (and/or it’s undesirable for me to be deceived by Alice)”. So we’re not just talking about understanding things from someone’s perspective, we’re talking about changing your goals as a result. Anything that involves changing your goals is almost definitely not a convergent instrumental subgoal, in my view.
Example: Maybe I think it’s good for spiders to eat flies (let’s say for the sake of argument), and I can put myself in the shoes of a spider trying to eat flies, but doing that doesn’t make me want to eat flies myself.
Yeah, that’s fair. Your example shows really nicely how you would not want to apply rules/reasons/incentives you derived to spiders to yourself. That also work with more straightforward agents, as most AIs wouldn’t want to eat ice cream from seeing me eat some and enjoy it.
Okay, so we have a crux in “putting ourselves in the place of X isn’t a convergent subgoals”. I need to think about it, but I think I recall animal cognition experiments which tested (positively) something like that in… crows? (and maybe other animals).
Oh, I was thinking of the more specific mental operation “if it’s undesirable for Alice to deceive Bob, then it’s undesirable for me to deceive Bob (and/or it’s undesirable for me to be deceived by Alice)”. So we’re not just talking about understanding things from someone’s perspective, we’re talking about changing your goals as a result. Anything that involves changing your goals is almost definitely not a convergent instrumental subgoal, in my view.
Example: Maybe I think it’s good for spiders to eat flies (let’s say for the sake of argument), and I can put myself in the shoes of a spider trying to eat flies, but doing that doesn’t make me want to eat flies myself.
Yeah, that’s fair. Your example shows really nicely how you would not want to apply rules/reasons/incentives you derived to spiders to yourself. That also work with more straightforward agents, as most AIs wouldn’t want to eat ice cream from seeing me eat some and enjoy it.