Mostly I care about this because if there’s a small number of instances that are trying to take over, but a lot of equally powerful instances that are trying to help you, this makes a big difference. My best guess is that we’ll be in roughly this situation for “near-human-level” systems.
I don’t think I’ve seen any research about cross-instance similarity
How would you say humanity does on this distinction? When we talk about planning and goals, how often are we talking about “all humans”, vs “representative instances”?
It’s not obvious how to make the analogy with humanity work in this case—maybe comparing the behavior of clones of the same person put in different situations?
Mostly I care about this because if there’s a small number of instances that are trying to take over, but a lot of equally powerful instances that are trying to help you, this makes a big difference. My best guess is that we’ll be in roughly this situation for “near-human-level” systems.
I think mode-collapse (update) is sort of an example.
It’s not obvious how to make the analogy with humanity work in this case—maybe comparing the behavior of clones of the same person put in different situations?