I think AGIs which are copies of each other—even AGIs which are built using the same training method—are likely to coordinate very well with each other even if they are not given information about each other’s existence. Basically, they’ll act like one agent, as far as deception and treacherous turns and decisive strategic advantage are concerned.
EDIT: I also suspect this coordination might extend further, to AGIs with different architectures. Thus even the third-tier $10K AGIs might effectively act as co-conspirators with the latest model, and/or vice versa.
I also suspect this coordination might extend further, to AGIs with different architectures.
Why would you suppose that? The design space of AI is incredibly large and humans are clear counter-examples, so the question one ought to ask is: Is there any fundamental reason an AGI that refuses to coordinate will inevitably fall off the AI risk landscape?
I agree that coordination between mutually aligned AIs is plausible.
I think such coordination is less likely in our example because we can probably anticipate and avoid it for human-level AGI.
I also think there are strong commercial incentives to avoid building mutually aligned AGIs. You can’t sell (access to) a system if there is no reason to believe the system will help your customer. Rather, I expect systems to be fine-tuned for each task, as in the current paradigm. (The systems may successfully resist fine-tuning once they become sufficiently advanced.)
I’ll also add that two copies of the same system are not necessarily mutually aligned. See for example debate and other self-play algorithms.
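To illustrate what I mean, here is a toy sketch of a generic zero-sum self-play setup (my own illustration, not any particular debate implementation; the Policy class and reward scheme are placeholders): two literal copies of the same system can be trained against exactly opposed reward signals, so sharing weights does not imply sharing goals.

```python
# Toy illustration (assumed setup, not a real debate codebase): two copies
# of the same "policy" receive exactly opposed rewards in a zero-sum game,
# so identical systems need not be mutually aligned.
import copy
import random


class Policy:
    """Placeholder for a learned policy; here it just proposes a number."""
    def act(self) -> float:
        return random.random()


def zero_sum_episode(agent_a: Policy, agent_b: Policy) -> tuple[float, float]:
    """Whatever agent A gains, agent B loses, and vice versa."""
    a, b = agent_a.act(), agent_b.act()
    reward_a = 1.0 if a > b else -1.0
    return reward_a, -reward_a  # rewards sum to zero by construction


agent = Policy()
opponent = copy.deepcopy(agent)  # an exact copy of the same system

r_a, r_b = zero_sum_episode(agent, opponent)
print(r_a, r_b)  # e.g. 1.0 -1.0: same weights, opposed training signals
```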
I agree about the strong commercial incentives, but I don’t think we will be in a context where people will follow their incentives. After all, there are incredibly strong incentives not to make AGI at all until you can be very confident it is perfectly safe—strong enough that it’s probably not a good idea to pursue AI research at all until AI safety research is much more well-established than it is today—and yet here we are.
Basically, people won’t recognize their incentives, because people won’t realize how much danger they are in.
Hmm, in my model most of the x-risk is gone if there is no incentive to deploy. But I expect actors will deploy systems because their system is aligned with a proxy. At least this leads to short-term gains. Maybe the crux is that you expect these actors to suffer a large private harm (death) and I expect a small private harm (for each system, a marginal distributed harm to all of society)?
It makes no difference if the marginal distributed harm to all of society is so overwhelmingly large that your share of it is still death.
I’m using the colloquial meaning of ‘marginal’ = ‘not large’.