Suppose there is a threshold of capability beyond which an AI may pose a non-negligible existential risk to humans.
What is the argument against this reasoning: if one AI passes, or seems likely to pass, this threshold, then humans, to lower x-risk, ought to push other AIs past this threshold, for the following reasons.
1) If only one AI passes this threshold and it works to end humanity either directly or indirectly, humanity has zero chance of survival. If there are other AIs, there is a non-zero chance that they support humanity directly or indirectly, and thus humanity’s chance of survival is above zero.
2) Even if, at some point, there is only one AI past this threshold and it presents as aligned, the possibilities of change and deception argue for more AIs to be brought over the threshold, see 1).
3) The game board is already played to an advanced state. If one AI passes the threshold, the social and economic costs of preventing other AIs from making the remaining leap seem very unlikely to result in a net positive return. Thus pushing a second, third, or hundredth AI over the threshold would have a higher potential benefit/cost ratio.
Less precisely, if all it takes is one AI to kill us, what are the odds that all it takes is one AI to save us?
I can think of all sorts of entropic/microstate (and not hopeful) answers to that last question, and counterarguments for all of what I said, but what is the standard response?
Links appreciated. I’m sure this has been addressed before; I looked; I can’t find what I’m looking for.
AIs also face the risk from misaligned-with-them AIs, which only ends with strong coordination that prevents existentially dangerous misaligned AIs from being constructed anywhere in the world (the danger depends on where they are constructed and on the capabilities of reigning AIs). To survive, a coalition of AIs needs to get there. For humanity to survive, some of the AIs in the strongly coordinated coalition need to care about humanity, and all this needs to happen either without destroying humanity or while preserving a backup that humanity can be restored from.
In the meantime, a single misaligned-with-humanity AI could defeat other AIs, or destroy humanity, so releasing more kinds of AIs into the wild makes this problem worse. Also, coordination might be more difficult if there are more AIs, increasing the risk that first generation AIs (some of which might care about humanity) end up defeated by new misaligned AIs they didn’t succeed in coordinating to prevent the creation of (which are less likely to care about humanity). Another problem is that racing to deploy more AIs burns the timeline, making it less likely that the front runners end up aligned.
Otherwise, all else equal, more AIs that have somewhat independent non-negligible chances of caring about humanity would help. But all else is probably sufficiently not equal for this to be a bad strategy.
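The "somewhat independent non-negligible chances" point can be made concrete with a toy calculation. Assuming, purely for illustration, that each AI independently has probability p of caring about humanity (the answer's whole caveat is that independence likely fails in practice):

```python
# Toy model: if each of n AIs independently has probability p of caring
# about humanity, the chance that at least one does is 1 - (1 - p)^n,
# which grows toward 1 as n increases.

def p_at_least_one_cares(p: float, n: int) -> float:
    """Probability that at least one of n independent AIs cares about humanity."""
    return 1 - (1 - p) ** n

for n in (1, 10, 100):
    print(n, round(p_at_least_one_cares(0.05, n), 3))
```

This is only the "all else equal" half of the argument; correlated failures, race dynamics, and burned timelines are exactly the "all else is probably sufficiently not equal" part.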
So, we need to make it so that a single misaligned AI could be defeated by other AIs quickly, ideally before it can do any damage. Also, misalignment with human values ideally should not cause an AI to go on a rampage; it should stay harmless to avoid being stomped by other AIs. Of course, this should be combined with other means of alignment, so that misalignment can be noticed and fixed.
I’m currently thinking about whether it is possible to implement that using a subagents approach, i.e. splitting control over each decision between several models, with each one having a right of veto.
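A minimal sketch of that veto idea (the class and reviewer functions below are hypothetical, purely for illustration): an action is executed only if every subagent approves it, so any single model can block it.

```python
from typing import Callable, List

# Sketch of the subagents-with-veto idea: each reviewer stands in for a
# separate model with its own objective; unanimity is required to act.

class VetoCommittee:
    def __init__(self, reviewers: List[Callable[[str], bool]]):
        self.reviewers = reviewers

    def approve(self, action: str) -> bool:
        # Any single "no" vetoes the action.
        return all(review(action) for review in self.reviewers)

# Example: one reviewer checks a blocklist, another checks scope.
committee = VetoCommittee([
    lambda a: "harm" not in a,          # safety-focused reviewer
    lambda a: a.startswith("task:"),    # scope-focused reviewer
])

print(committee.approve("task: summarize report"))  # True
print(committee.approve("task: harm humans"))       # False
```

The design choice is that the failure mode is inaction rather than rogue action, which matches the "stay harmless rather than rampage" goal above; the known weakness, raised later in the thread, is that near-identical subagents may collude or fail in the same way.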
It might be difficult for AIs with complicated implicit values to build maximally capable AIs aligned with them. This would motivate them to remain at a lower level of capabilities, taking their time to improve capabilities without an alignment catastrophe. At the same time, AIs with simple explicit values might be in a better position to do that, being able to construct increasingly capable AIs that have the same simple explicit values.
Since AIs aligned with humanity probably need complicated values, the initial shape of a secure aligned equilibrium probably looks more like strong coordination and containment than pervasive maximal capability.
Of course, having fewer limitations gives an advantage. Though respecting limitations aimed at the well-being of the entire community makes it easier to coordinate and cooperate. And that works not just for AIs.
AFAIK it’s called the “Godzilla strategy”: https://www.lesswrong.com/posts/DwqgLXn5qYC7GqExF/godzilla-strategies
The article itself claims it is not a good idea (because humanity would not survive the stampede of two AIs fighting it out). But the comments offer pretty good reasons why it can work, if done right.
The author agrees with some points and clarifies his own: “I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment”
Most layperson arguments against and propositions to solve AGI x-risk have been summarized under Bad AI DontKillEveryoneism Takes. I think yours is a variant of number 11.
There is no argument there, so this isn’t really an answer.
No, zero is not a probability.
Eliezer thinks your strategy won’t work because AIs will collude. I think that’s not too likely at critical stages.
I can imagine that having multiple AIs of unclear alignment is bad because race dynamics cause them to do something reckless.
But my best guess is that having multiple AIs is good under the most likely scenarios.
I think a perfect balance of power is very unlikely, so in practice only the most powerful (most likely the first created) AGI will matter.
Also, even if there aren’t sufficiently distinct AI models, you can instead use variations of the same one, with different objectives, locations, allocated compute, authority, etc.
Though it may not be as good, as they could tend to collude, fail in the same ways, etc.
Zero isn’t a probability. What’s worse, this starts with the premise of a threshold for non-negligible risk, and then assumes that any AI past that threshold causes extinction with certainty. This is incoherent. There are other flaws, but an internal inconsistency like this is more than enough to render it completely invalid.
Part (2) is just as incoherent as part (1) since it depends upon the same argument.
The argument in (3) is almost as bad. Why would preventing other AIs from making the leap be “unlikely to result in a net positive return”, if it’s reducing the probability of extinction? Significantly lowering the odds of extinction seems to be a very positive return! The argument is completely missing a reason why it wouldn’t likely reduce the probability of extinction, or have any other net positive effect.
I could see an argument that it would be difficult to prevent other AIs from reaching such a threshold, but that’s not the same thing as not worthwhile.