[Question] Will an Overconfident AGI Mistakenly Expect to Conquer the World?


I’m wondering how selection effects will influence the first serious attempt by an AGI to take over the world.

My question here is inspired by people who argue that an AGI couldn't conquer the world because it would depend on humans to provide electricity, semiconductors, etc.

For the purposes of this post, I'm assuming that AI becomes smarter than any single human at most tasks, via advances comparable to the kinds of innovations that have driven AI over the past decade.

I'm assuming that capabilities advance at roughly the same pace as they did over the past decade, say no more than a 5x speedup. So nothing that I'd classify as foom: in this scenario, foom is not imminent.

I'm assuming AIs will not be much more risk-averse than humans are. I'm unsure whether developers will have much control over this, or whether they'll even want AIs to be risk-averse.

I expect this world to have at least as wide a variety of AIs near the cutting edge as we have today, so it will at least weakly qualify as a multipolar scenario.

This scenario seems to imply wide variation among leading AIs in how wisely they evaluate their own abilities.

That suggests that before there's an AGI capable of taking over the world, there will be an AGI that mistakenly believes it can take over the world. Given mildly pessimistic assumptions about the goals of the leading AGIs, this AGI will attempt to conquer the world before one of the others does.

I'm imagining an AGI comparable to a child with an IQ of 200 and unusual access to resources.

This AGI would be better than, say, the US government or Google at tasks where it can get decent feedback through experience, such as manipulating public opinion.

Tasks such as fighting wars seem likely to be harder for it to handle, since they require causal models that are hard to test. The first AGIs seem likely to be somewhat weak at building such causal models relative to their other IQ-like capabilities.

So I imagine one or two AGIs might end up being influenced by less direct evidence, such as Eliezer's claims about the ease with which a superintelligence could create Drexlerian nanotech.

I can imagine a wide range of outcomes in which either humans shut down the AGI or the AGI dies because it cannot maintain its technology:

  • a fire alarm

  • progress is set back by a century

  • enough humans die that some other primate species ends up building the next AGI

  • Earth becomes lifeless

Some of you will likely object that an AGI will be wise enough to be better calibrated about its abilities than a human is. That will undoubtedly become true if the AGI takes enough time to mature before taking decisive action. But a multipolar scenario can easily pressure an AGI to act before other AGIs with different values make similar attempts to conquer the world.

We’ve got plenty of evidence from humans that being well calibrated about predictions is not highly correlated with capabilities. I expect AGIs to be well calibrated on predictions that can be easily tested. But I don’t see what would make an early-stage AGI well calibrated about novel interactions with humans.

How much attention should we pay to this scenario?

My gut reaction is that, given my (moderately likely?) assumptions about takeoff speeds and the extent of multipolarity, something like this has more than a 10% chance of happening. Am I missing anything important?