Right now, every powerful intelligence (e.g. nation-states) is built out of humans, so the only way for such organizations to thrive is to make sure the constituent humans thrive, for instance by ensuring food, clean air and access to accurate information.
AI is going to loosen up this default pull. If we are limited to reflex-based tool AIs like current LLMs, probably we’ll make it through just fine, but if we start doing wild adversarial searches that combine tons of the tool-like activities into something very powerful and autonomous, these can determine ~everything about the world. Unless all winners of such searches actively promote human thriving in their search instead of just getting rid of humanity or exploiting us for raw resources, we’re doomed.
There’s lots of places where we’d expect adversarial searches to be incentivized, most notably:
War/national security
Crime and law enforcement
Propaganda to sway elections
Market share acquisition for companies (not just in advertising but also in undermining competitors)
The current situation for war/national security is already super precarious due to nukes, and I tend to reason by an assumption that if a nuke is used again then that’s going to be the end of society. I don’t know whether that assumption is true, but insofar as it is reasonable, it becomes natural to think of AI weapons as Nuke 2.0.
The situation with nukes suggests that maybe sometimes we can have an indefinite holdoff on using certain methods, but again it already seems quite precarious here, and it’s unclear how to generalize this to other cases. For instance, outlawing propaganda would seem to interfere with free speech, and enforcing such laws without tyrannizing people who are just confused seems subtle.
So a plausible model seems to me to be, people are gradually developing ways of integrating computers with the physical world, by giving them deeper knowledge of how the world works and more effective routines for handling small tasks. Right now, this interface is very brittle and so breaks down when even slight pressure is applied to it, but as it gets more and more robust and well-understood, it becomes more and more feasible to run searches over it to find more powerful activities. In non-adversarial circumstances, such searches don’t have to ensure robustness or completeness and thus can just “do the thing” you’re asking them to, but in adversarial circumstances, the adversaries will exploit your weakness and so you actually have to do it similar to a dangerous utility-maximizer.
I guess I should add:
I’m rejecting the notion that humans thrive because humans individually value thriving. Like, some humans do so, to some extent, but there’s quite a few mentally ill people who act in self-destructive ways. Historically, sometimes you end up with entire nations bound up in self-destructive vindictiveness (Gaza being a modern example). The ultimate backstop that keeps this limited is the fact that those who prioritize their own people’s thriving as the #1 priority are much better at birthing the next generation and winning wars.
Of course this backstop generates intermediate moralities that do make humans more directly value their own thriving, so there’s still some space for intentional human choice.
This suggests that the threat model isn’t so much “very intelligent AI” as it is “very cheap and at least somewhat capable robots”.
Kind of, though we already have mass production for some things, and it hasn’t led to the end of humanity, partly because someone has to maintain and program those robots. But obesity rates have definitely skyrocketed, presumably partly because of our very cheap and somewhat capable robots.
I realize I should probably add a 3rd category of argument: arguments which assume a specific (unlikely) path for AGI development and then argue this particular path is bad.
This is an improvement over “bad” arguments (in the sense that it’s at least a logical sequence of argumentation rather than a list of claims), but unlikely to move the needle for me, since the specific sequence involved is unlikely to be true.
Ideally, what one would like to do is “average over all possible paths for AGI development”. But I don’t know of a better way to do that average than to just use an expert-survey/prediction market.
Let’s talk in detail about why this particular path is improbable, by trying to write it as a sequence of logical steps:
“Right now, every powerful intelligence (e.g. nation-states) is built out of humans, so the only way for such organizations to thrive is to make sure the constituent humans thrive”
this is empirically false. genocide and slavery have been the norm across human history. We are currently in the process of modifying our atmosphere in a way that is deadly to humans and almost did so recently in the past
“AI is going to loosen up this default pull.”
this assumes a specific model for AI: humans use the AI to do highly adversarial search and then blindly implement the results. Suppose instead humans only implement the results after verifying them, or require the AI to provide a mathematical proof that “this action won’t kill all humans”
“There’s lots of places where we’d expect adversarial searches to be incentivized”
none of these are unique to AGI. We have the same problem with nuclear weapons, biological weapons and any number of other technologies. AGI is uniquely friendly in the sense that at first it’s merely software: it has no impact on the real world unless we choose to let it
“The current situation for war/national security is already super precarious due to nukes, and I tend to reason by an assumption that if a nuke is used again then that’s going to be the end of society. ”
How is this an argument for AGI risk?
“and it’s unclear how to generalize this to other cases. For instance, outlawing propaganda would seem to interfere with free speech”
Something being unclear is not an argument for doom. At best it’s a restatement of my original weak argument: AGI will be powerful, therefore it might be bad
“So a plausible model seems to me to be, people are gradually developing ways of integrating computers with the physical world, by giving them deeper knowledge of how the world works and more effective routines for handling small tasks. ”
even if this is a plausible model, it is by no means the only model or the default path.
“but as it gets more and more robust and well-understood, it becomes more and more feasible to run searches over it to find more powerful activities.”
it is equally plausible (in my opinion more so) that there is a limit to how far ahead intelligence can predict and science is fundamentally rate-limited by the speed of physical experimentation
“thus can just “do the thing” you’re asking them to, but in adversarial circumstances, the adversaries will exploit your weakness ”
why are we assuming the adversaries will exploit your weakness? Why not assume we build corrigible AI that tries to help you instead.
“similar to a dangerous utility-maximizer.”
A utility-maximizer is a specific design of AGI, and moreover totally different from the next-token-prediction AIs that currently exist. Why should I assume that this particular design will suddenly become popular (despite the clear disadvantages that you have already stated)?
this is empirically false. genocide and slavery have been the norm across human history.
You shouldn’t mix up conflicts between different human groups with an inability of humans to thrive. The fact that there has been a human history at all requires people to have the orientation to know what’s going on, the capacity to act on it, and the care to do so. Humanity hasn’t just given up or committed suicide, leaving just a nonhuman world.
Now it’s true that generally, there was a self-centered thriving that favored the well-being of oneself and one’s friends and family over others, and this would lead to various sorts of conflicts, often wrecking a lot of good people. We can only hope society becomes more discriminating over time, to better nurture the goodness and only destroy the badness. But you can only say that genocide was bad because there was something that created good people who it was wrong to kill.
We are currently in the process of modifying our atmosphere in a way that is deadly to humans and almost did so recently in the past
But critically, various historical environmental problems led to the creation of environmentalist groups, which enabled society to notice these atmospheric problems. Contrast this with earlier environmental changes against which there was no such protection.
> AI is going to loosen up this default pull.
this assumes a specific model for AI: humans use the AI to do highly adversarial search and then blindly implement the results.
You are misunderstanding. By “loosen up this default pull”, I mean, let’s say you implement a bot to handle food production, from farm to table. Right now, food production needs to be human-legible because it involves a collaborative effort between lots of people. With the bot, even if it handles food production perfectly fine, you’ve now removed the force that generates human legibility for food production.
As you remove human involvement from more and more places, humans become able to do fewer and fewer things. Maybe humans can still thrive under such circumstances, but surely you can see that strong people have a by-default better chance than weak people do? Notably, this is a separate part of the argument from adversarial search, and it applies even if we limit ourselves to reflex-like methods. The point here is to highlight what currently allows humans to thrive, and how that gets weakened by AI.
Suppose instead humans only implement the results after verifying them, or require the AI to provide a mathematical proof that “this action won’t kill all humans”
none of these are unique to AGI. We have the same problem with nuclear weapons, biological weapons and any number of other technologies. AGI is uniquely friendly in the sense that at first it’s merely software: it has no impact on the real world unless we choose to let it
If you wait until humans have manually checked them all through, then you incentivize adversaries to develop military techniques that can destroy your country faster than you can wake up your interpretability researchers. (I expect this to be possible with only weak, reflex-based AI, like if you set up a whole bunch of automated bots to wreak havoc in various ways once triggered.)
How is this an argument for AGI risk?
It’s not; it’s registering my assumption in case you want to object to it. If you think nukes might be used in a more limited way, then maybe you also think adversarial searches might be used in a more limited way.
Something being unclear is not an argument for doom. At best it’s a restatement of my original weak argument: AGI will be powerful, therefore it might be bad
Registering that something is unclear is helpful for deciding where to take the discussion. Like if we agreed on the overall picture, but you were more optimistic about the areas that were unclear, and I was more pessimistic about them, then I could continue the argument into those areas as well. I’m sort of trying to comprehensively enumerate all the relevant dynamics for how this is going to develop, and to explicitly mark off the places that are relevant to consider but which I haven’t properly addressed.
Right now, though, you seem to be assuming that humans by-default thrive, and only exogenous dangers like war or oppression can prevent this. Meanwhile, I’m more using a sort of “inertial” model, where certain neuroses can drive humans to spontaneously self-destruct, sometimes taking a lot of their neighbors with them. As such it seems less relevant to explore these subtrees until the issue of self-destructive neuroses is addressed.
even if this is a plausible model, it is by no means the only model or the default path.
Looks like the default path to me? Like AI companies are dumping lots of knowledge and skills into LLMs, for instance, and at my job we’ve started integrating them with our product. Are there any other relevant dynamics you are seeing?
it is equally plausible (in my opinion more so) that there is a limit to how far ahead intelligence can predict and science is fundamentally rate-limited by the speed of physical experimentation
You need physical experimentation to test how well your methods for unleashing energy/flow in a particular direction work, so building reflex-like/tool AIs is going to be fundamentally rate-limited by the speed of physical experimentation.
However, as you build up a library of tricks to interact with the world, you can use compute to search through ways to combine these tricks to make bigger things happen. This is generally bounded by whatever the biggest “energy source” you can tap into is, because it is really hard to bring multiple different “energy sources” together into some shared direction.
why are we assuming the adversaries will exploit your weakness? Why not assume we build corrigible AI that tries to help you instead.
We’ll build corrigible AI that tries to help us with ordinary stuff like transporting food from farms to homes.
However, the more low-impact it is, the more exploitable it is. If you want food from a self-driving truck, maybe you could just stand in front of it, and it will stop, and then some of your friends can break into it and steal the food it is carrying.
To prevent this, we need to incapacitate criminals. But criminals don’t want to be incapacitated, so they will exploit whatever weaknesses the system for incapacitating them has. As part of this, the more advanced criminals will presumably build AIs to try to seek out weaknesses in the system. That’s what I’m referring to with adversaries exploiting your weakness.
A utility-maximizer is a specific design of AGI, and moreover totally different from the next-token-prediction AIs that currently exist. Why should I assume that this particular design will suddenly become popular (despite the clear disadvantages that you have already stated)?
Being robust to exploitation from adversaries massively restricts your options. Whether the exact implementation includes an explicit utility function or not is less relevant than the fact that as it spontaneously adapts to undermine its adversaries, it needs to do so in a way that doesn’t undermine humanity in general. I.e. you need to build some system that can unleash massive destruction towards sufficiently unruly enemies, without unleashing massive destruction towards friends. I think the classic utility maximizer instrumental convergence risk gives a pretty accurate picture for how that will look / how that gives you dangers, but if you think next-token-predictors can unleash destruction in a more controlled way, I’m all ears.
I realize I should probably add a 3rd category of argument: arguments which assume a specific (unlikely) path for AGI development and then argue this particular path is bad.
Any path for history needs to account for security and resource flow/allocation. These are the most important parts of everything. My position doesn’t really assume that much beyond this.
The fact is that there are certain robust resources (like sunlight etc.) which exert constant pressure on the world, and which everything is dependent on. Whatever happens, these resources must go somewhere, so any forecast for the future that’s worth its salt must ultimately make predictions about those.
Each part of my argument addresses a different factor involved in these resource flows. Often you can just inspect the world and see that clearly that’s how the resources are flowing. Other times, my argument is disjunctive. Yet other times, sure maybe I’m wrong, but the way I might be wrong would imply the possibility of a lot of resources rushing out into some other channel, which again is worth exploring.
Plus, let’s remember, Strong Evidence Is Common. If there are some particular parts of the argument where you don’t know how to inspect the world to get plenty of evidence, then I can try to guide you. But blinding yourself because of “muh evidence” just makes your opinion worthless.
Ideally, what one would like to do is “average over all possible paths for AGI development”. But I don’t know of a better way to do that average than to just use an expert-survey/prediction market.
How about, whatever method the experts or the prediction market participants are using, but done better?
Making a point-by-point refutation misses the broader fact that any long sequence of argument like this adds up to very little evidence.
Even if you somehow convince me that each of your (10) arguments was like 75% true, they’re still going to add up to nothing, because 0.75^10 ≈ 0.056.
Unless you can summarize your argument in at most 2 sentences (with evidence), it’s completely ignorable.
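For concreteness, here is a minimal sketch (not from the thread) of the arithmetic being invoked above. It assumes the ten claims are independent and that the overall argument holds only if every claim holds, i.e. a pure conjunction; whether that framing is the right way to score the argument is exactly what the replies below dispute.

```python
# Minimal sketch (not from the thread): the "multiply the steps" arithmetic.
# Assumes ten independent claims, each held at 75% credence, and that the
# overall argument is true only if every claim is true (a pure conjunction).

def conjunction_probability(p_per_claim: float, n_claims: int) -> float:
    """Probability that all n independent claims hold, each with probability p."""
    return p_per_claim ** n_claims

print(f"0.75 ** 10 = {conjunction_probability(0.75, 10):.3f}")  # prints 0.056
```

The multiplication itself is not in question; the disagreement that follows is over whether a long argument should be scored as an independent conjunction at all.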
This is not how learning any (even slightly complex) topic works.
Yudkowsky 2017, AronT 2023 and Gwern 2019, if you’re curious why you’re getting downvoted.
(I tried to figure out whether this method of estimation works, and it seemed more accurate than I thought, but then I got distracted).
Cope. Here you’re taking a probabilistic perspective, but that perspective sucks.