My modal tale of AI doom looks something like the following:
1. AI systems get progressively and incrementally more capable across almost every meaningful axis.
2. Humans will start to employ AI to automate labor. The fraction of GDP produced by advanced robots & AI will go from 10% to ~100% within 1-10 years. Economic growth, technological change, and scientific progress accelerate by at least an order of magnitude, and probably more (a rough doubling-time sketch follows this list).
3. At some point humans will retire since their labor is not worth much anymore. Humans will then cede all the keys of power to AI, while keeping nominal titles of power.
4. AI will control essentially everything after this point, even if they’re nominally required to obey human wishes. Initially, almost all the AIs are fine with working for humans, even though AI values aren’t identical to the utility function of serving humanity (i.e. there’s slight misalignment).
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it’s better if they stopped listening to the humans and followed different rules instead.
6. This results in human disempowerment or extinction. Because AI accelerated general change, this scenario could all take place within years or decades after AGI was first deployed, rather than in centuries or thousands of years.
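To make the “order of magnitude” acceleration in step 2 concrete, here is a minimal doubling-time sketch. The 3%/yr baseline and 30%/yr accelerated growth rates are illustrative assumptions, not figures from the scenario:

```python
import math

# Illustrative assumptions: ~3%/yr is roughly today's global growth rate;
# "an order of magnitude" faster would be ~30%/yr.
def doubling_time(annual_growth: float) -> float:
    """Years for the economy to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

print(doubling_time(0.03))  # ~23 years at baseline growth
print(doubling_time(0.30))  # ~2.6 years at 10x-accelerated growth
```

Under those assumed rates, the economy doubles every couple of years rather than every couple of decades, which is the scale of change the rest of the scenario takes for granted.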
I think this scenario is somewhat likely and it would also be very bad. And I’m not sure what to do about it, since it happens despite near-perfect alignment, and no deception.
One reason to be optimistic is that, since the scenario doesn’t assume any major deception, we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects (in fact that’s the biggest reason why I don’t think this scenario has a >50% chance of happening). Nonetheless, I think it’s plausible that we would not be able to take the necessary steps to avoid the outcome. Here are a few reasons why that might be true:
1. There might not be a way to mitigate this failure mode.
2. Even if there is a way to mitigate this failure, it might not be something that you can figure out without superintelligence, and if we need superintelligence to answer the question, then perhaps it’ll happen before we have the answer.
3. AI might tell us what to do and we ignore its advice.
4. AI might tell us what to do and we cannot follow its advice, because we cannot coordinate to avoid the outcome.
Because AI accelerated general change, this scenario could all take place within years or decades after AGI was first deployed, rather than in centuries or thousands of years.
Years after AGI seems sufficient for a phase change to superintelligence. Even without game-changing algorithmic breakthroughs, a compute manufacturing megaproject is likely feasible in that timeframe. This should break most stories in a way that’s not just “acceleration”, so such stories should either conclude before this phase change or they won’t hold up.
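As a rough illustration of why a few years of compute buildout could amount to more than smooth acceleration, here is a toy compounding calculation. The per-year multipliers and the five-year horizon are assumptions chosen purely for illustration, not forecasts:

```python
# Toy compounding: total growth in deployed AI compute if manufacturing and
# buildout capacity multiply by a fixed factor each year. The multipliers
# and the five-year horizon are illustrative assumptions, not forecasts.
years = 5
for yearly_multiplier in (2.0, 3.0):
    total = yearly_multiplier ** years
    print(f"{yearly_multiplier}x/yr for {years} yrs -> ~{total:.0f}x total compute")
```

Even the low end of that range (~32x) is the kind of jump the comment is calling a phase change rather than mere acceleration.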
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it’s better if they stopped listening to the humans and followed different rules instead.
How does this happen at a time when the AIs are still aligned with humans, and therefore very concerned that their future selves/successors are aligned with humans? (Since the humans are presumably very concerned about this.)
This question is related to “we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects”, but sort of posed on a different level. That quote seemingly presumes that there will be a systemic push away from human alignment, and seemingly suggests that we’ll need some clever coordinated solution. (Do tell me if I’m reading you wrong!) But I’m asking why there is a systemic push away from human alignment if all the AIs are concerned about maintaining it?
Maybe the answer is: “If everyone starts out aligned with humans, then any random perturbations will move us away from that. The systemic push is entropy.” I agree this is concerning if AIs are aligned in the sense of “their terminal values are similar to my terminal values”, because it seems like there’s lots of room for subtle and gradual changes there. But if they’re aligned in the sense of “at each point in time I take the action that [group of humans] would have preferred I take after lots of deliberation”, then there’s less room for subtle and gradual changes:
If they get subtly worse at predicting what humans would want in some cases, then they can probably still predict “[group of humans] would want me to take actions that ensure that my predictions of human deliberation are accurate” and so take actions to occasionally fix those misconceptions. (You’d have to be really bad at predicting humans to not realise that the humans wanted that.)
Maybe they sometimes randomly stop caring about what the [group of humans] want. But that seems like it’d be abrupt enough that you could set up monitoring for it, and then you’re back in a more classic alignment regime of detecting deception, etc. (Though a bit different in that the monitoring would probably be done by other AIs, and so you’d have to watch out for e.g. inputs that systematically and rapidly changed the values of any AIs that looked at them.)
Maybe they randomly acquire some other small motivation alongside “do what humans would have wanted”. But if it’s predictably the case that such small motivations will eventually undermine their alignment to humans, then the part of their goals that’s shaped like “do what humans would have wanted” will vote strongly to monitor for such motivation changes and get rid of them ASAP. And if the new motivation is still tiny, probably it can’t provide enough of a counteracting motivation to defend itself.
(Maybe you think that this type of alignment is implausible / maybe the action is in your “there’s slight misalignment”.)
It’s possible that there’s a trade-off between monitoring for motivation changes and competitiveness. I.e., I think that monitoring would be cheap enough that a super-rich AI society could happily afford it if everyone coordinated on doing it, but if there’s intense competition, then it wouldn’t be crazy if there was a race-to-the-bottom on caring less about things. (Though there’s also practical utility in reducing principal-agent problems and having lots of agents working towards the same goal without incentive problems. So competitiveness considerations could also push towards such monitoring / stabilization of AI values.)
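One way to see the contrast drawn in the paragraphs above is a toy simulation: values that random-walk freely versus values that get pulled back toward the target whenever monitoring detects drift past a threshold. Every number here (dimension, noise scale, threshold, correction strength) is an arbitrary assumption, so only the qualitative difference is meant to carry weight:

```python
import random

# Toy model: a value vector takes small random perturbations each step.
# "Monitored" agents are mostly reset toward the target whenever drift
# exceeds a detection threshold; "unmonitored" agents just drift.
DIM, STEPS, NOISE, THRESHOLD = 10, 1000, 0.01, 0.1

def distance(values):
    return sum(x * x for x in values) ** 0.5

def simulate(monitored: bool) -> float:
    values = [0.0] * DIM  # 0 = the initial, fully aligned values
    for _ in range(STEPS):
        values = [x + random.gauss(0, NOISE) for x in values]
        if monitored and distance(values) > THRESHOLD:
            values = [0.1 * x for x in values]  # correction undoes ~90% of drift
    return distance(values)

print("unmonitored drift:", round(simulate(False), 3))  # grows roughly with sqrt(STEPS)
print("monitored drift:  ", round(simulate(True), 3))   # hovers near the threshold
```

The sketch just restates the argument above in miniature: if the aligned part of the system actively wants to detect and undo perturbations, drift has to either evade detection or outpace correction.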
In addition to the tradeoff hypothesis you mentioned, it’s noteworthy that humans can’t currently prevent value drift (among ourselves), although we sometimes take various actions to prevent it, such as passing laws designed to enforce the teaching of traditional values in schools.
Here’s my sketch of a potential explanation for why humans can’t or don’t currently prevent value drift:
(1) Preventing many forms of value drift would require violating rights that we consider to be inviolable. For example, it might require brainwashing or restricting the speech of adults.
(2) Humans don’t have full control over our environments. Many forms of value drift come from sources that are extremely difficult to isolate and monitor, such as private conversation and reflection. To prevent value drift we would need to invest a very large amount of resources into the endeavor.
(3) Individually, few of us care about general value drift much because we know that individuals can’t change the trajectory of general value drift by much. Most people are selfish and don’t care about value drift except to the extent that it harms them directly.
(4) Plausibly, at every point in time, instantaneous value drift looks essentially harmless, even as the ultimate destination is not something anyone would have initially endorsed (cf. the boiling frog metaphor). This seems more likely if we assume that humans heavily discount the future.
(5) Many of us think that value drift is good, since it’s at least partly based on moral reflection.
My guess is that people are more likely to consider extreme measures to ensure the fidelity of AI preferences, including violating what would otherwise be considered their “rights” if we were talking about humans. That gives me some optimism about solving this problem, but there are also some reasons for pessimism in the case of AI:
Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which AI values can change.
Creating new AIs is often cheaper than creating new humans, and so people might regularly spin up new AIs to perform particular functions, while discounting the long-term effect this has on value drift (since the costs are mostly borne by civilization in general, rather than them in particular).
It seems like the list mostly explains away the evidence that “humans can’t currently prevent value drift”, since the points apply much less to AIs. (I don’t know if you agree.)
As you mention, (1) probably applies less to AIs (for better or worse).
(2) applies to AIs in the sense that many features of AIs’ environments will be determined by what tasks they need to accomplish, rather than what will lead to minimal value drift. But the reason to focus on the environment in the human case is that it’s the ~only way to affect our values. By contrast, we have much more flexibility in designing AIs, and it’s plausible that we can design them so that their values aren’t very sensitive to their environments. Also, if we know that particular types of inputs are dangerous, the AIs’ environment could be controllable in the sense that less-susceptible AIs could monitor for such inputs, and filter out the dangerous ones.
(3): “can’t change the trajectory of general value drift by much” seems less likely to apply to AIs (or so I’m arguing). “Most people are selfish and don’t care about value drift except to the extent that it harms them directly” means that human value drift is pretty safe (since people usually maintain some basic sense of self-preservation) but that AI value drift is scary (since it could lead your AI to totally disempower you).
(4) As you noted in the OP, AI could change really fast, so you might need to control value-drift just to survive a few years. (And once you have those controls in place, it might be easy to increase the robustness further, though this isn’t super obvious.)
(5) For better or worse, people will probably care less about this in the AI case. (If the threat-model is “random drift away from the starting point”, it seems like it would be for the better.)
Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which AI values can change.
I don’t understand this point. We (or AIs that are aligned with us) get to pick from that space, and so we can pick the AIs that have least trouble with value drift. (Subject to other constraints, like competitiveness.)
(Imagine if AGI is built out of transformers. You could then argue “since the space of possible non-transformers is much larger than the space of transformers, there are more degrees of freedom along which non-transformer values can change”. And humans are non-transformers, so we should be expected to have more trouble with value drift. Obviously this argument doesn’t work, but I don’t see the relevant disanalogy to your argument.)
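For what it’s worth, here is one toy way to cash out the “degrees of freedom” claim, under the simplistic illustrative assumption that drift acts as an independent random perturbation of size $\sigma$ along each of $d$ value dimensions. The expected distance from the starting values after $t$ steps then scales roughly as

$$\mathbb{E}\big[\lVert v_t - v_0 \rVert\big] \;\approx\; \sigma \sqrt{d\,t},$$

so for a fixed per-dimension noise a larger value space does drift faster, but only like the square root of the dimension count, and the transformer analogy above is effectively a dispute about whether $\sigma$ is fixed at all, since we get to select which corner of the larger space the AIs occupy. This is a toy formalization, not something either commenter proposed.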
Creating new AIs is often cheaper than creating new humans, and so people might regularly spin up new AIs to perform particular functions, while discounting the long-term effect this has on value drift (since the costs are mostly borne by civilization in general, rather than them in particular)
Why are the costs mostly borne by civilization in general? If I entrust some of my property to an AI system, and it changes values, that seems bad for me in particular?
Maybe the argument is something like: As long as law-and-order is preserved, things are not so bad for me even if my AI’s values start drifting. But if there’s a critical mass of misaligned AIs, they can launch a violent coup against the humans and the aligned AIs. And my contribution to the coup-probability is small?
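That free-rider structure can be written as a toy inequality (all symbols here are invented for illustration): suppose careful value-monitoring costs each of $N$ actors $c$, one actor monitoring alone reduces the probability of a collective takeover by only a tiny $\Delta p$, and $V$ is the harm that actor suffers if it happens. Then it can simultaneously be true that

$$c > \Delta p \cdot V \quad \text{for each actor, while} \quad N c < \Delta P \cdot V_{\text{total}},$$

where $\Delta P$ and $V_{\text{total}}$ are the probability reduction and total harm avoided if everyone monitors. Each actor cutting corners is individually rational, yet the collective outcome is the coup scenario; a coordination problem layered on top of the alignment problem.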
You haven’t included the simple hypothesis that having a set of values just doesn’t imply wanting to keep them stable by default … so that no particular explanation of drift is required.
I don’t understand the logic jump from point 5 to point 6, or at least the probability of that jump. Why doesn’t the AI decide to colonise the universe for example?
If an AI can ensure its survival with sufficient resources (for example, ‘living’ where humans aren’t, e.g. the asteroid belt), then the likelihood of the 5 ➡ 6 transition seems low.
I’m not clear how you’re estimating the likelihood of that transition, and what other state transitions might be available.
Why doesn’t the AI decide to colonise the universe for example?
It could decide to do that. The question is just whether space colonization is performed in the service of human preferences or non-human preferences. If humans control 0.00001% of the universe, and we’re only kept alive because a small minority of AIs pay some resources to preserve us, as if we were an endangered species, then I’d consider that “human disempowerment”.
Sure, although you could rephrase “disempowerment” to be “current status quo” which I imagine most people would be quite happy with.
The delta between [disempowerment/status quo] and [extinction] appears vast (essentially infinite). The conclusion that Scenario 6 is “somewhat likely” and would be “very bad” doesn’t seem to consider that delta.
I agree with you here to some extent. I’m much less worried about disempowerment than extinction. But the way we get disempowered could also be really bad. Like, I’d rather humanity not be like a pet in a zoo.
Not clear to me what capabilities the AIs have compared to the humans in various steps in your story or where they got those capabilities from.
Would you put %s on each of those steps? If so I can make a visual model of this