People like Andrew Critch and Paul Christiano have criticized MIRI in the past for their “pivotal act” strategy. The latter can be described as “build superintelligence and use it to take unilateral world-scale actions in a manner inconsistent with existing law and order” (e.g. the notorious “melt all GPUs” example). The critics say (justifiably, IMO) that this strategy looks pretty hostile to many actors, can trigger preemptive actions against the project attempting it, and generally fosters mistrust.
Is there a good alternative? The critics tend to assume slow-takeoff multipolar scenarios, which makes the comparison with their preferred solutions somewhat “apples and oranges”. Suppose that we do live in a hard-takeoff singleton world: what then? One answer is “create a trustworthy, competent, multinational megaproject”. Alright, but suppose you can’t create a multinational megaproject, yet you can build aligned AI unilaterally. What is a relatively cooperative thing you can do that would still be effective?
Here is my proposed rough sketch of such a plan[1]:
Commit to not make anyone predictably regret supporting the project or not opposing it. This rule is the most important and the one I’m the most confident of by far. In an ideal world, it should be more-or-less sufficient in itself. But in the real world, it might still be useful to provide more tangible details, which the next items try to do.
Within the bounds of Earth, commit to obeying international law, and local law at least insofar as the latter is consistent with international law, with only two possible exceptions (see below). Notably, this allows for actions such as (i) distributing technology that cures diseases, reverses aging, produces cheap food, etc., and (ii) lobbying for societal improvements (but see the superpersuasion clause below).
Exception 1: You can violate any law if it’s absolutely necessary to prevent a catastrophe on the scale comparable with a nuclear war or worse, but only to the extent it’s necessary for that purpose. (e.g. if a lab is about to build unaligned AI that would kill millions of people and it’s not possible to persuade them to stop or convince the authorities to act in a timely manner, you can sabotage it.)[2]
Build space colonies. These space colonies will host utopian societies, and most people on Earth will be invited to emigrate there.
Exception 2: A person held in captivity in a manner legal according to local law, who faces the death penalty or is treated in a manner violating accepted international rules about treatment of prisoners, might be given the option to leave for the colonies. If they exercise this option, their original jurisdiction is permitted to exile them from Earth permanently and/or bar them from any interaction with Earth that can plausibly enable activities illegal according to that jurisdiction[3].
Commit to adequately compensate any economy hurt by emigration to the colonies or by any other disruption you cause. For example, if space emigration causes the loss of valuable labor, you can send robots to replace it.
Commit to not directly intervene in international conflicts or upset the balance of power by supplying military tech to any side, except in cases when it is absolutely necessary to prevent massive violations of international law and human rights.
Commit to only use superhuman persuasion when arguing towards a valid conclusion via valid arguments, in a manner that doesn’t go against the interests of the person being persuaded.
Importantly, this makes stronger assumptions about the kind of AI you can align than MIRI-style pivotal acts. Essentially, it assumes that you can directly or indirectly ask the AI to find good plans consistent with the commitments above, rather than directing it to do something much more specific. Otherwise, it is hard to use Exception 1 (see above) gracefully.
A more conservative alternative is to limit Exception 1 to catastrophes that would spill over to the space colonies (see next item).
It might be sensible to consider a more conservative version which doesn’t have Exception 2, even though the implications are unpleasant.
IMO it was a big mistake for MIRI to talk about pivotal acts without saying they should even attempt to follow laws.
“build superintelligence and use it to take unilateral world-scale actions in a manner inconsistent with existing law and order”
The whole point of the pivotal act framing is that you are looking for something you can do with the least advanced AI system. This means it’s definitely not a superintelligence. If you have an aligned superintelligence, I think that framing no longer makes sense. The problem the framing is trying to grapple with is that we want to somehow use AI to solve AI risk, and for that we want to use the very dumbest AI that suffices for a successful plan.
I know, this is what I pointed at in footnote 1. Although “dumbest AI” is not quite right: the sort of AI MIRI envisions is still very superhuman in particular domains, but is somehow kept narrowly confined to acting within those domains (e.g. designing nanobots). The rationale mostly isn’t assuming that at that stage it won’t be possible to create a full superintelligence, but assuming that aligning such a restricted AI would be easier. I have different views on alignment, leading me to believe that aligning a full-fledged superintelligence (sovereign) is actually easier (via PSI or something in that vein). On this view, we still need to contend with the question: what is the thing we will (honestly!) tell other people that our AI is actually going to do? Hence the above.
I always thought “you should use the least advanced superintelligence necessary”. I.e., in the not-real example of “melting all GPUs”, your system should be able to design nanotech advanced enough to target all GPUs in an open environment, which is a superintelligent task, while not being able to, say, reason about anthropics and decision theory.
I’m not particularly against pivotal acts. It seems plausible to me that someone will take one. It would not exactly shock me if Sam Altman himself planned to take one to prevent dangerous AGI. He is intelligent and therefore isn’t going to openly talk about considering them. But I don’t have any serious objection to them being taken if people are reasonable about it.
Suppose that we do live in a hard-takeoff singleton world, what then?
What sort of evidence are you envisioning that would allow us to determine that we live in a hard takeoff singleton world, and that the proposed pivotal act would actually work, ahead of actually attempting said pivotal act? I can think of a couple options:
We have no such evidence, but we can choose an act that is only pivotal if the underlying world model that leads you to expect a hard takeoff singleton world actually holds, and harmlessly fails otherwise.
Galaxy-brained game-theory arguments, of the flavor John von Neumann made when he argued for a preemptive nuclear strike on the Soviet Union.
Something else entirely.
My worry, given the things Yudkowsky has said like “I figured this stuff out using the null string as input”, is that the argument is closer to (2).
So to reframe the question:
Someone has done a lot of philosophical thinking, and come to the conclusion that something apocalyptically bad will happen in the near future. In order to prevent the bad thing from happening, they need to do something extremely destructive and costly that they say will prevent the apocalyptic event. What evidence do you want from that person before you are happy to have them do the destructive and costly thing?
I don’t have to know in advance that we’re in a hard-takeoff singleton world, or even that my AI will succeed in achieving those objectives. The only thing I absolutely have to know in advance is that my AI is aligned. What sort of evidence will I have for this? A lot of detailed mathematical theory, with the modeling assumptions validated by computational experiments and by knowledge from other fields of science (e.g. physics, cognitive science, evolutionary biology).
I think you’re misinterpreting Yudkowsky’s quote. “Using the null string as input” doesn’t mean “without evidence”, it means “without other people telling me parts of the answer (to this particular question)”.
I’m not sure what is “extremely destructive and costly” in what I described? Unless you mean the risk of misalignment, in which case, see above.
This was specifically in response to
The critics tend to assume slow-takeoff multipolar scenarios, which makes the comparison with their preferred solutions somewhat “apples and oranges”. Suppose that we do live in a hard-takeoff singleton world: what then?
It sounds like you do in fact believe we are in a hard-takeoff singleton world, or at least one in which a single actor can permanently prevent all other actors from engaging in catastrophic actions using a less destructive approach than “do unto others before they can do unto you”. Why do you think that describes the world we live in? What observations led you to that conclusion, and do you think others would come to the same conclusion if they saw the same evidence?
I think your set of guidelines from above is mostly[1] a good one, in worlds where a single actor can seize control while following those rules. I don’t think that we live in such a world, and honestly I can’t really imagine what sort of evidence would convince me that I do. Which is why I’m asking.
I think you’re misinterpreting Yudkowsky’s quote. “Using the null string as input” doesn’t mean “without evidence”, it means “without other people telling me parts of the answer (to this particular question)”.
Yeah, on examination of the comment section I think you’re right that by “from the null string” he meant “without direct social inputs on this particular topic”.
“Commit to not make anyone predictably regret supporting the project or not opposing it” is worrying only by omission—it’s a good guideline, but it leaves the door open for “punish anyone who failed to support the project once the project gets the power to do so”. To see why that’s a bad idea to allow, consider the situation where there are two such projects and you, the bystander, don’t know which one will succeed first.
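To see the force of the two-projects point, here is a toy expected-utility sketch of the bystander’s position (all payoffs and probabilities are invented for illustration; this is just a sketch of the incentive argument, not anything stated in the thread):

```python
# Toy model of the bystander's dilemma when two rival projects
# adopt a "punish whoever failed to support us" policy.
# All payoffs and probabilities below are invented for illustration.

p = 0.5          # bystander's credence that project A succeeds first
reward = 1.0     # payoff for having supported the eventual winner
punish = -10.0   # payoff for having failed to support the winner

# Expected utility of each stance if winners punish non-supporters:
eu_support_a = p * reward + (1 - p) * punish      # backed A, but B may win
eu_support_b = (1 - p) * reward + p * punish      # backed B, but A may win
eu_neutral = punish                               # punished whichever wins

# If both projects instead commit to never punishing non-supporters,
# staying neutral is safe:
eu_neutral_no_threats = 0.0

print(eu_support_a, eu_support_b, eu_neutral, eu_neutral_no_threats)
# → -4.5 -4.5 -10.0 0.0
```

Under the punishment policy every stance has negative expected value for the bystander, so rational bystanders are pushed toward preemptively opposing both projects; a credible no-punishment commitment removes that pressure, which is the point of the guideline.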
I don’t know whether we live in a hard-takeoff singleton world or not. I think there is some evidence in that direction, e.g. from thinking about the kind of qualitative changes in AI algorithms that might come about in the future, and their implications on the capability growth curve, and also about the possibility of recursive self-improvement. But, the evidence is definitely far from conclusive (in any direction).
I think that the singleton world is definitely likely enough to merit some consideration. I also think that some of the same principles apply to some multipolar worlds.
“Commit to not make anyone predictably regret supporting the project or not opposing it” is worrying only by omission—it’s a good guideline, but it leaves the door open for “punish anyone who failed to support the project once the project gets the power to do so”.
Yes, I never imagined doing such a thing, but I definitely agree it should be made clear. Basically: don’t make threats, i.e. don’t try to shape others’ incentives in ways that they would be better off precommitting not to go along with.
If you are capable of using AI to do a harmful and costly thing, like “melt GPUs”, you are in a hard-takeoff world.
Yeah, I’m not actually worried about the “melt all GPUs” example of a pivotal act. If we actually live in a hard takeoff world, I think we’re probably just hosed. The specific plans I’m worried about are ones that ever-so-marginally increase our chances of survival in hard-takeoff singleton worlds, at massive costs in multipolar worlds.
A full nuclear exchange would probably kill less than a billion people. If someone convinces themself that a full nuclear exchange would prevent the development of superhuman AI, I would still strongly prefer that person not try their hardest to trigger a nuclear exchange. More generally, I think having a policy of “anyone who thinks the world will end unless they take some specific action should go ahead and take that action, as long as less than a billion people die” is a terrible policy.
If someone convinces themself that a full nuclear exchange would prevent the development of superhuman AI
I think the problem here is “convinces themself”. If you are capable of triggering a nuclear war, you are probably capable of doing something else which is not that, if you put your mind to it.
Does the “something else which is not that but is in the same difficulty class” also accomplish the goal of “ensure that nobody has access to what you think is enough compute to build an ASI”? If not, I think that implies that the “anything that probably kills less than a billion people is fair game” policy is a bad one.
Why do you think that the space colonists would be able to create a utopian society just because they are not on Earth? You will still have all the same types of people up there as down here, and they will continue to exhibit the Seven Deadly Sins. They will just be in a much smaller and more fragile environment, most likely making the consequences of bad behavior worse than here on Earth.
They have superintelligence, the augmenting technologies that come of it, and the self-reflection that follows receiving those; they are not the same types of people.
It’s not because they’re not on Earth, it’s because they have a superintelligence helping them. Which might give them advice and guidance, take care of their physical and mental health, create physical constraints (e.g. that prevent violence), or even give them mind augmentation like mako yass suggested (although I don’t think that’s likely to be a good idea early on). And I don’t expect their environment to be fragile because, again, designed by superintelligence. But I don’t know the details of the solution: the AI will decide those, as it will be much smarter than me.
I would guess that getting space colonies to the kind of a state where they could support significant human inhabitation would be a multi-decade project, even with superintelligence? Especially taking into account that they won’t have much nature without significant terraforming efforts, and quite a few people would find any colony without any forests etc. to be intrinsically dystopian.
First, given nanotechnology, it might be possible to build colonies much faster.
Second, I think the best way to live is probably as uploads inside virtual reality, so terraforming is probably irrelevant.
Third, it’s sufficient that the colonists are uploaded or cryopreserved (via some superintelligence-vetted method) and stored someplace safe (whether on Earth or in space) until the colony is entirely ready.
Fourth, if we can stop aging and prevent other dangers (including unaligned AI), then a timeline of decades is fine.
Does it make sense to plan for one possible world, or do you think that the other possible worlds are being adequately planned for and it is only the fast unilateral takeoff that is currently neglected?
Limiting AI to operating in space makes sense. You might want to buy out or compensate existing space-launch capability in some way, as there would likely be less need for it.
Some recompense for the people who paused working on AI, or who were otherwise hurt in the build-up to AI, makes sense.
Also, trying to communicate ahead of time what a utopian vision of AI and humans might look like, so the cognitive stress isn’t too major, is probably a good idea to commit to.
Committing to support multilateral acts if unilateral acts fail is probably a good idea too. Perhaps even partnering with a multilateral effort so that effort on shared goals can be spread around?
Commit to only use superhuman persuasion when arguing towards a valid conclusion via valid arguments, in a manner that doesn’t go against the interests of the person being persuaded.
In this plan, how should the AI define what’s in the interest of the person being persuaded? For example, say you have a North Korean soldier who can be persuaded to quit for the West (at the risk of getting the shitty jobs most migrants have) or who can be persuaded to remain loyal to his bosses (at the risk of raising his children in the shitty country most North Koreans have): what set of rules would you suggest?