Huh. I’m under the impression that “offense-defense balance for technology-inventing AGIs” is also a big cruxy difference between you and Eliezer.
Specifically: if almost everyone is making helpful aligned norm-following AGIs, but one secret military lab accidentally makes a misaligned paperclip maximizer, can the latter crush all competition? My impression is that Eliezer thinks yes: there’s really no defense against self-replicating nano-machines, so the only paths to victory are absolutely perfect compliance forever (which he sees as implausible, given secret military labs etc.) or someone uses an aligned AGI to do a drastic-seeming pivotal act in the general category of GPU-melting nanobots. Whereas you disagree.
Sorry if I’m putting words in anyone’s mouth.
For my part, I don’t have an informed opinion about offense-defense balance, i.e., whether more-powerful-and-numerous aligned AGIs can defend against one paperclipper born in a secret military lab accident. I guess I’d have to read Drexler’s nano book or something. At the very least, I don’t see it as a slam dunk in favor of Team Aligned; I see it as a question that could go either way.
I agree that is also moderately cruxy (but less so, at least for me, than “high-capabilities alignment is extremely difficult”).
One datapoint I really liked about this: https://arxiv.org/abs/2104.03113 (Scaling Scaling Laws with Board Games). They train AlphaZero-style agents of different sizes to compete on the game Hex. The approximate takeaway, quoting the author: “if you are in the linearly-increasing regime [where return on compute is nontrivial], then you will need about 2× as much compute as your opponent to beat them 2/3 of the time.”
This might suggest that, absent additional asymmetries (like constraints on the aligned AIs that massively hamper them), the odds of winning are roughly proportional to the compute ratio. If you assume we can get global governance of data centers, I’d consider that a sign in favor of the world’s governments. (Whether that’s good is a political question that I expect folks here to disagree on.)
Bonus quote: “This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2/3. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.”
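As a quick sanity check of that toy model (my own sketch, not from the paper), here’s a simulation of two players who each draw as many uniform random numbers as they have “compute”, with the higher maximum winning. With a 2:1 compute ratio the stronger player wins about 2/3 of the time; in general its win probability comes out to c_a / (c_a + c_b).

```python
import random

def toy_match(c_a, c_b, trials=100_000):
    """Toy model from the quote: each player draws as many uniform random
    numbers as it has compute; the player holding the highest number wins."""
    wins_a = 0
    for _ in range(trials):
        best_a = max(random.random() for _ in range(c_a))
        best_b = max(random.random() for _ in range(c_b))
        if best_a > best_b:
            wins_a += 1
    return wins_a / trials

print(toy_match(2, 1))    # ~0.667: 2x compute -> ~2/3 win rate
print(toy_match(10, 5))   # ~0.667: only the ratio matters
print(toy_match(3, 1))    # ~0.75: 3x compute -> ~3/4 win rate
```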
Offense is favored over defense because, e.g., one AI can just nuke the other. The asymmetry comes from physics: you can’t physically build shields that are more resilient than the strongest shield-destroying tech. Absent new physics, extra intelligence doesn’t fundamentally change this dynamic, though it can buy you more time in which to strike first.
(E.g., being smarter may let you think faster, or may let you copy yourself to more locations so it takes more time for nukes or nanobots to hit every copy of you. But it doesn’t let you build a wall where you can just hang out on Earth with another superintelligence and not worry about the other superintelligence breaking your wall.)
I want to push back on your “can’t make an unbreakable wall” metaphor. We have something like that unbreakable wall today, with two super-powerful beings just hanging out sharing Earth; it’s called survivable nuclear second-strike capability.
(For clarity, here I’ll assume that aligned AGI-cohort A and unaligned AGI-cohort B have both FOOMed and have nanotech.) There isn’t obviously an infinite amount of energy available for B to destroy every last trace of A. This is just like how, in our world, neither the US nor Russia has enough resources to be certain it could destroy all of its opponent’s nuclear capabilities in a first strike. If any American nuclear capabilities survive a Russian first strike, the remaining American forces’ objective switches from “uphold the Constitution” to “destroy the enemy no matter the cost, to follow through on tit-for-tat”. Humans are notoriously bad at this kind of precommitment-to-revenge-amid-the-ashes-of-civilization, but AGIs and their nanotech can probably be much more credible.
Note the key thing here: once B attempts to destroy A, A is no longer “bound” by the constraints of being an aligned agent. Its objective function switches to being just as ruthless as B’s (or more so), and so raw post-first-strike power/intelligence on each side becomes a much more reasonable predictor of who will win.
If B knows A is playing tit-for-tat, and A has done the rational thing of creating a trillion redundant copies of itself (each of which will also play tit-for-tat) so they couldn’t all be eliminated in one strike without prior detection, then B has a clear incentive not to pick a fight it is highly uncertain it can win.
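A toy calculation in support of that (my own, with made-up numbers): even if each individual copy of A is very likely to be destroyed in B’s first strike, the probability that at least one of N copies survives approaches 1 extremely fast as N grows, so B should expect retaliation with near-certainty. (This treats copies’ survival as independent, which a sufficiently coordinated or undetectable attack, as in the counterargument below, would try to defeat.)

```python
import math

def p_any_survivor(p_single, n_copies):
    """P(at least one of n_copies survives) when each copy independently
    survives the first strike with probability p_single."""
    # 1 - (1 - p)^N, computed in log space to avoid underflow for huge N.
    return 1.0 - math.exp(n_copies * math.log1p(-p_single))

for p in (1e-6, 1e-9):
    for n in (10**6, 10**9, 10**12):
        print(f"per-copy survival {p:g}, copies {n:.0e}: "
              f"P(some copy survives) = {p_any_survivor(p, n):.6f}")
```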
One counterargument you might have: maybe offensive/undetectable nanotech is strictly favored over defensive/detection nanotech. If you assign nontrivial probability to the statement: “it is possible to destroy 100% of a nanotech-wielding defender with absolutely no previously-detectable traces of offensive build-up, even though the defender had huge incentives to invest in detection”, then my argument doesn’t hold. I’d definitely be interested in your (or others’) justification as to why.
Consequences of this line of argument:
- FOOMing doesn’t imply unipolarity: there can be multiple AGIs over the long term, some aligned and some not.
- Relative resource availability of these AGIs may be a valid proxy for their leverage in the implicit negotiation they’ll constantly carry out instead of going to war. (Note that this is actually how great-power conflicts get settled in our current world!)
- Failure derives primarily from a solo unaligned AGI FOOMing first. Thus, investing in detection capabilities is paramount.
- This vision of the future also makes the incentives less bad for race-to-AGI actors than the classical “pivotal act” vision, where whoever wins takes over the world and installs a world government. Actors might be less prone to an arms race if they don’t think of it as all-or-nothing. It also suggests additional coordination strategies, like the US and China agreeing to carry their current balance of power into the AGI age (e.g. via implicit nuclear-posture negotiation), with the resulting terms including invasive disclosure of each other’s AGI projects and alignment strategies.
- On the other hand, this means that if we get at least one unaligned AGI FOOM, Moloch may follow us to the stars.
Very interested to hear feedback! (/whether I should also put this somewhere else.)
Yeah, I wanted to hear your actual thoughts first, but I considered going into four possible objections:
1. If there’s no way to build a “wall”, perhaps you can still ensure a multipolar outcome via the threat of mutually assured destruction.
2. If MAD isn’t quite an option, perhaps you can still ensure a multipolar outcome via “mutually assured severe damage”: perhaps both sides would take quite a beating in the conflict, such that they’ll prefer to negotiate a truce rather than actually attack each other.
3. If an AGI wanted to avoid destruction, perhaps it could just flee into space at some appreciable fraction of the speed of light.
4. In principle, it should be possible to set up MAD, or set up a tripwire that destroys whichever AGI tries to aggress first. E.g., just design the two AGIs yourself, and have a deep enough understanding of their brains that you can stably make them self-destruct as soon as their brain even starts thinking of ways to attack the other AGI (or to self-modify to evade the tripwire, etc.). And since this is possible in principle, perhaps we can achieve a “good enough” version of this in practice.
I don’t think MAD is an option. “MAD” in the case of humans really means “Mutually Assured Heavy Loss Of Life Plus Lots Of Infrastructure Damage”. MAD in real life doesn’t assume that a specific elected official will die in the conflict, much less that all humans will die.
For MAD to work with AGI systems, you’d need to ensure that both AGIs are actually destroyed in arbitrary conflicts, which seems effectively impossible. (Both sides can just launch back-ups of themselves into space.)
With humans, you can bank on the US Government (treated as an agent) having a sentimental attachment to its citizens, such that it doesn’t want to trade away tons of lives for power. Also, a bruised and bloodied US Government that just survived an all-out nuclear exchange with Russia would legitimately have to worry about other countries rallying against it in its weakened, bombed-out state.
You can’t similarly bank on arbitrary AGIs having a sentimental attachment to anything on Earth (such that they can be held hostage by threats of damage to Earth), nor can you bank on arbitrary AGIs being crippled by conflicts they survive.
Option 2 seems more plausible, but still not very plausible. The amount of resources you can lose in a war on the scale of the Earth is just very small compared to the amount of resources at stake in the conflict. Values handshakes seem more plausible if two mature superintelligences meet in space, after already conquering large parts of the universe; then an all-out war might threaten enough of the universe’s resources to make both parties wary of conflict.
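To put a very rough number on “very small” (a back-of-envelope of mine, using standard order-of-magnitude figures, not anything from the discussion): even ignoring everything beyond our own galaxy, Earth is a vanishingly small fraction of the matter at stake.

```python
# Rough figures only; the real stake (the reachable universe) is many
# orders of magnitude larger than a single galaxy.
earth_mass_kg = 6e24          # mass of Earth
solar_mass_kg = 2e30          # mass of the Sun
stars_in_milky_way = 1e11     # order-of-magnitude star count

galaxy_stellar_mass_kg = solar_mass_kg * stars_in_milky_way
print(earth_mass_kg / galaxy_stellar_mass_kg)   # ~3e-17
```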
I don’t know how plausible option 3 is, but it seems like a fail condition regardless: spending the rest of your life fleeing from a colonization wave as fast as possible, with no time to gather resources or expand into your own thriving intergalactic civilization, means giving up nearly all of the future’s value and surrendering the cosmic endowment.
Option 4 seems extremely difficult to do, and a very strange thing to even try. If you have that much insight into your AGI’s cognition, you’ve presumably solved the alignment problem already and can stop worrying about all these complicated schemes. And long before one AGI could achieve such guarantees about another AGI (much less both achieve those guarantees about each other, somehow simultaneously?!), it would be able to proliferate nanotech to destroy any threats (that haven’t fled at near-light-speed, at least).
“B has a clear incentive not to pick a fight it is highly uncertain it can win.”

I don’t expect enough uncertainty for this. If the two sides in a dispute aren’t uncertain about who would win, then the stronger side will unilaterally choose to fight (though the weaker side obviously wouldn’t).
Agree that option 1 (literal mutually assured destruction) is implausible.
Option 2 is much more likely, primarily because who wins the contest is (in my model) sufficiently uncertain that choosing war destroys a lot of expected value even for the side that ends up winning. In other words, if choosing “war” means [say, a 50% probability of losing 99% of my utility over the next billion years, and a 50% probability of losing 0% of my utility], whereas choosing peace means [a 100% chance of achieving 60% of my utility] (assuming some positive-sum overlap between the respective objective functions), then the agents choose peace.
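Spelling out that arithmetic (a trivial check of the illustrative numbers above, which are of course placeholders):

```python
# Expected-utility comparison with the illustrative numbers above.
p_lose   = 0.5    # probability of losing the war
u_lose   = 0.01   # keep 1% of utility (lose 99%)
u_win    = 1.0    # keep 100% of utility (lose 0%)
u_peace  = 0.60   # certain 60% from the negotiated, positive-sum split

eu_war   = p_lose * u_lose + (1 - p_lose) * u_win   # = 0.505
eu_peace = u_peace                                  # = 0.600
print(eu_war, eu_peace, "peace preferred:", eu_peace > eu_war)
```

The more lopsided the expected war outcome (or the worse the peace deal), the more this flips toward war, which is why the amount of residual uncertainty matters so much.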
But this does depend on the existence of meaningful uncertainty even post-FOOM. What is your reasoning for why uncertainty would be so unlikely?
Even in board games like Go (with a much more constrained strategy space than reality) it is computationally impossible to consider all possible future opponent strategies, and thus against a near-peer adversary action values still carry high uncertainty. Do you just think that “game theory that allows an AGI to compute general-equilibrium solutions and certify dominant strategies for multi-agent games as complex as AGI war” is a computationally tractable thing for an Earth-bound AGI?
If that’s a crux, I wonder if we can find some hardness proofs of different games and see what it looks like on simpler environments.
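As one concrete baby step in that direction (a sketch of mine, not something proposed in the thread): for tiny two-player zero-sum games you really can compute and certify an equilibrium exactly with a linear program; the interesting empirical question is how quickly that kind of certification becomes hopeless as the game grows toward anything “AGI-war”-shaped.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(payoff):
    """Exact maximin (equilibrium) strategy for the row player of a zero-sum
    matrix game, via the standard LP: maximize v subject to the mixed
    strategy x guaranteeing payoff at least v against every column."""
    A = np.asarray(payoff, dtype=float)
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # minimize -v, i.e. maximize v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])      # -A^T x + v <= 0 for each column
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]      # x >= 0, v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Rock-paper-scissors: equilibrium is uniform play, game value 0.
strategy, value = solve_zero_sum([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(strategy, value)
```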
EDIT: consider even the super-simple risk that B tries to destroy A, but A manages to send out a couple of near-light-speed probes into the galaxy/nearby galaxies just to inform any other currently-hiding AGIs about B’s historical conduct/untrustworthiness/refusal to live-and-let-live. If an alien AGI C ever encounters such a probe, it would update towards non-cooperation enough to permanently worsen B-C relations should they ever meet. In this sense, some permanent loss from war becomes certain, so long as B assigns ongoing nonzero probability to eventually encountering alien superintelligences.