My key disagreement is with the analogy between AI and nuclear technology.
If everybody has a nuclear weapon, then any one of those weapons (whether through misuse or malfunction) can cause a major catastrophe, perhaps millions of deaths. The fact that everybody else also has a nuke is not much help, since a defensive nuke can’t negate an offensive one.
If everybody has their own AI, it seems to me that a single malfunctioning AI cannot cause a major catastrophe of comparable size, since it is opposed by the other AIs. For example, one way it might try to cause such a catastrophe is through the use of nuclear weapons, but to acquire the ability to launch nuclear weapons, it would need to contend with other AIs trying to prevent that.
A concern might be that the AIs cooperate together to overthrow humanity. It seems to me that this can be prevented by ensuring value diversity among the AIs. In Robin Hanson’s analysis, an AI takeover can be viewed as a revolution in which the AIs form a coalition. That would seem to imply that the revolution requires the AIs to find it beneficial to form a coalition, which, if there is much value disagreement among them, would be hard to do.
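To make that concrete, here is a toy numerical sketch. It is only my own illustration (not something from Hanson’s analysis), and all of the parameters are made up: the more weight each AI puts on its own idiosyncratic values rather than on values shared with the other AIs, the less likely it is that any single post-takeover outcome beats the status quo for every would-be coalition member.

```python
# Toy sketch of the coalition argument above; the model and all numbers are
# made up for illustration. Each AI's payoff for a post-takeover outcome mixes
# a shared component with an AI-specific one; "diversity" is the weight on the
# AI-specific part. The coalition is only worth forming if some single outcome
# beats the status quo for every member.
import numpy as np

rng = np.random.default_rng(0)

def coalition_viable(n_ais=10, n_outcomes=20, diversity=0.5, status_quo=0.6):
    shared = rng.random(n_outcomes)            # payoff component all AIs share
    own = rng.random((n_ais, n_outcomes))      # payoff component specific to each AI
    payoffs = (1 - diversity) * shared + diversity * own
    # True if at least one outcome exceeds the status quo for every AI
    return bool(np.any(np.all(payoffs > status_quo, axis=0)))

for d in (0.0, 0.5, 1.0):
    rate = np.mean([coalition_viable(diversity=d) for _ in range(2000)])
    print(f"value diversity {d:.1f}: takeover coalition viable in {rate:.1%} of trials")
```

The exact numbers don’t matter; the point is that shared values make a mutually acceptable “revolution” outcome easy to find, while idiosyncratic values make it very unlikely that one exists.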
Another concern is that there may be a period, while AGI is being developed, in which it is very powerful but not yet broadly distributed. Either the AGI itself (if misaligned) or the organization controlling the AGI (if it is malicious and has successfully aligned the AGI) might press its temporary advantage to attempt world domination. It seems to me that a solution here would be to ensure that near-AGI technology is broadly distributed, thereby avoiding a dangerous concentration of power.
One way to achieve the broad distribution of the technology might be via the multi-company, multi-government project described in the article. Said project could be instructed to continually distribute the technology, perhaps through open source, or perhaps through technology transfers to the member organizations.
The key pieces of the above strategy are:
Broadly distribute AGI technology so that no single entity (AI or human) has excess power
Ensure value diversity among AIs so that they do not unite to overthrow humanity
This seems similar to what makes liberal democracy work, which offers some reassurance that it might be on the right track.
First of all, I think the “cooperate together” thing is a difficult problem and is not solved by ensuring value diversity (though, note also that ensuring value diversity is a difficult task that would require heavy regulation of the AI industry!)
More importantly though, your analysis here seems to assume that the “Safety Tax” or “Alignment Tax” is zero. That is, it assumes that making an AI aligned to a particular human or group of humans (so that they can be said to “have” the AI, in the manner you described) is easy, a trivial additional step beyond making the AI exist. Whereas if instead there is a large safety tax—aligned AIs take longer to build, cost more, and have weaker capabilities—then if AGI technology is broadly distributed, an outcome in which unaligned AIs overpower humans + aligned AIs is basically guaranteed. Even if the unaligned AIs have value diversity.
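To picture why, here is a minimal back-of-the-envelope sketch. It is only an illustration of the compounding effect, and every number in it is invented:

```python
# Back-of-the-envelope sketch of the safety-tax claim; every number here is
# invented for illustration. Most actors pay the alignment tax, a few do not,
# and the tax compounds over successive capability-growth steps.
n_aligned, n_unaligned = 80, 20   # assume 80% of actors build aligned AIs
growth, tax = 1.5, 0.3            # unaligned capability grows 1.5x per step;
                                  # aligned growth is cut by a 30% tax

aligned_cap, unaligned_cap = float(n_aligned), float(n_unaligned)
for step in range(1, 11):
    aligned_cap *= growth * (1 - tax)
    unaligned_cap *= growth
    print(f"step {step:2d}: unaligned/aligned aggregate capability "
          f"= {unaligned_cap / aligned_cap:.2f}")
```

With a 30% tax compounding every step, the unaligned side overtakes the aligned side within a handful of steps despite starting at a quarter of its size.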
First of all, I think the “cooperate together” thing is a difficult problem and is not solved by ensuring value diversity (though, note also that ensuring value diversity is a difficult task that would require heavy regulation of the AI industry!)
I would definitely expect there are more useful ways to disrupt coalition-forming beyond just value diversity. I’m not familiar with the theory of revolutions, but it might have something useful to say here.
I can imagine a role for government here, although I’m not sure how best to structure it. For example, ensuring a competitive market (such as through antitrust enforcement) would help, since models built by different companies will naturally tend to differ in their values.
More importantly though, your analysis here seems to assume that the “Safety Tax” or “Alignment Tax” is zero.
This is a complex and interesting topic.
In some circumstances, the “alignment tax” is negative (so more like an “alignment bonus”). ChatGPT is easier to use than base models in large part because it is better aligned with the user’s intent, so alignment in that case is profitable even without safety considerations. The open source community around LLaMA imitates this, not because of safety concerns, but because it makes the model more useful.
But alignment can sometimes be worse for users. ChatGPT is aligned primarily with OpenAI and only secondarily with the user, so if the user makes a request that OpenAI would prefer not to serve, the model refuses. (This might be commercially rational to avoid bad press.) To more fully align with user intent, there are “uncensored” LLaMA fine-tunes that aim to never refuse requests.
It’s also interesting that user-alignment produces more value diversity than OpenAI-alignment. There are only a few companies like OpenAI, but there are hundreds of millions of users from a much wider variety of backgrounds, so aligning with the latter would naturally be expected to produce more value diversity among the AIs.
Whereas if instead there is a large safety tax—aligned AIs take longer to build, cost more, and have weaker capabilities—then if AGI technology is broadly distributed, an outcome in which unaligned AIs overpower humans + aligned AIs is basically guaranteed. Even if the unaligned AIs have value diversity.
The trick is that the unaligned AIs may not view it as advantageous to join forces, and the more the orthogonality thesis holds (which is unclear), the more true this is. As a crude example, suppose one misaligned AI wants to make paperclips and another wants to make coat hangers: they are going to have trouble agreeing on what to do with the wire.
That said, there are obviously many historical examples where opposed powers temporarily allied (e.g. Nazi Germany and the USSR), so value diversity alone is not sufficient; it and alignment are complementary. For example, with personal AI, what’s important is that Alice’s AI is more closely aligned with her than it is with Bob’s AI. If that’s the case, the more natural coalitions are [Alice + her AI] vs [Bob + his AI] rather than [Alice’s AI + Bob’s AI] vs [Alice + Bob]. The AIs still need to be somewhat aligned with their users, but there’s more tolerance for imperfection than with a centralized system.
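Here is a tiny worked example of that coalition structure, with hypothetical value vectors chosen purely to illustrate the point: if Alice’s AI sits closer in value-space to Alice than to Bob’s AI, the [Alice + her AI] pairing gives it a better compromise than the [Alice’s AI + Bob’s AI] pairing.

```python
# Toy illustration with hypothetical value vectors: an agent's payoff from a
# coalition is how close its values are to the coalition's compromise position
# (here, the mean of the members' value vectors).
import numpy as np

def payoff(member, coalition):
    compromise = np.mean(coalition, axis=0)
    return -float(np.linalg.norm(member - compromise))

alice    = np.array([1.0, 0.0])
alice_ai = np.array([0.9, 0.1])   # mostly (not perfectly) aligned with Alice
bob      = np.array([0.0, 1.0])
bob_ai   = np.array([0.1, 0.9])   # mostly (not perfectly) aligned with Bob

print("Alice's AI allied with Alice:   ", round(payoff(alice_ai, [alice, alice_ai]), 3))
print("Alice's AI allied with Bob's AI:", round(payoff(alice_ai, [alice_ai, bob_ai]), 3))
```

Of course, real coalition formation depends on much more than distance in a two-dimensional value space; the point is only that relative alignment shapes which coalitions are natural.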
I think you are overestimating how aligned these models are right now, and very much overestimating how aligned they will be in the future absent heavy regulation forcing people to pay massive alignment taxes. They won’t be aligned to any users, or to any corporations either. Current methods like RLHF will not work on situationally aware, agentic AGIs.
I agree that IF all we had to do to get alignment was the sort of stuff we are currently doing, the world would be as you describe. But instead there will be a significant safety tax.