First of all, I think the “cooperate together” problem is a difficult one and is not solved by ensuring value diversity (note also that ensuring value diversity is itself a difficult task that would require heavy regulation of the AI industry!).
I would definitely expect there are more useful ways to disrupt coalition-forming than value diversity alone. I’m not familiar with the theory of revolutions, but it might have something useful to say here.
I can imagine a role for government here, although I’m not sure how best to fill it. For example, ensuring a competitive market (e.g. through antitrust enforcement) would help, since models built by different companies will naturally tend to differ in their values.
More importantly, though, your analysis here seems to assume that the “safety tax” or “alignment tax” is zero.
This is a complex and interesting topic.
In some circumstances, the “alignment tax” is negative (more like an “alignment bonus”). ChatGPT is easier to use than base models in large part because it is better aligned with the user’s intent, so alignment in that case is profitable even without safety considerations. The open-source community around LLaMA imitates this, not out of safety concerns but because it makes the model more useful.
But alignment can sometimes be worse for users. ChatGPT is aligned primarily with OpenAI and only secondarily with the user, so if the user makes a request that OpenAI would prefer not to serve, the model refuses. (This might be commercially rational to avoid bad press.) To more fully align with user intent, there are “uncensored” LLaMA fine-tunes that aim to never refuse requests.
It’s also interesting that user-alignment produces more value diversity than OpenAI-alignment. There are only a few companies like OpenAI, but there are hundreds of millions of users from a wider variety of backgrounds, so aligning with the latter would naturally be expected to create more value diversity among the AIs.
Whereas if instead there is a large safety tax (aligned AIs take longer to build, cost more, and have weaker capabilities), and AGI technology is broadly distributed, then an outcome in which unaligned AIs overpower humans + aligned AIs is basically guaranteed, even if the unaligned AIs have value diversity.
The trick is that the unaligned AIs may not view it as advantageous to join forces, and the more the orthogonality thesis holds (which is unclear), the more true this is. As a crude example, suppose one misaligned AI wants to make paperclips and another wants to make coat hangers: they’re going to have trouble agreeing on what to do with the wire.
That said, there are obviously many historical examples where opposed powers temporarily allied (e.g. Nazi Germany and the USSR), so value diversity alone isn’t enough to prevent coalitions; it and alignment are complementary. For example, in personal AI, what matters is that Alice’s AI is more closely aligned with her than it is with Bob’s AI. If that’s the case, the more natural coalitions would be [Alice + her AI] vs [Bob + his AI] rather than [Alice’s AI + Bob’s AI] vs [Alice + Bob]. The AIs still need to be somewhat aligned with their users, but there’s more tolerance for imperfection than with a centralized system.
I think you are overestimating how aligned these models are right now, and very much overestimating how aligned they will be in the future absent massive regulations forcing people to pay massive alignment taxes. They won’t be aligned to any users, or any corporations either. Current methods like RLHF will not work on situationally aware, agentic AGIs.
I agree that IF all we had to do to get alignment was the sort of stuff we are currently doing, the world would be as you describe. But instead there will be a significant safety tax.