Yes, but I think we can say something a bit stronger than that: AIs competing with each other will be homogeneous. Here’s my current model, at least. Say the competition for control of the future involves N skills: persuasion, science, engineering, etc. Even if we suppose it’s most efficient to design separate AIs for each skill, rather than a smaller number of AIs with multiple skills each, insofar as there are factions competing for control of the future, each faction will have an AI for each of the skills; leaving one out would mean being unable to compete. So each faction will consist of a group of AIs working together that collectively has all the relevant skills. And each of those AIs will be designed to be good at the skill it’s assigned, so (via the principle you articulated) each AI will be similar to the other-faction AIs it directly competes with, and the factions as a whole will be pretty similar too, since they’ll be collections of similar AIs. (Compare militaries: not only were the fighter planes similar, the trucks similar, and the battleships similar, but the armed forces of Japan, the USA, the USSR, etc. were similar overall. Contrast this with e.g. the conquistadors vs. the Aztecs, or, in sci-fi, the Protoss vs. the Zerg.)
I think this is only right if we assume that we’ve solved alignment. Otherwise you might not be able to train a specialised AI that is loyal to your faction.
Here’s how I imagine Evan’s conclusions failing in a very CAIS-like world:
1. Maybe we can align models trained with supervised learning but can’t align RL agents, so we’ll have humans + GPT-N competing against a rogue RL agent that someone created. (People initially trained both because GPT-N makes for a better chatbot, while the RL agent seemed better at making money-maximising decisions at companies.)
2. A mesa-optimiser arising in GPT-N may be very dissimilar to a money-maximising RL agent, but they may still end up in conflict. Neither can add an analogue of the other to its team, because neither knows how to align it.
3. If we use lots of different methods to train lots of different specialised models, any one of them can produce a warning shot (which would ideally make us suspect all the other models). Also, the models won’t really understand or be able to coordinate with one another.
4. It matters less whether the first advanced AI system is aligned, since there will be lots of different systems of different types. If everyone is training unaligned chatbots, you still care about aligning everyone’s personal assistants.
Thanks! I’m not sure I’m following everything you said, but I like the ideas. Just to be clear, I wasn’t imagining that the AIs on a faction’s team would necessarily all be aligned. In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn’t really apply. Like AlphaFold2. Also, I think the relevant variable for homogeneity isn’t whether we’ve solved alignment—maybe it’s whether the people making AI think they’ve solved alignment. If the Chinese and US militaries think AI risk isn’t a big deal, and build AGI generals to prosecute the cyberwar, they’ll probably use similar designs, even if the generals are actually secretly planning treacherous turns.
In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn’t really apply.
Ah, yeah, for the purposes of my previous comment I count this as being aligned. If we only have tool AIs (or otherwise alignable AIs), I agree that Evan’s conclusion 2 follows (while the others aren’t relevant).
I think the relevant variable for homogeneity isn’t whether we’ve solved alignment—maybe it’s whether the people making AI think they’ve solved alignment
So for homogeneity-of-factions, I was specifically trying to say that alignment is necessary to have multiple non-tool AIs on the same faction, because at some point, something must align them all to the faction’s goals.
However, I’m now noticing that this requirement is weaker than what we usually mean by alignment. For our purposes, we want to be able to align AIs to human values. But for the purpose of building a faction, it’s enough if there exists an AI that can align other AIs to its own values, which may be much easier.
Concretely, my best guess is that you need inner alignment, since failure of inner alignment probably produces random goals, which means that multiple inner-misaligned AIs are unlikely to share goals. However, outer alignment is much easier for easily-measurable values than for human values, so I can imagine a world where we fail at outer alignment, unthinkingly create an AI that only cares about something easy (e.g. maximising money), and then that AI can easily create other AIs that want to help it (with maximising money).
Concretely, my best guess is that you need inner alignment, since failure of inner alignment probably produces random goals, which means that multiple inner-misaligned AIs are unlikely to share goals.
I disagree with this. I don’t expect a failure of inner alignment to produce random goals, but rather to systematically produce goals which are simpler/faster proxies of what we actually want. That is to say, while I expect the goals to look random to us, I don’t actually expect them to differ that much between training runs, since in my opinion it’s more about the training process’s inductive biases than about inherent randomness in the training process.
This is helpful, thanks. I’m not sure I agree that for something to count as a faction, the members must be aligned with each other. I think it still counts if the members have wildly different goals but are temporarily collaborating for instrumental reasons, or even if several of the members are secretly working for the other side. For example, in WW2 there were spies on both sides, as well as many people (e.g. most ordinary soldiers) who didn’t really believe in the cause and would happily defect if they could get away with it. Yet the overall structure of the opposing forces was very similar, from the fighter aircraft designs, to the battleship designs, to the relative proportions of fighter planes and battleships, to the way they were integrated into the command structure.