I think this is only right if we assume that we’ve solved alignment. Otherwise you might not be able to train a specialised AI that is loyal to your faction.
Here’s how I imagine Evan’s conclusions failing in a very CAIS-like world:
1. Maybe we can align models that do supervised learning, but can’t align RL, so we’ll have humans+GPT-N competing against a rogue RL-agent that someone created. (And people initially trained both of these because GPT-N makes for a better chatbot, while the RL agent seemed better at making money-maximizing decisions at companies.)
2. A mesa-optimiser arising in GPT-N may be very dissimilar to a money-maximising RL-agent, but they may still end up in conflict. Neither of them can add an analogue of the other to its team, because they don’t know how to align it.
3. If we use lots of different methods for training lots of different specialised models, any one of them can produce a warning shot (which would ideally make us suspect all other models). Also, they won’t really understand or be able to coordinate with the other systems.
4. It’s not as important whether the first advanced AI system is aligned, since there will be lots of different systems of different types. Even if everyone is training unaligned chatbots, you still care about aligning everyone’s personal assistants.
Thanks! I’m not sure I’m following everything you said, but I like the ideas. Just to be clear, I wasn’t imagining the AIs on the team of a faction to all be aligned necessarily. In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn’t really apply. Like AlphaFold2. Also, I think the relevant variable for homogeneity isn’t whether we’ve solved alignment—maybe it’s whether the people making AI think they’ve solved alignment. If the Chinese and US militaries think AI risk isn’t a big deal, and build AGI generals to prosecute the cyberwar, they’ll probably use similar designs, even if actually the generals are secretly planning treacherous turns.
> In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn’t really apply.
Ah, yeah, for the purposes of my previous comment I count this as being aligned. If we only have tool AIs (or otherwise alignable AIs), I agree that Evan’s conclusion 2 follows (while the others aren’t relevant).
> I think the relevant variable for homogeneity isn’t whether we’ve solved alignment—maybe it’s whether the people making AI think they’ve solved alignment
So for homogeneity-of-factions, I was specifically trying to say that alignment is necessary to have multiple non-tool AIs on the same faction, because at some point, something must align them all to the faction’s goals.
However, I’m now noticing that this requirement is weaker than what we usually mean by alignment. For our purposes, we want to be able to align AIs to human values. However, for the purpose of building a faction, it’s enough if there exists an AI that can align other AIs to its values, which may be much easier.
Concretely, my best guess is that you need inner alignment, since failure of inner alignment probably produces random goals, which means that multiple inner-misaligned AIs are unlikely to share goals. However, outer alignment is much easier for easily-measurable values than for human values, so I can imagine a world where we fail outer alignment, unthinkingly create an AI that only cares about something easy (e.g. maximizing money), and then that AI can easily create other AIs that want to help it (with maximizing money).
> Concretely, my best guess is that you need inner alignment, since failure of inner alignment probably produces random goals, which means that multiple inner-misaligned AIs are unlikely to share goals.
I disagree with this. I don’t expect a failure of inner alignment to produce random goals, but rather to systematically produce goals which are simpler/faster proxies of what we actually want. That is to say, while I expect the goals to look random to us, I don’t actually expect them to differ that much between training runs, since in my opinion it’s more about your training process’s inductive biases than about inherent randomness in the training process.
This is helpful, thanks. I’m not sure I agree that for something to count as a faction, the members must be aligned with each other. I think it still counts if the members have wildly different goals but are temporarily collaborating for instrumental reasons, or even if several of the members are secretly working for the other side. For example, in WW2 there were spies on both sides, as well as many people (e.g. most ordinary soldiers) who didn’t really believe in the cause and would happily defect if they could get away with it. Yet the overall structure of the opposing forces was very similar, from the fighter aircraft designs, to the battleship designs, to the relative proportions of fighter planes and battleships, to the way they were integrated into command structure.