Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a shared, non-zero-sum objective.
If training works well, then they can’t collude on average during training; collusion can happen only rarely, or in some sustained burst before training crushes these failures.
In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more benign failures of gradient descent, though it’s unclear why the goals of the AIs would be particularly relevant in that case).
In the RL case, it requires exploration hacking (or benign failures as in the gradient case).
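To make the supervised claim concrete, here’s a toy illustration (the linear model, learning rate, and data are arbitrary illustrative choices, not anything from the setup above): a predictor that starts out systematically wrong gets its average loss crushed by plain SGD, so staying wrong on average would require steering its own gradients.

```python
import numpy as np

# Toy setup (illustrative only): a linear predictor initialized with a large,
# systematic error, trained by plain gradient descent on mean squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w

w = true_w + 10.0  # start out badly wrong on average
lr = 0.05
for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)  # gradient of the average loss
    w -= lr * grad
    if step % 50 == 0:
        print(f"step {step:3d}: avg loss {np.mean((X @ w - y) ** 2):.6f}")

# The average loss collapses toward zero: under ordinary gradient descent,
# poor *average* performance is not a stable state unless the model can
# somehow manipulate its own gradients (i.e., gradient hacking).
```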
The only way to prevent this is for the decision maker to assign some probability to every possible action, regardless of how bad the predicted outcome is. This necessarily means bad outcomes will occur more frequently than they would if the decision maker could make deterministic decisions based on honest conditional predictions. We might reasonably say we don’t ever want to randomly take an action that leads to the extinction of humanity with high probability, but if that’s true, then a predictor can falsely predict extinction to dissuade us from any given action. Even if we would be willing to take such an action with very small probability in order to get honest conditional predictions, we likely cannot credibly commit to following through if our randomizer lands on it.
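As a minimal sketch of the decision rule being described (the `choose_action` helper, the epsilon value, and the example predictions are hypothetical, just to make the tension concrete):

```python
import random

def choose_action(predicted_utilities, epsilon=1e-3):
    """Epsilon-smoothed action selection (hypothetical sketch).

    With probability 1 - epsilon, take the action whose predicted outcome
    looks best; with probability epsilon, take a uniformly random action,
    even one predicted to be catastrophic. Keeping every action on the
    table is what stops a predictor from safely lying about it, at the
    cost of occasionally executing an action predicted to be terrible.
    """
    actions = list(predicted_utilities)
    if random.random() < epsilon:
        return random.choice(actions)  # exploration: any action, however bad its prediction
    return max(actions, key=predicted_utilities.get)  # exploitation: best predicted action

# Example: the predictor claims action "b" destroys the world.
preds = {"a": 0.7, "b": -1e9, "c": 0.4}
# Over many decisions, "b" is still executed with probability about epsilon / 3,
# which is exactly the commitment the paragraph above says we may be unable to keep.
print(choose_action(preds, epsilon=1e-3))
```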
Thinking about this in terms of precommitment seems to me like it’s presupposing that the AI perfectly optimizes the training objective in some deep sense (which seems implausible to me).
The reason why this exploration procedure works is presumably that you end up selecting such actions frequently during training, which in turn selects for AIs that perform well. Epsilon exploration only works if you actually sample the epsilon, so it doesn’t work if you set epsilon to 1e-40 or something.
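A quick back-of-the-envelope version of the 1e-40 point (the step counts here are made-up, purely for illustration):

```python
# How often does the epsilon-exploration branch actually fire over a training run?
def expected_exploration_steps(epsilon: float, training_steps: int) -> float:
    """Expected number of steps on which the random-exploration branch triggers."""
    return epsilon * training_steps

steps = 10**9  # a generous (made-up) number of training decisions
for eps in (0.1, 1e-3, 1e-40):
    print(f"epsilon={eps:>8}: ~{expected_exploration_steps(eps, steps):.3g} exploration steps")

# epsilon=0.1   -> ~1e+08 exploration steps: plenty of selection pressure
# epsilon=0.001 -> ~1e+06 exploration steps: still sampled
# epsilon=1e-40 -> ~1e-31 exploration steps: effectively never sampled,
#                  so it contributes no training signal at all
```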
For the first point, I agree that SGD pushes towards closing any gaps. My concern is that, at the moment, we don’t know how small the gaps need to be to get the desired behavior (and this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close over the course of training seem important to study further.
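As a toy version of the kind of modelling in question (a single scalar “gap” with assumed exponential-decay dynamics; the learning rate and tolerance are illustrative, not drawn from anything above):

```python
# Toy model: SGD shrinks the gap between the two models multiplicatively each
# step. The only point is to make "how do gaps close over training?" concrete.
def steps_to_close_gap(initial_gap: float, lr: float = 0.01, tol: float = 1e-3) -> int:
    gap, steps = initial_gap, 0
    while gap > tol:
        gap -= lr * gap  # each gradient step removes a fixed fraction of the gap
        steps += 1
    return steps

for g0 in (0.1, 1.0, 10.0):
    print(f"initial gap {g0:>4}: {steps_to_close_gap(g0)} steps to reach tol=1e-3")

# In this toy the step count grows only with log(initial_gap / tol), so a large
# starting gap isn't fatal; whether real training dynamics look anything like
# this is exactly the open question.
```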
For the second point, I think we are also in agreement: the worry is that the training process leads the AI to learn “If I predict that this action will destroy the world, the humans won’t choose it”, which then leads to dishonest predictions. However, I also find the training process converging to a mesa-optimizer for the training objective (or something sufficiently close) somewhat more plausible.