For the first point, I agree that SGD pushes towards closing any gaps. My concern is that, at the moment, we don’t know how small the gaps need to be to get the desired behavior (this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.
For the second point, I think we are also in agreement. The training process could lead the AI to learn “If I predict that this action will destroy the world, the humans won’t choose it”, which would then lead to dishonest predictions. However, I find it somewhat more plausible that the training process converges to a mesa-optimizer for the training objective (or something sufficiently close).