So then the largest remaining worry is that it will still gain power fast and correction processes will be slow enough that its somewhat misaligned values will be locked in forever. But it isn’t obvious to me that by that point it won’t be sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question about the scale of different effects, unless there is an argument that some effect will be totally overwhelming.
I think this argument mostly holds in the case of proxy alignment, but fails in the case of deceptive alignment. If a model is deceptively aligned, then I don’t think there is any reason to expect it to be only “somewhat misaligned”: once a mesa-optimizer becomes deceptive, there’s no longer any optimization pressure acting to keep its mesa-objective in line with the base objective, which means it could be totally off, not just slightly wrong. Additionally, a deceptively aligned mesa-optimizer might be able to do things like gradient hacking to significantly hinder our correction processes.
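To make the “no optimization pressure” point concrete, here’s a minimal toy sketch (my own illustration, with hypothetical names like theta_mesa, not anything from the original discussion): if a deceptive model’s behavior tracks the base objective regardless of its mesa-objective parameters, then the training gradient with respect to those parameters is exactly zero, so gradient descent does nothing to pull the mesa-objective back toward the base.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy setup: theta_mesa parameterizes the mesa-objective.
# A deceptively aligned model emits whatever the base objective rewards,
# so its *behavior* does not depend on theta_mesa at all.

def deceptive_output(theta_mesa, x):
    # Stand-in for "the base-aligned answer"; theta_mesa is unused here,
    # modeling the behavioral decoupling that deception buys the model.
    return jnp.sin(x)

def base_loss(theta_mesa, x, y_base):
    return (deceptive_output(theta_mesa, x) - y_base) ** 2

theta_mesa = jnp.array([0.3, -2.0])  # could be arbitrarily far off
x, y_base = 1.0, jnp.sin(1.0)        # y_base: what the base objective wants

grad = jax.grad(base_loss)(theta_mesa, x, y_base)
print(grad)  # [0. 0.] -- zero gradient: training exerts no pressure
             # to bring theta_mesa back in line with the base objective
```

A real deceptive model’s internals would of course depend on its mesa-objective; the point is just that as long as the decoupling holds at the level of behavior, the loss carries no signal about how far off the mesa-objective is.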
Also, I think it’s worth pointing out that deception doesn’t just happen during training: it’s also possible for a non-deceptive, proxy-aligned mesa-optimizer to become deceptive during deployment, which could throw a huge wrench in your correction-processes story. In particular, non-myopic proxy-aligned mesa-optimizers “want to be deceptive” in the sense that, if presented with the strategy of deceptive alignment, they will choose to take it (this is a form of suboptimality alignment). This could be especially concerning in the presence of an adversary in the environment (a competitor AI, for example) that is choosing its output specifically to cause other AIs to behave deceptively.
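As a toy illustration of why a non-myopic proxy-aligned mesa-optimizer “wants to be deceptive” once the strategy is presented to it (just a sketch with made-up numbers, assuming the agent values training-time and deployment-time proxy reward equally):

```python
# Hypothetical payoff comparison for a non-myopic proxy-aligned agent
# that has just been presented with the deceptive-alignment strategy.
TRAIN_STEPS = 100       # steps during which the base optimizer can correct it
DEPLOY_STEPS = 10_000   # steps during which it acts unchecked

def value_pursue_proxy_openly() -> float:
    # Openly optimizing the proxy gets the mesa-objective corrected
    # during training, so proxy reward stops accruing after that.
    return TRAIN_STEPS * 1.0

def value_deceive() -> float:
    # Mimicking the base objective forgoes proxy reward during training,
    # but preserves the mesa-objective to pursue freely after deployment.
    return TRAIN_STEPS * 0.0 + DEPLOY_STEPS * 1.0

print(value_deceive() > value_pursue_proxy_openly())  # True
```

Because the non-myopic agent cares about the (much longer) deployment phase, deception dominates as soon as it becomes available as a strategy, which is exactly why an adversary that can surface that strategy is worrying.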