To the extent that SGD can’t find the optimum, it hurts the performance of both the aligned model and the unaligned model. In some sense what we really want is a regret bound relative to the “best learnable model”; the argument for such a bound is heuristic but seems valid.
I’m not sure this goes through. In particular, the system you would otherwise deploy (presumably human brains, or some other kind of automated system) corresponds to a different (architecture + training data + etc.) combination, and it might do better than the “best learnable model” in your AI’s space. Perhaps what you really want is a regret bound relative to the aligned system you would use if you didn’t deploy your AI, not a regret bound between your AI model and the best learnable model in its own (architecture + training data + etc.) space.
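To make that concrete, here is a rough sketch of the two quantities I have in mind (the notation, including the performance measure \( V \), is mine and not part of the original argument). Let \( \mathcal{F} \) be the set of models learnable by your AI’s (architecture + training data + etc.) combination, let \( \hat{f} \in \mathcal{F} \) be the model you actually train, and let \( b \) be the aligned system you would deploy instead, which need not lie in \( \mathcal{F} \). The heuristic argument bounds

\[
\mathrm{Regret}_{\mathcal{F}}(\hat{f}) \;=\; \max_{f \in \mathcal{F}} V(f) \;-\; V(\hat{f}),
\]

whereas the quantity that seems to matter for the deployment decision looks more like

\[
\mathrm{Regret}_{b}(\hat{f}) \;=\; V(b) \;-\; V(\hat{f}).
\]

A bound on the first says nothing about the second whenever \( V(b) > \max_{f \in \mathcal{F}} V(f) \), i.e. whenever the counterfactual system beats the best learnable model in the AI’s space.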
That said, I’m really not familiar with regret bounds, and it could be that this is a non-concern.