I don’t fully get your argument, but I’ll bite anyway. I think all of this is highly dependent on what you mean by “stronger model” and “alignment”.
I think it is easier to align a stronger model if "stronger" means it better understands the goals or values you're aligning it to, and if you can use that understanding to steer its decisions. In that case, its models of your goals/values become its goals/values. In this scenario, it's as aligned as those models are accurate, so a stronger model will be more aligned.
But that assumes you don't lose control of the model before you do that alignment. I think that's a real, valid concern, but it can be addressed by performing alignment early and often as the model is trained and gets smarter.
I recently wrote about this in The (partial) fallacy of dumb superintelligence. Risk-doubters share an intuition that it should be easy to align a smart AGI, and I argue there that this intuition is only partly wrong.