Maybe “Humans iteratively designing useful systems and fixing problems provide a robust feedback signal for capabilities, but not for alignment”?
(Also, I now realize I left this out of the original comment because I assumed it was obvious, but to be explicit: basically any feedback signal on a reasonably complex or difficult task will select for capabilities. That's just instrumental convergence.)