Nice—thanks for this comment—how would the argument be summarised as a nice heading to go on this list? Maybe “Capabilities can be optimised using feedback but alignment cannot” (and feedback is cheap, and optimisation eventually produces generality)?
Maybe “Humans iteratively designing useful systems and fixing problems provide a robust feedback signal for capabilities, but not for alignment”?
(Also, I now realize that I left this out of the original comment because I assumed it was obvious, but to be explicit: basically any feedback signal on a reasonably-complex/difficult task will select for capabilities. That’s just instrumental convergence.)