the core counterargument I’d make is that it’s not hard to find situations where the reward signal imprecisely specifies the intended optimization target, and the divergence that remains when training completes causes severe loss of performance. I agree it’s ultimately a capability concern; but I think a good counterargument is https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924 because it contains demonstrations in which the failure has been constructed to occur reliably, so they can be analyzed to understand when it arises. we can then ask when a training scenario might produce the same sort of catastrophic, compounding misalignment: a perceptual correlate that appears to be causal of reward turns out to be nothing of the kind.
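a minimal toy sketch of that failure mode (my own hypothetical construction, not one of the demonstrations from the linked post): during training a perceptual cue always coincides with the rewarded option, so a policy that learned "chase the cue" is behaviorally indistinguishable from one that learned the true goal — until deployment breaks the correlation.

```python
import random

random.seed(0)

def make_episode(cue_on_goal: bool):
    """Return (cues, goal_index) for a 3-armed task.
    cues[i] is True where the perceptual cue appears."""
    goal = random.randrange(3)
    cues = [False, False, False]
    if cue_on_goal:
        # training distribution: cue perfectly correlates with the goal
        cues[goal] = True
    else:
        # deployment distribution: cue lands on a non-goal arm
        cues[random.choice([i for i in range(3) if i != goal])] = True
    return cues, goal

def cue_following_policy(cues):
    # the misgeneralised policy: it learned the correlate, not the goal
    return cues.index(True)

def average_reward(cue_on_goal: bool, episodes: int = 1000) -> float:
    total = 0
    for _ in range(episodes):
        cues, goal = make_episode(cue_on_goal)
        total += 1 if cue_following_policy(cues) == goal else 0
    return total / episodes

train_score = average_reward(cue_on_goal=True)
test_score = average_reward(cue_on_goal=False)

print(train_score)  # 1.0 — looks perfectly aligned during training
print(test_score)   # 0.0 — severe, systematic failure once the correlation breaks
```

the point of the toy: no amount of training-time evaluation distinguishes the two policies, which is exactly why the divergence only shows up after training completes.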