the core counterargument I’d make is that it’s not hard to find situations where the reward signal imprecisely specifies the intended optimization target, and the divergence that remains when training completes causes severe loss of performance. I agree it’s ultimately a capability concern; but I think a good counterargument is https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924 because it contains demonstrations in which the failure has been constructed to occur reliably, so they can be analyzed to understand when it arises. we can then ask when a training scenario might produce the same sort of catastrophic, compounding misalignment: a perceptual correlate that appears to be causal of reward turns out to be nothing of the kind.
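a minimal toy sketch of that failure mode (my own hypothetical construction, not one of the demonstrations from the linked post): during training a perceptual cue always coincides with the rewarded option, so a policy that learned "chase the cue" is behaviorally indistinguishable from one that learned the true goal — until deployment breaks the correlation.

```python
import random

random.seed(0)

def make_episode(cue_on_goal: bool):
    """Return (cues, goal_index) for a 3-armed task.
    cues[i] is True where the perceptual cue appears."""
    goal = random.randrange(3)
    cues = [False, False, False]
    if cue_on_goal:
        # training distribution: cue perfectly correlates with the goal
        cues[goal] = True
    else:
        # deployment distribution: cue lands on a non-goal arm
        cues[random.choice([i for i in range(3) if i != goal])] = True
    return cues, goal

def cue_following_policy(cues):
    # the misgeneralised policy: it learned the correlate, not the goal
    return cues.index(True)

def average_reward(cue_on_goal: bool, episodes: int = 1000) -> float:
    total = 0
    for _ in range(episodes):
        cues, goal = make_episode(cue_on_goal)
        total += 1 if cue_following_policy(cues) == goal else 0
    return total / episodes

train_score = average_reward(cue_on_goal=True)
test_score = average_reward(cue_on_goal=False)

print(train_score)  # 1.0 — looks perfectly aligned during training
print(test_score)   # 0.0 — severe, systematic failure once the correlation breaks
```

the point of the toy: no amount of training-time evaluation distinguishes the two policies, which is exactly why the divergence only shows up after training completes.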