Their empirical result rhymes with adversarial robustness issues: we can train adversaries to maximise ~arbitrary functions subject to a constraint that the output stays within a small perturbation of some ground truth. Here the maximised function is a faulty reward model, and the constraint is KL to a base model instead of distance to a ground truth image.
I wonder if multiscale aggregation could help here too, as it does with image adversarial robustness. We want the KL penalty to ensure that generations look normal at any “scale”, whether we look at them token by token or read a high-level summary of them. However, I suspect their “weird, low-KL” generations will have weird high-level summaries, whereas more desirable policies would look more normal in summary (though it’s not immediately obvious whether this translates to low- and high-probability summaries respectively; one would need to test). I think a KL penalty to the “true base policy” should operate this way automatically, but as the authors note, we can’t actually implement that.
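To make the "scale" idea concrete, here is a toy sketch, not anything from the paper. It treats a policy as an explicit distribution over two-token sequences and measures KL to a base distribution both on full sequences and on a coarse summary of them. The names `pushforward`, `multiscale_kl` and `is_mixed`, the toy probabilities, and the weighted-sum aggregation are all my own assumptions for illustration.

```python
import math
from collections import defaultdict


def kl(p, q):
    """KL(p || q) in nats for dicts mapping outcome -> probability.

    Assumes q assigns positive probability wherever p does.
    """
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def pushforward(dist, summarise):
    """Distribution of summarise(x) when x is drawn from dist."""
    out = defaultdict(float)
    for x, pr in dist.items():
        out[summarise(x)] += pr
    return dict(out)


def multiscale_kl(policy, base, summarisers, weights):
    """Hypothetical aggregate: weighted sum of KL measured at each scale."""
    return sum(
        w * kl(pushforward(policy, s), pushforward(base, s))
        for s, w in zip(summarisers, weights)
    )


# Toy distributions over two-token sequences (made-up numbers).
base = {("a", "a"): 0.5, ("a", "b"): 0.1, ("b", "a"): 0.1, ("b", "b"): 0.3}
policy = {("a", "a"): 0.25, ("a", "b"): 0.25, ("b", "a"): 0.25, ("b", "b"): 0.25}


def identity(seq):
    """Finest scale: the full sequence."""
    return seq


def is_mixed(seq):
    """Hypothetical coarse summary: do the two tokens differ?"""
    return seq[0] != seq[1]


fine = kl(policy, base)
coarse = kl(pushforward(policy, is_mixed), pushforward(base, is_mixed))
total = multiscale_kl(policy, base, [identity, is_mixed], [1.0, 1.0])

print(f"sequence-level KL: {fine:.4f}")
print(f"summary-level KL:  {coarse:.4f}")
print(f"multiscale sum:    {total:.4f}")
```

When both scales are measured against the same reference distribution, the summary-level KL can never exceed the sequence-level KL (this is the data processing inequality), which is one way to read the remark that a penalty to the true base policy would cover every scale automatically. So for a multiscale penalty to add anything in practice, I'd guess the summary-level reference would have to be estimated separately from the token-level base model, and that is an untested assumption on my part.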