The error is that the KL divergence term doesn't mean adding a cost proportional to the log probability of the continuation. In fact, it can't be expressed at all as an argmax over a single continuation; it requires taking the argmax over a distribution of continuations.
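A toy numerical sketch of the distinction (the three continuations, their rewards, and the penalty coefficient are hypothetical values, not from the source): the KL-regularized objective argmax_π E_π[r] − β·KL(π‖π₀) has the closed-form optimum π*(x) ∝ π₀(x)·exp(r(x)/β), which is a full distribution, whereas "reward plus β·log π₀" over single continuations just picks one point.

```python
import math

# Hypothetical toy setup: three continuations, a base distribution, rewards.
base = {"a": 0.7, "b": 0.2, "c": 0.1}
reward = {"a": 1.0, "b": 2.0, "c": 0.5}
beta = 1.0  # KL penalty coefficient

# KL-regularized objective: argmax over DISTRIBUTIONS pi of
#   E_pi[reward] - beta * KL(pi || base).
# Its closed-form optimum pi*(x) ~ base(x) * exp(reward(x) / beta)
# is itself a distribution, not a single continuation.
weights = {x: base[x] * math.exp(reward[x] / beta) for x in base}
z = sum(weights.values())
pi_star = {x: w / z for x, w in weights.items()}

# Contrast: a per-continuation "cost proportional to log probability" is
#   argmax over a single x of reward(x) + beta * log base(x),
# which collapses to one continuation.
single_best = max(base, key=lambda x: reward[x] + beta * math.log(base[x]))

# The KL term E_pi[log pi(x) - log base(x)] contains pi's own log-probs
# (its negative entropy), so it cannot be attached as a cost to any one
# sampled continuation in isolation.
kl = sum(p * math.log(p / base[x]) for x, p in pi_star.items())
```

Note that π* spreads probability mass across all continuations rather than concentrating on the single argmax, which is exactly the point: the objective is only well-defined over distributions.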