I’m only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don’t think distributional shift applies.
Ah yep, that’s a good clarification.