Yes, I think part of the issue is the gap between being able to sample s2 given s1 and being able to evaluate the density of s2 given s1. However, I am not sure how we should score hypotheses that assign density to future observations (instead of predicting bits one at a time), since we will have difficulty computing P(observations so far | hypothesis).
Predicting the rewards directly seems to fix this issue. I don’t know if this solution can be generalized to environments other than betting environments.
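
To make the sampling/density gap concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration (the names `step`, `sample_s2`, `density_s2`, the hidden-bit count `K`, and the transition rule itself); it is not the setup from the post. The hypothesis draws hidden random bits and maps them deterministically to s2, so sampling is one cheap call, while the exact density requires enumerating every hidden setting.

```python
import itertools
import random

K = 12  # number of hidden random bits (hypothetical, chosen for illustration)


def step(s1, z):
    """Deterministic map from (s1, hidden bits z) to the next observation s2."""
    return (s1 + sum(z)) % 5


def sample_s2(s1, rng):
    """Sampling s2 given s1 is cheap: draw K hidden bits, apply the map once."""
    z = tuple(rng.getrandbits(1) for _ in range(K))
    return step(s1, z)


def density_s2(s2, s1):
    """Exact P(s2 | s1) needs a sum over all 2**K hidden-bit settings."""
    hits = sum(
        1
        for z in itertools.product((0, 1), repeat=K)
        if step(s1, z) == s2
    )
    return hits / 2 ** K


rng = random.Random(0)
draw = sample_s2(3, rng)                             # one draw: O(K) work
probs = [density_s2(s2, 3) for s2 in range(5)]       # each entry: O(2**K) work
```

In this toy the cost of one density evaluation doubles with each extra hidden bit while the cost of a sample grows only linearly, which is the kind of gap that makes P(observations so far | hypothesis) hard to compute for sampler-style hypotheses.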