I think the issue presented in the post is that the Solomonoff hypothesis cannot be sampled from, even though its probability density function can be computed. If we try to compute the expected reward of an action, we run into the curse of dimensionality: a single point contributes most of the reward. A Solomonoff inductor would correctly find a density function under which h(s_2) = s_1 with high probability.
However, I think that if we ask the Solomonoff predictor to predict the reward directly, it will correctly arrive at a model of the rewards. So the agent presented in the post can be fixed.
Yes, I think part of the issue is the gap between being able to sample s_2 given s_1 and being able to evaluate the density of s_2 given s_1. However, I am not sure how we should score hypotheses that assign a density to future observations (instead of predicting bits one at a time): computing P(observations so far | hypothesis) becomes difficult.
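To make the sample-vs-evaluate gap concrete, here is a minimal sketch (my own toy construction, not taken from the original post) of a betting-style environment where h is a cryptographic hash, so h(s_2) = s_1. Evaluating the density at a candidate point is trivial, but sampling from it requires inverting the hash, and a naive Monte Carlo estimate of expected reward misses the single point carrying all the mass:

```python
import hashlib
import secrets

# Hypothetical toy environment: the observation s_1 is the hash of the
# hidden next state s_2, i.e. h(s_2) = s_1.
def h(s_2: bytes) -> str:
    return hashlib.sha256(s_2).hexdigest()

hidden_s_2 = secrets.token_bytes(16)   # 128-bit hidden state
s_1 = h(hidden_s_2)                    # what the agent observes

# Evaluating the (degenerate) density of a candidate s_2 given s_1 is
# easy: just check the hash.
def density(candidate: bytes) -> float:
    return 1.0 if h(candidate) == s_1 else 0.0

# Sampling s_2 given s_1, however, means inverting the hash. A Monte
# Carlo estimate of expected reward (reward = density here) over random
# guesses almost surely never hits the one point with all the mass.
n = 10_000
mc_estimate = sum(density(secrets.token_bytes(16)) for _ in range(n)) / n

print(density(hidden_s_2))  # 1.0 -- evaluation at a given point is easy
print(mc_estimate)          # ~0.0 -- the sampling-based estimate finds nothing
```

This is the curse-of-dimensionality failure described above: the pointwise density is computable, yet any sampling-based estimate of the expectation is hopeless.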
Predicting the rewards directly seems to fix this issue, though I don't know whether this solution generalizes to environments other than betting environments.