Trying to think about this from more of a practical machine learning standpoint, but without full understanding of all the points you made...
I'm thinking about the case where X, Y, and S are all partially but not fully known, and you have to choose the model M so as to minimize regret over time. Two possibly useful strategies for choosing M occur to me. First, you might find opportunities to run ‘experiments’: deliberately choose a suboptimal output R at an early timestep so that from then on you'd have a better understanding of S, and be able to better refine M for future timesteps.
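Here's a minimal sketch of what I mean, framed as an epsilon-greedy bandit. This is entirely my own framing, not the post's: each "arm" stands in for a candidate model M, pulling an arm stands in for emitting an output R, and the hidden payoff table stands in for the unknown parts of S. The point is just that accepting some suboptimal choices early (high epsilon, annealed over time) buys better estimates later:

```python
import random

# Toy sketch of the 'experiment' idea as an epsilon-greedy bandit.
# All names here are my own invention for illustration.

TRUE_PAYOFFS = [0.3, 0.5, 0.8]   # hidden quality of each candidate M (stands in for S)

def pull(arm):
    """Observe a noisy reward for choosing candidate `arm`."""
    return TRUE_PAYOFFS[arm] + random.gauss(0, 0.1)

def run(steps=1000, eps0=0.5, decay=0.995):
    n_arms = len(TRUE_PAYOFFS)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    eps = eps0                     # lots of exploration early, annealed over time
    total_regret = 0.0
    best = max(TRUE_PAYOFFS)
    for _ in range(steps):
        if random.random() < eps:  # 'experiment': deliberately possibly-suboptimal R
            arm = random.randrange(n_arms)
        else:                      # exploit the current best estimate of S
            arm = max(range(n_arms), key=lambda a: means[a])
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean update
        total_regret += best - TRUE_PAYOFFS[arm]
        eps *= decay
    return means, total_regret

if __name__ == "__main__":
    means, regret = run()
    print("estimated payoffs:", [round(m, 2) for m in means])
    print("cumulative regret:", round(regret, 2))
```

Whether this framing actually matches the post's setup, I'm not sure; it's just the nearest standard ML analogue I know of.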
The other thing that occurs to me is that if the distributions of X and Y are unknown and you only have approximations of them, then fitting M perfectly to those approximations would likely be overfitting. If you know you have only approximations, you'd want to not fully minimize the training loss, but instead leave some ‘wiggle room’ for probable future cases of X and Y. You could even tailor this based on patterns you've seen in X and Y. For instance, if X were Gaussian noise, but its mean seemed to roughly follow a sine wave for n cycles, then a square wave for n cycles, then a triangular wave for n cycles, you'd be reasonably justified in guessing that a new repeating wave shape was coming, and that it would also repeat for n cycles. You'd want to leave room in the model to account for this.
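A quick sketch of the ‘wiggle room’ point, again under my own assumed setup rather than anything from the post: if today's data is only an approximation of the real distribution, an exact high-degree least-squares fit chases that approximation, while a ridge penalty (one standard way of leaving slack) tends to hold up better when X drifts slightly. The functions, the drift model, and the `shift` parameter below are all hypothetical choices of mine:

```python
import numpy as np

# Toy sketch: exact fit to an approximate distribution vs. a ridge fit
# that leaves 'wiggle room' for future drift in X. My setup, not the post's.

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    """Noisy samples of an underlying signal; `shift` mimics future drift in X."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x + shift) + rng.normal(0, 0.2, n)
    return x, y

def fit_poly(x, y, degree=12, ridge=0.0):
    """Polynomial least squares with an optional ridge penalty."""
    A = np.vander(x, degree + 1)
    w = np.linalg.solve(A.T @ A + ridge * np.eye(degree + 1), A.T @ y)
    return w

def mse(w, x, y):
    return np.mean((np.vander(x, len(w)) @ w - y) ** 2)

x_tr, y_tr = sample(30)                 # today's approximation of X, Y
x_fu, y_fu = sample(200, shift=0.3)     # tomorrow's slightly different reality

exact = fit_poly(x_tr, y_tr, ridge=0.0)     # fit the approximation as tightly as possible
slack = fit_poly(x_tr, y_tr, ridge=1e-2)    # leave wiggle room

print("future MSE, exact fit:", round(mse(exact, x_fu, y_fu), 3))
print("future MSE, with slack:", round(mse(slack, x_fu, y_fu), 3))
```

The wave-shape extrapolation I mentioned would go a step further than this, i.e. building the suspected pattern in X's mean into the model class itself rather than just penalizing complexity, but the regularized fit is the simplest version of the idea.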
Perhaps someone who understands the post more completely can let me know if I seem to be on the right track for understanding and extrapolating from this.