I don’t quite have the exact specification of what I have in mind yet. Fortunately, this seems like a problem that I could try to address in a toy model with current techniques, so I can think about this a bit more and try to come up with a concrete system which would work.
I think it should be possible to construct a reinforcement learning agent which can make use of side information. One proposal would be something like: construct A so that it internally estimates (with good uncertainty models) F(a,r) and B(a), and uses those estimates to predict and maximize E[B′(a)] (perhaps using something as simple as a Monte Carlo simulation run for a large number of draws). Then allow A, as an action or as a prelude to acting, to ask questions about the value of F(a,r) for specific (a,r) pairs, or for estimates of F(a) (with some model allowing that those estimates of F(a) might be incorrect). If A performs correct value-of-information calculations, it should find it worthwhile to ask these questions during training and so learn the correct values, even if it never experiences a situation where it is caught and punished.
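To make this slightly more concrete, here is a minimal toy sketch of the kind of agent I have in mind. Everything specific in it is my own assumption rather than part of the proposal: finite action and context sets, independent Gaussian beliefs over F(a,r) and B(a), a user-supplied rule b_prime standing in for however B′(a) depends on B and F, a crude variance-based proxy for value of information, and a hypothetical oracle answering questions about F(a,r).

```python
# Toy sketch only, not a definitive implementation of the proposal above.
import numpy as np

rng = np.random.default_rng(0)


class Belief:
    """Gaussian belief over a scalar, updated from (assumed noisy) answers."""

    def __init__(self, mean=0.0, var=1.0, noise=0.05):
        self.mean, self.var, self.noise = mean, var, noise

    def sample(self, n):
        return rng.normal(self.mean, np.sqrt(self.var), size=n)

    def update(self, obs):
        # Conjugate Gaussian update with known observation noise.
        prec = 1.0 / self.var + 1.0 / self.noise
        self.mean = (self.mean / self.var + obs / self.noise) / prec
        self.var = 1.0 / prec


class Agent:
    def __init__(self, actions, contexts, b_prime, n_draws=5000):
        self.actions, self.contexts, self.b_prime = actions, contexts, b_prime
        self.n_draws = n_draws
        # Internal uncertain estimates of F(a, r) and B(a).
        self.F = {(a, r): Belief() for a in actions for r in contexts}
        self.B = {a: Belief() for a in actions}

    def expected_b_prime(self, a):
        # Monte Carlo estimate of E[B'(a)]: draw from the posteriors over
        # B(a) and F(a, r), combine them with b_prime, average over contexts.
        total = 0.0
        for r in self.contexts:
            b = self.B[a].sample(self.n_draws)
            f = self.F[(a, r)].sample(self.n_draws)
            total += self.b_prime(b, f).mean()
        return total / len(self.contexts)

    def act(self, oracle, ask_budget=3):
        # Cheap value-of-information proxy: ask about the F(a, r) pairs the
        # agent is most uncertain about before committing to an action. A
        # full VOI calculation would compare expected payoff with and without
        # each possible answer.
        for (a, r) in sorted(self.F, key=lambda k: -self.F[k].var)[:ask_budget]:
            self.F[(a, r)].update(oracle(a, r))
        return max(self.actions, key=self.expected_b_prime)


# Toy usage (my own invented setup): B'(a) heavily penalizes actions the
# overseer would disapprove of (F < 0), so the agent learns to avoid them
# purely by asking, without ever being caught and punished in experience.
agent = Agent(actions=["honest", "deceptive"], contexts=["audit", "no_audit"],
              b_prime=lambda b, f: b + np.minimum(f, 0.0) * 10.0)
true_F = {("honest", "audit"): 1.0, ("honest", "no_audit"): 1.0,
          ("deceptive", "audit"): -1.0, ("deceptive", "no_audit"): -1.0}
print(agent.act(oracle=lambda a, r: true_F[(a, r)]))  # prints "honest"
```

Even in this crude form, the point survives: the questions are only worth asking because the agent's uncertainty model makes the answers relevant to its estimate of E[B′(a)], which is exactly the value-of-information behaviour the proposal relies on.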