If you have access to the episode rewards, you should be able to train an ensemble of NNs using Bayes + MCMC, with the final reward as output and the entire episode as input. Maybe using something like this: http://people.ee.duke.edu/~lcarin/sgnht-4.pdf
This gets a lot more difficult if you’re trying to directly learn behaviour from rewards or vice versa, because now you need to make assumptions to derive “P(behaviour | reward)” or “P(reward | behaviour)”.
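To make the first idea a bit more concrete, here’s a rough sketch of an ensemble of reward predictors that map a whole episode to its final reward. To keep it short I’m using independently-initialised networks trained with plain SGD as a crude stand-in for posterior samples; the linked SGNHT paper would instead draw the ensemble members as samples from an SG-MCMC chain. The shapes, names and hyperparameters are all illustrative assumptions, not anything from the paper.

```python
import torch
import torch.nn as nn

EPISODE_LEN = 50      # assumed fixed episode length
OBS_ACT_DIM = 12      # assumed dimension of concatenated (state, action) features
ENSEMBLE_SIZE = 5

class RewardNet(nn.Module):
    """Maps a whole episode to a single predicted final reward."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                  # (B, T, D) -> (B, T*D)
            nn.Linear(EPISODE_LEN * OBS_ACT_DIM, 128),
            nn.ReLU(),
            nn.Linear(128, 1),                             # predicted final reward
        )

    def forward(self, episodes):
        return self.net(episodes).squeeze(-1)

def train_ensemble(episodes, final_rewards, epochs=200):
    """episodes: (N, T, D) tensor, final_rewards: (N,) tensor."""
    ensemble = [RewardNet() for _ in range(ENSEMBLE_SIZE)]
    for model in ensemble:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(episodes), final_rewards)
            loss.backward()
            opt.step()
    return ensemble

def predict_with_uncertainty(ensemble, episodes):
    """Mean and variance of the predicted reward across ensemble members."""
    with torch.no_grad():
        preds = torch.stack([m(episodes) for m in ensemble])   # (E, N)
    return preds.mean(dim=0), preds.var(dim=0)
```

The ensemble variance then gives you a cheap uncertainty estimate over episodes, which is exactly the quantity you’d want for the active-learning trick mentioned below.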
Edit: Pretty sure OAI used a reward ensemble in https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/ to generate candidate pairs for further data collection.
From the paper: “we sample a large number of pairs of trajectory segments of length k, use each reward predictor in our ensemble to predict which segment will be preferred from each pair, and then select those trajectories for which the predictions have the highest variance across ensemble members.”
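Here’s roughly what that query-selection step could look like, reusing the ensemble sketch above. Each member scores both segments in a pair, the score difference is turned into a preference probability (Bradley–Terry style, as in the OpenAI paper), and you keep the pairs whose predicted preferences disagree most across the ensemble. The segment shapes and the `ensemble` argument follow the assumptions of the previous sketch, so treat this as illustration rather than a reproduction of their code.

```python
import torch

def select_queries(ensemble, seg_a, seg_b, num_queries=10):
    """seg_a, seg_b: (N, T, D) tensors of candidate trajectory segment pairs."""
    with torch.no_grad():
        # P(segment A preferred) according to each ensemble member: (E, N)
        prefs = torch.stack([
            torch.sigmoid(m(seg_a) - m(seg_b)) for m in ensemble
        ])
    disagreement = prefs.var(dim=0)                   # variance across members, (N,)
    top = torch.topk(disagreement, k=num_queries).indices
    return top                                        # indices of pairs to show the human
```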