What about some form of indirect supervision, where we aim to find transcripts in which H faces a decision of a particular hardness? A would ideally be trained starting on things that are very easy for H, with the hardness ramped up until A maxes out its abilities. Rather than imitating H, we use a generative technique to create fake transcripts, imitating both H and its environment. We can incorporate into our loss function the amount of time H spends on a particular decision, the reliability of that decision, and maybe some kind of complexity measure on the transcript, to find easier or harder situations which are of genuine importance to H.
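One way the hardness-ramping idea above could be sketched: combine the three proposed signals (deliberation time, reliability, transcript complexity) into a scalar hardness score, then select generated transcripts near a target hardness that rises as A improves. This is purely illustrative; the `Transcript` fields, weights, and `curriculum` helper are all hypothetical names, not anything from the original proposal.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    decision_time: float   # seconds H spent deliberating on the decision
    reliability: float     # estimated probability H's decision is correct, in [0, 1]
    complexity: float      # some complexity measure on the transcript (assumed given)

def hardness(t: Transcript,
             w_time: float = 1.0,
             w_rel: float = 1.0,
             w_cplx: float = 0.5) -> float:
    """Combine the signals into one hardness score: longer deliberation,
    lower reliability, and higher complexity all suggest a harder decision."""
    return (w_time * t.decision_time
            + w_rel * (1.0 - t.reliability)
            + w_cplx * t.complexity)

def curriculum(transcripts, target: float, tolerance: float = 0.5):
    """Keep transcripts whose hardness is near the current target;
    the trainer raises `target` as A's performance plateaus."""
    return [t for t in transcripts if abs(hardness(t) - target) <= tolerance]
```

For example, a quick decision H makes reliably (`Transcript(1.0, 0.99, 0.2)`) scores far lower than a slow, unreliable one (`Transcript(10.0, 0.5, 2.0)`), so an early-stage `curriculum(..., target=1.0)` would select only the former. The weights are arbitrary; in practice they would need tuning, and the complexity measure itself is left unspecified here, as in the original text.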