I don’t understand why we want to find this X* in the imitation learning case.
Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X*, let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.
The reverse mapping (imitation to RL) just consists of applying reward 1 to M2’s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.
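As a rough sketch of that direction (illustrative names only; `demonstrated_output` stands in for whatever M2 demonstrates, e.g. the results of a safe search):

```python
# Sketch of the imitation-to-RL reward: 1 for exactly reproducing M2's
# demonstrated behaviour, 0 for anything else (hypothetical helper, not
# something from the original discussion).
def reward(output, demonstrated_output):
    return 1.0 if output == demonstrated_output else 0.0
```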
What is pM(X∗)?
pM(X∗) is the probability of outputting X∗ (where pM is a stochastic policy)
M2(“How good is answer X to Y?”) · ∇ log pM(X)
This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
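For concreteness, here is a minimal toy sketch of that estimator, assuming a softmax policy over a small discrete set of candidate answers and with M2’s rating stubbed out as a fixed function (all names and numbers here are made up for the example, not part of the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

num_answers = 5
logits = np.zeros(num_answers)  # parameters of the stochastic policy p_M

def m2_rating(x):
    # Stand-in for M2("How good is answer X to Y?"); here answer 3 is "best".
    return 1.0 if x == 3 else 0.0

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

learning_rate = 0.5
for step in range(200):
    probs = softmax(logits)
    x = rng.choice(num_answers, p=probs)  # sample X ~ p_M
    r = m2_rating(x)                      # M2's rating of X

    # Gradient of log p_M(X) with respect to the logits of a softmax policy.
    grad_log_p = -probs
    grad_log_p[x] += 1.0

    # REINFORCE update: scale the log-probability gradient by the rating,
    # so highly rated answers become more likely.
    logits += learning_rate * r * grad_log_p

print(softmax(logits))  # probability mass should concentrate on answer 3
```

After a couple hundred steps the policy’s probability mass concentrates on the highly rated answer, which is exactly the “increase the log probability of actions that were rated highly” behaviour described above.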