I don’t understand why we want to find this X* in the imitation learning case.
Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X*, let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.
The reverse mapping (imitation to RL) just consists of applying reward 1 to M2’s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.
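As a rough sketch of that direction (illustrative names only; `demonstrated_output` stands in for whatever M2 demonstrates, e.g. the results of a safe search):

```python
# Sketch of the imitation-to-RL reward: 1 for exactly reproducing M2's
# demonstrated behaviour, 0 for anything else (hypothetical helper, not
# something from the original discussion).
def reward(output, demonstrated_output):
    return 1.0 if output == demonstrated_output else 0.0
```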
What is pM(X∗)?
pM(X∗) is the probability of outputting X∗ (where pM is a stochastic policy)
M2(“How good is answer X to Y?”) · ∇ log pM(X)
This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
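For concreteness, here is a minimal toy sketch of that estimator, assuming a softmax policy over a small discrete set of candidate answers and with M2’s rating stubbed out as a fixed function (all names and numbers here are made up for the example, not part of the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

num_answers = 5
logits = np.zeros(num_answers)  # parameters of the stochastic policy p_M

def m2_rating(x):
    # Stand-in for M2("How good is answer X to Y?"); here answer 3 is "best".
    return 1.0 if x == 3 else 0.0

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

learning_rate = 0.5
for step in range(200):
    probs = softmax(logits)
    x = rng.choice(num_answers, p=probs)  # sample X ~ p_M
    r = m2_rating(x)                      # M2's rating of X

    # Gradient of log p_M(X) with respect to the logits of a softmax policy.
    grad_log_p = -probs
    grad_log_p[x] += 1.0

    # REINFORCE update: scale the log-probability gradient by the rating,
    # so highly rated answers become more likely.
    logits += learning_rate * r * grad_log_p

print(softmax(logits))  # probability mass should concentrate on answer 3
```

After a couple hundred steps the policy’s probability mass concentrates on the highly rated answer, which is exactly the “increase the log probability of actions that were rated highly” behaviour described above.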