It’s the probability that the model M we’re training assigns to the best answer X*. (M outputs a probability distribution over D.)
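To make that concrete, here’s a minimal PyTorch-style sketch of computing the probability M assigns to a specific answer X*, as a sum of per-token log-probabilities. It assumes a Hugging Face-style causal LM whose forward call returns `.logits`; the function name and argument names are illustrative, not from the original.

```python
import torch

def answer_log_prob(model, prompt_ids, answer_ids):
    """log P_M(X* | prompt): the log-probability the model assigns to answer X*.

    prompt_ids, answer_ids: 1-D tensors of token ids.
    Assumes an HF-style causal LM where model(input_ids).logits has shape
    (batch, seq_len, vocab_size).
    """
    input_ids = torch.cat([prompt_ids, answer_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Position t predicts token t+1, so the answer tokens are predicted by
    # positions (answer_start - 1) .. (seq_len - 2).
    answer_start = prompt_ids.shape[-1]
    pred_positions = torch.arange(answer_start - 1, input_ids.shape[-1] - 1)
    token_log_probs = log_probs[0, pred_positions, answer_ids]
    # Summing log-probabilities = multiplying per-token probabilities.
    return token_log_probs.sum()
```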
The next one is the standard REINFORCE method for doing RL with a reward signal that you cannot differentiate through (i.e. basically all RL). If you apply that equation to many different possible Xs, you’re increasing the probability that M assigns to high-reward answers, and decreasing the probability that it assigns to low-reward answers.
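Here’s a hedged sketch of the corresponding REINFORCE surrogate loss, building on the `answer_log_prob` helper above. The `baseline` argument and the function name are illustrative additions; the key point is that the reward only scales the log-probability term, so no gradient ever flows through the reward itself.

```python
def reinforce_loss(model, prompt_ids, sampled_answer_ids, reward, baseline=0.0):
    """Score-function (REINFORCE) surrogate loss for one sampled answer X.

    Minimizing this loss increases the probability M assigns to answers whose
    reward is above the baseline and decreases it for answers below the baseline.
    """
    log_prob = answer_log_prob(model, prompt_ids, sampled_answer_ids)
    return -(reward - baseline) * log_prob
```

In practice you would sample many different Xs from M, average this loss over them, and take an optimizer step on the result; the baseline (e.g. the mean reward of the batch) just reduces the variance of the gradient estimate without changing its expectation.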