It’s the probability that the model M we’re training assigns to the best answer X*. (M outputs a probability distribution over D.)
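To make that concrete, here’s a minimal PyTorch-style sketch of computing the probability M assigns to a specific answer X*, as a sum of per-token log-probabilities. It assumes a Hugging Face-style causal LM whose forward call returns `.logits`; the function name and argument names are illustrative, not from the original.

```python
import torch

def answer_log_prob(model, prompt_ids, answer_ids):
    """log P_M(X* | prompt): the log-probability the model assigns to answer X*.

    prompt_ids, answer_ids: 1-D tensors of token ids.
    Assumes an HF-style causal LM where model(input_ids).logits has shape
    (batch, seq_len, vocab_size).
    """
    input_ids = torch.cat([prompt_ids, answer_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Position t predicts token t+1, so the answer tokens are predicted by
    # positions (answer_start - 1) .. (seq_len - 2).
    answer_start = prompt_ids.shape[-1]
    pred_positions = torch.arange(answer_start - 1, input_ids.shape[-1] - 1)
    token_log_probs = log_probs[0, pred_positions, answer_ids]
    # Summing log-probabilities = multiplying per-token probabilities.
    return token_log_probs.sum()
```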
The next one is the standard REINFORCE method for doing RL with a reward signal that you cannot differentiate through (i.e. basically all RL). If you apply that equation to many different possible Xs, you’re increasing the probability that M assigns to high-reward answers, and decreasing the probability that it assigns to low-reward answers.
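Here’s a hedged sketch of the corresponding REINFORCE surrogate loss, building on the `answer_log_prob` helper above. The `baseline` argument and the function name are illustrative additions; the key point is that the reward only scales the log-probability term, so no gradient ever flows through the reward itself.

```python
def reinforce_loss(model, prompt_ids, sampled_answer_ids, reward, baseline=0.0):
    """Score-function (REINFORCE) surrogate loss for one sampled answer X.

    Minimizing this loss increases the probability M assigns to answers whose
    reward is above the baseline and decreases it for answers below the baseline.
    """
    log_prob = answer_log_prob(model, prompt_ids, sampled_answer_ids)
    return -(reward - baseline) * log_prob
```

In practice you would sample many different Xs from M, average this loss over them, and take an optimizer step on the result; the baseline (e.g. the mean reward of the batch) just reduces the variance of the gradient estimate without changing its expectation.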