I was excited to see this post since I’m having some similar puzzles, but I’m still quite confused after reading this.
We want to find the answer X*, that is, the answer in D which maximizes the approval of the amplified overseer, M2(“How good is answer X to Y?”).
I don’t understand why we want to find this X* in the imitation learning case. For imitation learning, don’t we want to produce a distilled model that would imitate M2, i.e., give the same answer to Y as what M2 would give? If M2, upon input Y, only does a limited search over D (let’s say because of concerns about safety) and therefore would not output the answer that maximizes M2(“How good is answer X to Y?”) in an absolute/unbounded sense, then don’t we want to reproduce that behavior for imitation learning?
$\nabla p_M(X^*)$
What is pM(X∗)?
$M_2(\text{“How good is answer } X \text{ to } Y\text{?”}) \cdot \nabla \log p_M(X)$
Can you explain this a bit more too? It might be apparent once I know what pM(X∗) is, but just in case...
It’s the probability that the model M that we’re training assigns to the best answer X*. (M is outputting a probability distribution over D.)
The next one is the standard REINFORCE method for doing RL with a reward signal that you cannot differentiate through (i.e. basically all RL). If you apply that equation to many different possible Xs, you’re increasing the probability that M assigns to high-reward answers, and decreasing the probability that it assigns to low-reward answers.
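To make that update concrete, here is a minimal sketch of REINFORCE in PyTorch. Everything in it (the toy answer set `D`, the `overseer_score` stand-in for M2(“How good is answer X to Y?”), and the training loop) is my own illustrative scaffolding, not anything from the post:

```python
import torch

# Toy setup (all names are illustrative): D is a small finite answer set,
# `logits` parameterize M's distribution p_M over D, and overseer_score
# stands in for M2("How good is answer X to Y?").
D = ["answer_0", "answer_1", "answer_2"]
logits = torch.zeros(len(D), requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

def overseer_score(x: str) -> float:
    # Placeholder for the amplified overseer's rating of answer x to question Y.
    return 1.0 if x == "answer_2" else 0.0

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()                        # sample X ~ p_M
    reward = overseer_score(D[idx.item()])     # non-differentiable rating of X
    # REINFORCE: the gradient of E[reward] is E[reward * grad log p_M(X)],
    # so we minimize -reward * log p_M(X) on the sampled answer.
    loss = -reward * dist.log_prob(idx)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that the gradient only flows through log p_M(X); the reward itself is treated as an opaque scalar, which is why this works for reward signals you cannot differentiate through.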
I don’t understand why we want to find this X* in the imitation learning case.
Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X*; let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.
The reverse mapping (imitation to RL) just consists of applying reward 1 to M2’s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.
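As a toy sketch of the two directions of that mapping (the function names and signatures below are mine, just to pin down the idea):

```python
from typing import Callable, Iterable

# RL framed as imitation (in the unlimited-compute limit): the demonstrator
# exhaustively searches D for the X* that maximizes
# M2("How good is answer X to Y?"), and the distilled model is trained to
# reproduce that X*.
def exhaustive_imitation_target(D: Iterable[str],
                                overseer_score: Callable[[str], float]) -> str:
    return max(D, key=overseer_score)

# Imitation framed as RL: reward 1 for exactly reproducing M2's demonstrated
# answer (e.g. the result of "execute some safe search and return the
# results"), reward 0 for everything else.
def imitation_reward(x: str, demonstrated_answer: str) -> float:
    return 1.0 if x == demonstrated_answer else 0.0
```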
What is pM(X∗)?
pM(X∗) is the probability of outputting X∗ (where pM is a stochastic policy)
$M_2(\text{“How good is answer } X \text{ to } Y\text{?”}) \cdot \nabla \log p_M(X)$
This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
I’m talking about an imitation version where the human you’re imitating is allowed to do anything they want, including instantiating a search over all possible outputs X and taking the one that maximizes the score of “How good is answer X to Y?”, in order to find X*. So I’m more pointing out that this behaviour is available in imitation by default. We could try to rule it out by instructing the human to only do limited searches, but that might be hard to do while maintaining the capabilities of the system, and we need to figure out what “safe limited search” actually looks like.