I was excited to see this post since I’m having some similar puzzles, but I’m still quite confused after reading this.
We want to find the answer X*, that is, the answer in D which maximizes the approval of the amplified overseer, M2(“How good is answer X to Y?”).
I don’t understand why we want to find this X* in the imitation learning case. For imitation learning, don’t we want to produce a distilled model that would imitate M2, i.e., give the same answer to Y as what M2 would give? If M2, upon input Y, only does a limited search over D (let’s say because of concerns about safety) and therefore would not output the answer that maximizes M2(“How good is answer X to Y?”) in an absolute/unbounded sense, then don’t we want to reproduce that behavior for imitation learning?
$\nabla p_M(X^*)$
What is pM(X∗)?
$M_2(\text{“How good is answer } X \text{ to } Y\text{?”}) \cdot \nabla \log p_M(X)$
Can you explain this a bit more too? It might be apparent once I know what pM(X∗) is, but just in case...
It’s the probability that the model M that we’re training assigns to the best answer X*. (M is outputting a probability distribution over D.)
The next one is the standard REINFORCE method for doing RL with a reward signal that you cannot differentiate through (i.e. basically all RL). If you apply that equation to many different possible Xs, you’re increasing the probability that M assigns to high-reward answers, and decreasing the probability that it assigns to low-reward answers.
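To make that update concrete, here is a minimal sketch of REINFORCE in PyTorch. Everything in it (the toy answer set `D`, the `overseer_score` stand-in for M2(“How good is answer X to Y?”), and the training loop) is my own illustrative scaffolding, not anything from the post:

```python
import torch

# Toy setup (all names are illustrative): D is a small finite answer set,
# `logits` parameterize M's distribution p_M over D, and overseer_score
# stands in for M2("How good is answer X to Y?").
D = ["answer_0", "answer_1", "answer_2"]
logits = torch.zeros(len(D), requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

def overseer_score(x: str) -> float:
    # Placeholder for the amplified overseer's rating of answer x to question Y.
    return 1.0 if x == "answer_2" else 0.0

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()                        # sample X ~ p_M
    reward = overseer_score(D[idx.item()])     # non-differentiable rating of X
    # REINFORCE: the gradient of E[reward] is E[reward * grad log p_M(X)],
    # so we minimize -reward * log p_M(X) on the sampled answer.
    loss = -reward * dist.log_prob(idx)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that the gradient only flows through log p_M(X); the reward itself is treated as an opaque scalar, which is why this works for reward signals you cannot differentiate through.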
I don’t understand why we want to find this X* in the imitation learning case.
Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X*; let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.
The reverse mapping (imitation to RL) just consists of applying reward 1 to M2’s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.
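As a toy sketch of the two directions of that mapping (the function names and signatures below are mine, just to pin down the idea):

```python
from typing import Callable, Iterable

# RL framed as imitation (in the unlimited-compute limit): the demonstrator
# exhaustively searches D for the X* that maximizes
# M2("How good is answer X to Y?"), and the distilled model is trained to
# reproduce that X*.
def exhaustive_imitation_target(D: Iterable[str],
                                overseer_score: Callable[[str], float]) -> str:
    return max(D, key=overseer_score)

# Imitation framed as RL: reward 1 for exactly reproducing M2's demonstrated
# answer (e.g. the result of "execute some safe search and return the
# results"), reward 0 for everything else.
def imitation_reward(x: str, demonstrated_answer: str) -> float:
    return 1.0 if x == demonstrated_answer else 0.0
```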
What is pM(X∗)?
pM(X∗) is the probability of outputting X∗ (where pM is a stochastic policy)
$M_2(\text{“How good is answer } X \text{ to } Y\text{?”}) \cdot \nabla \log p_M(X)$
This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
I’m talking about an imitation version where the human you’re imitating is allowed to do anything they want, including instantiating a search over all possible outputs X and taking the one that maximizes the score of “How good is answer X to Y?”, in order to find X*. So I’m more pointing out that this behaviour is available in imitation by default. We could try to rule it out by instructing the human to only do limited searches, but that might be hard to do while maintaining the capabilities of the system, and we need to figure out what “safe limited search” actually looks like.