Rohin Shah comments on Reinforcement Learning in the Iterated Amplification Framework

Rohin Shah 13 Feb 2019 3:28 UTC
LW: 6 AF: 3
AF
I agree with Wei Dai that the schemes you’re describing do not sound like imitation learning. Both of the schemes you describe sound to me like RL-IA. The scheme that you call imitation-IA seems like a combination random search + gradients method of doing RL. There’s an exactly analogous RL algorithm for the normal RL setting—just take the algorithm you have, and replace all instances of M2(“How good is answer X to Y?”) with $r (X)$ , where $r$ is the reward function.
One way that you could do imitation-IA would be to compute $X * = M_{2} (Y)$ a bunch of times to get a dataset ${(Y_{i}, X_{i} *)}$ and train $M$ on that dataset.
I am also not sure exactly what it means to use RL in iterated amplification. There are two different possibilities I could imagine:
- Using a combination of IRL + RL to achieve the same effect as imitation learning. The hope here would be that IRL + RL provides a better inductive bias for imitation learning, helping with sample efficiency.
- Instead of asking the amplified model to compute $M (Y)$ directly, we ask it to provide a measure of approval, e.g. by asking “How good is answer X to Y?”, or by asking “Which is a better answer to Y, X1 or X2?” and learning from that signal (see optimizing with comparisons), using some arbitrary RL algorithm.
I’m quite confident that RL+IA is not meant to be the first kind. But even with the second kind, one question does arise—typically with RL we’re trying to optimize the sum of rewards across time, whereas here we actually only want to optimize the one-step reward that you get immediately (which is the point of maximizing approval and having a stronger overseer). So then I don’t really see why you want RL, which typically is solving a hard credit assignment problem that doesn’t arise in the one-step setting.
What links here?
- Outer alignment and imitative amplification by evhub (10 Jan 2020 0:26 UTC; 24 points)
- William_S's comment on Outer alignment and imitative amplification by evhub (16 Feb 2020 1:04 UTC; 6 points)
- paulfchristiano 13 Feb 2019 5:21 UTC
  LW: 8 AF: 4
  AF Parent
  I am also not sure exactly what it means to use RL in iterated amplification.
  You can use RL for the distillation step. (I usually mention RL as the intended distillation procedure when I describe the scheme, except perhaps in the AGZ analogy post.)
  So then I don’t really see why you want RL, which typically is solving a hard credit assignment problem that doesn’t arise in the one-step setting.
  The algorithm still needs reinforce and a value function baseline (since you need to e.g. output words one at a time), and “RL” seems like the normal way to talk about that algorithm/problem. We you could instead call it “contextual bandits.”
  You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it’s generic RL.
  Using a combination of IRL + RL to achieve the same effect as imitation learning.
  Does “imitation learning” refer to an autoregressive model here? I think of IRL+RL a possible mechanism for imitation learning, and it’s normally the kind of algorithm I have in mind when talking about “imitation learning” (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)
  - Rohin Shah 13 Feb 2019 7:06 UTC
    LW: 7 AF: 4
    AF Parent
    You can use RL for the distillation step.
    Yeah, I know, my main uncertainty was with how exactly that cashes out into an algorithm (in particular, RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in).
    The algorithm still needs reinforce and a value function baseline (since you need to e.g. output words one at a time), and “RL” seems like the normal way to talk about that algorithm/problem. We you could instead call it “contextual bandits.”
    I get the need for reinforce, I’m not sure I understand the value function baseline part.
    Here’s a thing you might be saying that would explain the value function baseline: this problem is equivalent to a sparse-reward RL problem, where:
    The states are the question + in-progress answer
    The actions are “append the word w to the answer”
    All actions produce zero reward except for the action that ends the answer, which produces reward equal to the overseer’s answer to “How good is answer <answer> to question <question>?”
    And we can apply RL algorithms to this problem.
    Is that equivalent to what you’re saying?
    You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it’s generic RL.
    Just to make sure I’m understanding correctly, this is recursive reward modeling, right?
    Does “imitation learning” refer to an autoregressive model here? I think of IRL+RL a possible mechanism for imitation learning, and it’s normally the kind of algorithm I have in mind when talking about “imitation learning” (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)
    Yeah, that was bad wording on my part. I was using “imitation learning” to refer both to the problem of imitating the behavior of an agent, as well as the particular mechanism of behavioral cloning, i.e. collecting a dataset of many question-answer pairs and performing gradient descent using e.g. cross-entropy loss.
    I agree that IRL + RL is a possible mechanism for imitation learning, in the same way that behavioral cloning is a possible mechanism for imitation learning. (This is why I was pretty confident that my first option was not the right one.)
    - William_S 18 Feb 2019 21:09 UTC
      3 points
      Parent
      RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in).
      I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision making problems. In this view The feature that makes RL different from supervised learning is not that we’re trying to figure out what how to act in an MDP/POMDP, but instead that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).
      - Rohin Shah 19 Feb 2019 1:19 UTC
        2 points
        Parent
        Got it, thanks for clarifying.
  - Rohin Shah 13 Feb 2019 6:35 UTC
    LW: 2 AF: 1
    AF Parent
    I’m seeing a one-hour old empty comment, I assume it got accidentally deleted somehow?
    ETA: Nvm, I can see it on LessWrong, but not on the Alignment Forum.