I am also not sure exactly what it means to use RL in iterated amplification.
You can use RL for the distillation step. (I usually mention RL as the intended distillation procedure when I describe the scheme, except perhaps in the AGZ analogy post.)
So then I don’t really see why you want RL, which typically involves solving a hard credit assignment problem that doesn’t arise in the one-step setting.
The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and “RL” seems like the normal way to talk about that algorithm/problem. You could instead call it “contextual bandits.”
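Roughly, the estimator being referred to here looks like the following (the notation is illustrative, not something from this exchange): with the answer emitted one word at a time and the overseer scoring only the completed answer, a single sampled answer gives the gradient estimate

$$\nabla_\theta J(\theta) \;\approx\; \sum_{t} \big(R - V_\psi(s_t)\big)\, \nabla_\theta \log \pi_\theta(w_t \mid s_t),$$

where $s_t$ is the question plus the words produced so far, $w_t$ is the next word, $R$ is the overseer’s single score for the finished answer, and $V_\psi$ is a learned value-function baseline whose only role is to reduce the variance of the estimate.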
You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it’s generic RL.
Using a combination of IRL + RL to achieve the same effect as imitation learning.
Does “imitation learning” refer to an autoregressive model here? I think of IRL+RL as a possible mechanism for imitation learning, and it’s normally the kind of algorithm I have in mind when talking about “imitation learning” (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)
Yeah, I know; my main uncertainty was about how exactly that cashes out into an algorithm (in particular, RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in).
The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and “RL” seems like the normal way to talk about that algorithm/problem. You could instead call it “contextual bandits.”
I get the need for REINFORCE, but I’m not sure I understand the value function baseline part.
Here’s a thing you might be saying that would explain the value function baseline: this problem is equivalent to a sparse-reward RL problem, where:
The states are the question + in-progress answer
The actions are “append the word w to the answer”
All actions produce zero reward except for the action that ends the answer, which produces reward equal to the overseer’s answer to “How good is answer <answer> to question <question>?”
And we can apply RL algorithms to this problem.
Is that equivalent to what you’re saying?
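For concreteness, that formulation might be sketched in code roughly as follows (the toy vocabulary, the random_overseer stand-in, and the random agent are all placeholders; a real setup would sample words from the learned policy and query the actual overseer):

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]  # toy vocabulary (placeholder)

class AnswerEnv:
    """Sparse-reward MDP: state = question + in-progress answer,
    action = append a word, reward only when the answer ends."""

    def __init__(self, question, overseer_score, max_len=20):
        self.question = question
        self.overseer_score = overseer_score  # black-box overseer (placeholder)
        self.max_len = max_len

    def reset(self):
        self.answer = []
        return (self.question, tuple(self.answer))

    def step(self, word):
        self.answer.append(word)
        done = word == "<eos>" or len(self.answer) >= self.max_len
        if done:
            # Only the episode-ending action is rewarded, with the overseer's
            # answer to "How good is answer <answer> to question <question>?"
            reward = self.overseer_score(self.question, " ".join(self.answer))
        else:
            reward = 0.0
        return (self.question, tuple(self.answer)), reward, done

def random_overseer(question, answer):
    return random.random()  # stand-in for the (non-differentiable) overseer

env = AnswerEnv("What is 2 + 2?", random_overseer)
state, done = env.reset(), False
while not done:
    word = random.choice(VOCAB)  # a real agent would sample from pi_theta(w | state)
    state, reward, done = env.step(word)
print("overseer's reward for the finished answer:", reward)
```

Running REINFORCE with a value-function baseline against an environment like this would be one way to instantiate the algorithm described above.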
You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it’s generic RL.
Does “imitation learning” refer to an autoregressive model here? I think of IRL+RL as a possible mechanism for imitation learning, and it’s normally the kind of algorithm I have in mind when talking about “imitation learning” (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)
Yeah, that was bad wording on my part. I was using “imitation learning” to refer both to the problem of imitating the behavior of an agent and to the particular mechanism of behavioral cloning, i.e. collecting a dataset of many question-answer pairs and performing gradient descent using e.g. a cross-entropy loss.
I agree that IRL + RL is a possible mechanism for imitation learning, in the same way that behavioral cloning is a possible mechanism for imitation learning. (This is why I was pretty confident that my first option was not the right one.)
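To make the two mechanisms concrete, here is a toy, one-step contrast between behavioral cloning and an IRL+RL (GAIL-style) loop; everything below, including the fake expert, network sizes, and hyperparameters, is illustrative rather than part of either proposal:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

STATE_DIM, N_ACTIONS, N = 4, 3, 512

def expert_action(states):
    # Toy "expert" to be imitated: pick the index of the largest of the
    # first N_ACTIONS coordinates of the state (purely illustrative).
    return states[:, :N_ACTIONS].argmax(dim=1)

states = torch.randn(N, STATE_DIM)
expert_actions = expert_action(states)

# --- Mechanism 1: behavioral cloning (supervised learning on expert pairs). ---
bc_policy = torch.nn.Linear(STATE_DIM, N_ACTIONS)
bc_opt = torch.optim.Adam(bc_policy.parameters(), lr=1e-2)
for _ in range(200):
    loss = F.cross_entropy(bc_policy(states), expert_actions)
    bc_opt.zero_grad(); loss.backward(); bc_opt.step()

# --- Mechanism 2: IRL + RL (GAN-style): learn a reward that separates expert
# pairs from policy pairs, then optimize the policy against it with RL. ---
policy = torch.nn.Linear(STATE_DIM, N_ACTIONS)
discrim = torch.nn.Linear(STATE_DIM + N_ACTIONS, 1)  # "is this (s, a) from the expert?"
pol_opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
dis_opt = torch.optim.Adam(discrim.parameters(), lr=1e-2)

def pair_features(s, a):
    return torch.cat([s, F.one_hot(a, N_ACTIONS).float()], dim=1)

for _ in range(200):
    # Reward-learning ("IRL") step: train the discriminator.
    with torch.no_grad():
        sampled = Categorical(logits=policy(states)).sample()
    expert_logits = discrim(pair_features(states, expert_actions))
    policy_logits = discrim(pair_features(states, sampled))
    d_loss = (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
              + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))
    dis_opt.zero_grad(); d_loss.backward(); dis_opt.step()

    # RL step: REINFORCE with reward log D(s, a), i.e. make the policy's
    # pairs look expert-like; a one-step (contextual bandit) RL problem.
    dist = Categorical(logits=policy(states))
    actions = dist.sample()
    with torch.no_grad():
        reward = F.logsigmoid(discrim(pair_features(states, actions))).squeeze(1)
    pg_loss = -(dist.log_prob(actions) * (reward - reward.mean())).mean()
    pol_opt.zero_grad(); pg_loss.backward(); pol_opt.step()
```

Both loops aim at the same effect, a policy that imitates the expert; the first never touches RL machinery, while the second is the “learn a reward, then run RL against it” pattern referred to above.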
RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in.
I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision-making problems. On this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).
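In symbols (again, the notation here is just illustrative): a supervised learner differentiates its loss directly through the model’s outputs, whereas the objective in this broader class contains a term we can only evaluate, so we fall back on the score-function (REINFORCE-style) identity

$$\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}\big[R(x, a)\big] \;=\; \mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}\big[R(x, a)\, \nabla_\theta \log \pi_\theta(a \mid x)\big],$$

which needs only evaluations of $R$, never its derivative, whether $R$ is an environment return in an MDP or the overseer’s approval of a single answer.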
You can use RL for the distillation step. (I usually mention RL as the intended distillation procedure when I describe the scheme, except perhaps in the AGZ analogy post.)
The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and “RL” seems like the normal way to talk about that algorithm/problem. You could instead call it “contextual bandits.”
You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it’s generic RL.
Does “imitation learning” refer to an autoregressive model here? I think of IRL+RL as a possible mechanism for imitation learning, and it’s normally the kind of algorithm I have in mind when talking about “imitation learning” (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)
Yeah, I know; my main uncertainty was about how exactly that cashes out into an algorithm (in particular, RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in).
I get the need for REINFORCE, but I’m not sure I understand the value function baseline part.
Here’s a thing you might be saying that would explain the value function baseline: this problem is equivalent to a sparse-reward RL problem, where:
The states are the question + in-progress answer
The actions are “append the word w to the answer”
All actions produce zero reward except for the action that ends the answer, which produces reward equal to the overseer’s answer to “How good is answer <answer> to question <question>?”
And we can apply RL algorithms to this problem.
Is that equivalent to what you’re saying?
Just to make sure I’m understanding correctly, this is recursive reward modeling, right?
Yeah, that was bad wording on my part. I was using “imitation learning” to refer both to the problem of imitating the behavior of an agent and to the particular mechanism of behavioral cloning, i.e. collecting a dataset of many question-answer pairs and performing gradient descent using e.g. a cross-entropy loss.
I agree that IRL + RL is a possible mechanism for imitation learning, in the same way that behavioral cloning is a possible mechanism for imitation learning. (This is why I was pretty confident that my first option was not the right one.)
I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision-making problems. On this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).
Got it, thanks for clarifying.