Yeah, I know, my main uncertainty was with how exactly that cashes out into an algorithm (in particular, RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in).
The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and “RL” seems like the normal way to talk about that algorithm/problem. You could instead call it “contextual bandits.”
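(For concreteness, here is a minimal sketch of what “REINFORCE with a value function baseline” could look like when the answer is emitted one word at a time and the only reward is the overseer’s score for the finished answer. The policy/value interfaces, the eos_token_id attribute, and the shared optimizer are illustrative assumptions, not anything specified in this exchange.)

```python
# Illustrative sketch only: one REINFORCE-with-baseline update for a policy that
# emits an answer token by token and receives a single scalar overseer score at
# the end. All interfaces here are assumptions made for the example.
import torch

def reinforce_with_baseline_update(policy, value_fn, overseer_score, optimizer, question):
    # policy(question, prefix) -> torch.distributions.Categorical over next tokens
    # value_fn(question, prefix) -> scalar tensor estimating the eventual overseer score
    # overseer_score(question, answer_tokens) -> float ("How good is this answer?")
    prefix, log_probs, baselines = [], [], []
    while True:
        dist = policy(question, prefix)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        baselines.append(value_fn(question, prefix))
        prefix.append(token)
        if token.item() == policy.eos_token_id:  # assumed attribute marking "end the answer"
            break

    reward = overseer_score(question, prefix)             # only nonzero signal, at the very end
    advantages = reward - torch.stack(baselines)          # baseline reduces gradient variance
    policy_loss = -(advantages.detach() * torch.stack(log_probs)).sum()
    value_loss = advantages.pow(2).sum()                  # regress baseline toward the final reward

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```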
I get the need for REINFORCE, but I’m not sure I understand the value function baseline part.
Here’s a thing you might be saying that would explain the value function baseline: this problem is equivalent to a sparse-reward RL problem, where:
The states are the question + in-progress answer
The actions are “append the word w to the answer”
All actions produce zero reward except for the action that ends the answer, which produces reward equal to the overseer’s answer to “How good is answer <answer> to question <question>?”
And we can apply RL algorithms to this problem.
Is that equivalent to what you’re saying?
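(Here is one way to write that formulation down literally, as a gym-style environment. The overseer is just an assumed callable, and the interface is only meant to mirror the bullet points above.)

```python
# Sketch of the sparse-reward problem described above, written as a gym-style
# environment. The overseer interface is an assumption for illustration.
class AnswerGenerationEnv:
    def __init__(self, question, overseer, eos_token="<eos>"):
        self.question = question
        self.overseer = overseer              # overseer(question, answer_text) -> float
        self.eos_token = eos_token
        self.answer = []

    def reset(self):
        self.answer = []                      # state = question + in-progress answer
        return (self.question, tuple(self.answer))

    def step(self, word):
        """Action = 'append the word w to the answer'."""
        if word == self.eos_token:
            # Only the answer-ending action is rewarded: the overseer's rating of
            # "How good is answer <answer> to question <question>?"
            reward = self.overseer(self.question, " ".join(self.answer))
            return (self.question, tuple(self.answer)), reward, True, {}
        self.answer.append(word)
        return (self.question, tuple(self.answer)), 0.0, False, {}
```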
You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question), in which case it’s generic RL.
Does “imitation learning” refer to an autoregressive model here? I think of IRL+RL as a possible mechanism for imitation learning, and it’s normally the kind of algorithm I have in mind when talking about “imitation learning” (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)
Yeah, that was bad wording on my part. I was using “imitation learning” to refer both to the problem of imitating the behavior of an agent and to the particular mechanism of behavioral cloning, i.e. collecting a dataset of many question-answer pairs and performing gradient descent using e.g. cross-entropy loss.
I agree that IRL + RL is a possible mechanism for imitation learning, in the same way that behavioral cloning is a possible mechanism for imitation learning. (This is why I was pretty confident that my first option was not the right one.)
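(A minimal sketch of behavioral cloning in that sense: supervised learning on question-answer pairs with a cross-entropy loss. The model interface, tokenization, and shapes are assumptions; any autoregressive model trained this way would fit the description.)

```python
# Illustrative behavioral-cloning step: fit the demonstrator's answers with a
# cross-entropy loss. Assumes model(tokens) returns next-token logits of shape
# (batch, seq_len, vocab); everything else is standard supervised learning.
import torch
import torch.nn.functional as F

def behavioral_cloning_step(model, optimizer, question_tokens, answer_tokens):
    # Condition on the question plus the answer-so-far; predict the next answer token.
    inputs = torch.cat([question_tokens, answer_tokens[:-1]])
    logits = model(inputs.unsqueeze(0)).squeeze(0)        # (seq_len, vocab)
    answer_logits = logits[-len(answer_tokens):]          # positions that predict answer tokens
    loss = F.cross_entropy(answer_logits, answer_tokens)  # cross-entropy against the demo answer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```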
I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems, including both one-shot bandit problems and sequential decision-making problems. In this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).
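(For concreteness, this is the usual score-function / REINFORCE identity: the gradient of the expected score only requires differentiating the policy’s log-probabilities, never the overseer or the environment.)

$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\big[R(a)\big] \;=\; \mathbb{E}_{a \sim \pi_\theta}\big[R(a)\, \nabla_\theta \log \pi_\theta(a)\big]$$

Here $R$ is the non-differentiable score (the environment’s return, or the overseer’s rating), and only $\log \pi_\theta$ needs to be differentiable in $\theta$.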
Just to make sure I’m understanding correctly, this setup (RL against the overseer’s evaluation of the final answer) is recursive reward modeling, right?
Got it, thanks for clarifying.