I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.
I agree with this in a sense, although I may be quite a bit harsher about what counts as “executing an action”. For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as “executing the action” in the overseer-conversation environment, even if the action looks like it’s for some other environment, like a plan to launch a new product in the market. I do think myopia in this environment would suffice for existential safety, but I don’t know how much myopia we need.
If you’re always talking about myopic/process-based RLAIF when you say RLAIF, then I think what you’re saying is defensible. I speculate that not everyone reading this recognizes that your usage of RLAIF implies RLAIF with a level of myopia that matches current instances of RLAIF, and that that is a load-bearing part of your position.
I say “defensible” instead of fully agreeing because I weakly disagree that increasing compute is a more dangerous way to improve performance than modifying the objective to a new myopic objective. That is, I disagree with this:
I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models
You suggest that increasing compute is the last thing we should do if we’re looking for performance improvements, as opposed to adding a very myopic approval-seeking objective. I don’t see it. I think changing the objective away from imitation learning is more likely to lead to problems than scaling up the imitation learners. But this is probably beside the point, because I don’t think problems are particularly likely in either case.