In particular, if the sample efficiency of RL increases with model size, it might turn out that the optimal strategy for RLing early transformative models is to collect far fewer, much more expensive labels than people use when training current systems; I think people often neglect this possibility when thinking about the future of scalable oversight.
This paper found higher sample efficiency for larger reinforcement learning models (see Fig. 5 and section 5.5).
Thanks! That’s a multi-agent setup but still handy.