This paper seems like a great first stab at starting to answer very important questions about scalable oversight sample efficiency (at least for non-scheming models). Better understanding of sample efficiency seems like it should have a big effect on research prioritization for scalable oversight (and W2SG).
Despite this being a great start, I don’t feel like I yet know the answers to these questions. Future work would need to explore methods somewhat more[1], do cleaner scaling experiments by using a nicer model stack[2], and use more analogous datasets, particularly preference modeling scores on datasets of difficult agentic tasks. Also, it might eventually be a good idea to run experiments with human labelers in case there are important differences between weak human labels and weak LLM labels.
I think there are a few promising-seeming methods which weren’t tried, but I don’t have a good sense of what exploration was done beyond the methods listed in the body of the paper.
This might need to be done at labs, because I don’t think there is a great open-source model stack that goes all the way up to pretty powerful models. I think Llama-3 is the best, and it only has three models that were trained in mostly comparable ways. Pythia was pretty good on standardization, but doesn’t go up to sufficiently powerful models.