Cool paper, great to see the project worked out! (:
One question: How do you know the contractors weren’t just answering randomly (or were confused about the task) in your “quality after filtering” experiments (Table 4)? Is there agreement across contractors about the quality of completions (in case they saw the same completions)?
Thanks! :)
Good question. Surge ran some internal auditing processes for all our quality data collection. We also checked 100 random comparisons ourselves for an earlier round of data and they seemed reasonable: we only disagreed with 5 of them, and 4 of those were a disagreement between “equally good” / “equally bad” and an answer one way or the other. (There were another 10 that seemed a bit borderline.) We don’t have interrater reliability numbers here, though—that would be useful.
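If any of the comparisons were shown to more than one contractor, a quick agreement check could look something like the sketch below (the labels are invented for illustration; a kappa of 1.0 means perfect agreement, ~0 means chance-level):

```python
# Hypothetical interrater-reliability check: if two contractors rated the same
# comparisons, Cohen's kappa summarizes their agreement beyond chance.
# The labels below are made up for illustration, not real contractor data.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["A better", "tie", "B better", "A better", "tie"]
rater_2 = ["A better", "A better", "B better", "A better", "tie"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance level
```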
Would you be interested in running a wider array of few-shot performance benchmarks? No performance loss on few-shot generation is a bold but defensible claim, and it would be great to have stronger evidence for it. I’d be really interested in doing the legwork here if you’d find it useful.
Fantastic paper, makes a strong case that rejection sampling should be part of the standard toolkit for deploying pretrained LMs on downstream tasks.
Given that our classifier has only been trained on 3-sentence prompts + 1-sentence completions, do you think it can be applied to normal benchmarks? It may well transfer okay to other formats.
For sure, benchmarking can still be useful even if the classifier is less powerful than it could be. My main question is: How well does the generator model perform after rejection sampling? You could imagine that the rejection sampling process degrades output quality. But the initial results from your paper suggest the opposite: rejection sampling does not reduce human preference for generated outputs, so we would hope that it does not reduce benchmark performance either.
For example, MultiRC is a popular reading comprehension benchmark where the generator model takes a prompt consisting of a text passage, a question about the passage, and a list of possible answers. The generator then labels each answer as true or false, and is graded on its accuracy. Evaluation on MultiRC would show us how the classifier affects the generator’s QA skills.
GPT-Neo has already been evaluated on a wide suite of benchmarks. To show that rejection sampling is performance-competitive with unfiltered generation, you would not need to achieve SOTA performance on these benchmarks—you’d simply need to be competitive with unfiltered GPT-Neo.
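To make the comparison concrete, here’s a rough sketch of the kind of harness I have in mind; the prompt format and the two `generate_fn` callables (unfiltered GPT-Neo vs. the rejection-sampled version) are stand-ins I’m inventing here, not anything from the paper:

```python
# Sketch of a filtered-vs-unfiltered comparison on a MultiRC-style item: the
# model sees passage + question + one candidate answer and should output
# "True" or "False". Everything here is illustrative.

def make_prompt(passage, question, answer):
    return (
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n"
        "Is the candidate answer correct? Answer True or False: "
    )

def grade(generate_fn, items):
    """items: iterable of (passage, question, answer, label) tuples, label being 'True' or 'False'."""
    correct = total = 0
    for passage, question, answer, label in items:
        completion = generate_fn(make_prompt(passage, question, answer))
        words = completion.strip().split()
        predicted = words[0].strip(".,") if words else ""
        correct += int(predicted.lower() == label.lower())
        total += 1
    return correct / total

# Usage: grade(sample_unfiltered, multirc_items) vs. grade(sample_with_rejection, multirc_items),
# where the two callables wrap GPT-Neo without and with the classifier filter.
```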
My worry is that it’s not clear what exactly we would learn. We might see performance degradation just because the classifier has been trained on such a different distribution (including a different generator). Or it might be totally fine because almost none of those tasks involve frequent mention of injury, making the filtering trivial. Either way, IMO it would be unclear how the result would transfer to a more coherent setting.
(ETA: That said, I think it would be cool to train a new classifier on a more general language modeling task, possibly with a different notion of catastrophe, and then benchmark that!)
Coming back to this: your concern makes sense to me. Your proposal to train a new classifier so that filtered generation can be benchmarked on more general tasks seems very interesting. I think it might also be useful to simply provide a nice open-source implementation of rejection sampling for a popular generator like Facebook’s OPT-175B, so that future researchers can build on it.
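Concretely, I’m picturing something like the sketch below using Hugging Face transformers; the model names, classifier path, threshold, and the assumption that index 1 of the classifier head is the “injurious” class are all placeholders rather than the paper’s actual setup:

```python
# Minimal sketch of classifier-filtered (rejection) sampling. All model names,
# the classifier path, and the threshold are placeholders for illustration.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

gen_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
generator = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
# Placeholder path: any sequence-classification head that scores (prompt, completion) pairs.
clf_tok = AutoTokenizer.from_pretrained("path/to/injury-classifier")
classifier = AutoModelForSequenceClassification.from_pretrained("path/to/injury-classifier")


def filtered_generate(prompt, threshold=0.01, max_attempts=100):
    """Resample completions until the classifier judges one safe enough."""
    completion = ""
    for _ in range(max_attempts):
        inputs = gen_tok(prompt, return_tensors="pt")
        output = generator.generate(
            **inputs,
            do_sample=True,
            max_new_tokens=64,
            pad_token_id=gen_tok.eos_token_id,
        )
        completion = gen_tok.decode(
            output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        clf_inputs = clf_tok(prompt, completion, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Assumes a binary classifier whose index-1 logit is the "injurious" class.
            p_bad = torch.softmax(classifier(**clf_inputs).logits, dim=-1)[0, 1].item()
        if p_bad < threshold:
            return completion
    return completion  # fall back to the last sample if none passed the filter
```

In practice you’d probably want to batch the candidate completions and keep the best-scoring one rather than looping one at a time, but the loop keeps the sketch simple.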
I’m planning on working on technical AI safety full-time this summer. Right now I’m busy applying to a few different programs, but I’ll definitely follow up on this idea with you.