For sure, benchmarking can still be useful even if the classifier is less powerful than it could be. My main question is: How well does the generator model perform after rejection sampling? You could imagine that the rejection sampling process is destructive for output quality. But the initial results from your paper indicate the opposite—rejection sampling does not reduce human preference for generated outputs, so we would hope that it does not reduce benchmark performance either.
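To make the setup concrete, here is a minimal sketch of classifier-filtered rejection sampling, assuming a HuggingFace causal LM as the generator and some binary sequence classifier as the filter. The classifier path, the acceptance threshold, and the convention that label index 1 means "acceptable" are all placeholder assumptions, not details from the paper.

```python
# Minimal sketch of rejection sampling with a filtering classifier (illustrative only).
from typing import Optional

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

gen_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
generator = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

# Placeholder classifier: any binary model that scores (prompt, continuation) pairs.
clf_tok = AutoTokenizer.from_pretrained("path/to/filter-classifier")
classifier = AutoModelForSequenceClassification.from_pretrained("path/to/filter-classifier")


def rejection_sample(prompt: str, n_candidates: int = 16, threshold: float = 0.99) -> Optional[str]:
    """Sample several continuations and return the first one the classifier accepts."""
    inputs = gen_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = generator.generate(
            **inputs,
            do_sample=True,
            max_new_tokens=64,
            num_return_sequences=n_candidates,
            pad_token_id=gen_tok.eos_token_id,
        )
        for seq in outputs:
            continuation = gen_tok.decode(
                seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True
            )
            logits = classifier(
                **clf_tok(prompt, continuation, return_tensors="pt", truncation=True)
            ).logits
            # Assumes label index 1 means "acceptable"; adjust for the real classifier.
            if torch.softmax(logits, dim=-1)[0, 1].item() >= threshold:
                return continuation
    return None  # every candidate was rejected; the caller decides how to handle that
```

The benchmarking question is then: how much do scores change once every benchmark prompt is routed through a loop like this?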
For example, MultiRC is a popular reading comprehension benchmark where the generator model takes a prompt consisting of a text passage, a question about the passage, and a list of possible answers. The generator then labels each answer as true or false, and is graded on its accuracy. Evaluation on MultiRC would show us how the classifier affects the generator’s QA skills.
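As a rough sketch of how that evaluation could be wired up, the loop below uses the SuperGLUE copy of MultiRC on the HuggingFace hub together with the `rejection_sample` helper from the sketch above. The prompt template and the crude true/false extraction are illustrative assumptions, and a proper harness would score answer options by log-likelihood and report MultiRC's official metrics rather than raw accuracy.

```python
# Rough sketch of a MultiRC accuracy loop under filtered generation (prompt format is an assumption).
from datasets import load_dataset

multirc = load_dataset("super_glue", "multirc", split="validation")


def make_prompt(example) -> str:
    return (
        f"Passage: {example['paragraph']}\n"
        f"Question: {example['question']}\n"
        f"Candidate answer: {example['answer']}\n"
        "Is the candidate answer true or false? Answer:"
    )


def extract_label(completion: str) -> int:
    # Crude string matching; a real harness would compare option log-likelihoods instead.
    return 1 if "true" in completion.lower() else 0


correct = 0
for example in multirc:
    completion = rejection_sample(make_prompt(example))  # filtered generation from above
    if completion is not None and extract_label(completion) == example["label"]:
        correct += 1

print(f"MultiRC accuracy with filtered generation: {correct / len(multirc):.3f}")
```

Running the same loop with plain, unfiltered sampling gives the GPT-Neo baseline that the comparison below needs.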
GPT-Neo has already been evaluated on a wide suite of benchmarks. To show that rejection sampling is performance-competitive with unfiltered generation, you would not need to achieve SOTA performance on these benchmarks; you'd simply need to be competitive with unfiltered GPT-Neo.
My worry is that it’s not clear what exactly we would learn. We might get performance degradation just because the classifier has been trained on such a different distribution (including a different generator). Or it might be totally fine because almost none of those tasks involve frequent mention of injury, which would make the filtering step trivial. Either way, IMO it would be unclear how the result would transfer to a more coherent setting.
(ETA: That said, I think it would be cool to train a new classifier on a more general language modeling task, possibly with a different notion of catastrophe, and then benchmark that!)
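For what that could look like, here is a hedged sketch of fine-tuning a fresh filter classifier on (prompt, continuation) pairs labeled for whatever notion of catastrophe is chosen. The base model, the CSV path, the column names, and the hyperparameters are all placeholders.

```python
# Hedged sketch: fine-tune a new binary filter classifier on labeled continuations.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tok = AutoTokenizer.from_pretrained("roberta-base")  # placeholder base model
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical dataset with "prompt", "continuation", and binary "label" columns.
raw = load_dataset("csv", data_files={"train": "labeled_continuations.csv"})


def tokenize(batch):
    return tok(batch["prompt"], batch["continuation"], truncation=True, max_length=512)


tokenized = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="filter-classifier",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized["train"],
    tokenizer=tok,  # enables padding via the default data collator
)
trainer.train()
```

The resulting checkpoint could then be dropped into the rejection-sampling loop above and benchmarked the same way.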
Coming back to this: your concern makes sense to me, and your proposal to train a new classifier for filtered generation, so as to improve performance on other tasks, seems very interesting. I think it might also be useful to simply provide a nice open-source implementation of rejection sampling in a popular generator codebase, such as the one behind Facebook’s OPT-175B, so that future researchers can build on it.
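On the packaging side, even an interface as small as the following might be enough to build on. This is a hypothetical wrapper, not the OPT-175B repo's actual API; it only captures the retry-until-accepted loop and leaves the generator and classifier as injected callables.

```python
# Hypothetical reusable wrapper (not any existing repo's API): retry sampling until the filter accepts.
from typing import Callable, Optional


class FilteredGenerator:
    def __init__(
        self,
        sample_fn: Callable[[str], str],        # e.g. a closure around model.generate
        accept_fn: Callable[[str, str], bool],  # e.g. a classifier thresholded at some cutoff
        max_tries: int = 16,
    ):
        self.sample_fn = sample_fn
        self.accept_fn = accept_fn
        self.max_tries = max_tries

    def generate(self, prompt: str) -> Optional[str]:
        for _ in range(self.max_tries):
            continuation = self.sample_fn(prompt)
            if self.accept_fn(prompt, continuation):
                return continuation
        return None  # nothing accepted; the caller chooses a fallback policy
```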
I’m planning on working on technical AI safety full-time this summer. Right now I’m busy applying to a few different programs, but I’ll definitely follow up on this idea with you.