aog comments on High-stakes alignment via adversarial training [Redwood Research report]

aog 22 May 2022 22:04 UTC
Coming back to this: your concern makes sense to me. Your proposal to train a new classifier for filtered generation to improve performance on other tasks seems very interesting. I think it might also be useful to simply provide a clean open-source implementation of rejection sampling in the codebase of a popular generator like Facebook’s OPT-175B, so that future researchers can build on it.
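To make the rejection-sampling idea concrete, here is a minimal Python sketch of filtered generation. It is an illustration under stated assumptions, not the implementation from the report or any real repo's API: `generate` and `score` are hypothetical callables standing in for a language model and a classifier, and the threshold, retry limit, and toy stand-ins are made up for the example.

```python
import itertools
from typing import Callable, Iterable, Optional


def rejection_sample(
    generate: Callable[[str], str],      # hypothetical: prompt -> completion
    score: Callable[[str, str], float],  # hypothetical: (prompt, completion) -> risk in [0, 1]
    prompt: str,
    threshold: float = 0.5,              # assumed cutoff; a real system would tune this
    max_tries: int = 20,                 # assumed retry budget
) -> Optional[str]:
    """Return the first completion the classifier scores below threshold, else None."""
    for _ in range(max_tries):
        completion = generate(prompt)
        if score(prompt, completion) < threshold:
            return completion
    # Every sampled completion was rejected; the caller decides on a fallback.
    return None


def make_toy_generator(candidates: Iterable[str]) -> Callable[[str], str]:
    """Toy stand-in for a model: cycles through fixed completions, ignoring the prompt."""
    it = itertools.cycle(candidates)
    return lambda prompt: next(it)


def toy_score(prompt: str, completion: str) -> float:
    """Toy stand-in for a classifier: flags any completion containing 'unsafe'."""
    return 1.0 if "unsafe" in completion else 0.0


if __name__ == "__main__":
    gen = make_toy_generator([" something unsafe", " something fine"])
    print(rejection_sample(gen, toy_score, "Once upon a time"))
```

Returning `None` when the retry budget runs out keeps the failure case explicit, so whoever builds on this can choose between a canned fallback, more samples, or a lower-quality answer.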
I’m planning on working on technical AI safety full-time this summer. Right now I’m busy applying to a few different programs, but I’ll definitely follow up on this idea with you.