Did you consider using the approach described in Ethan Perez’s “Red Teaming LMs with LMs”? This would mean using a new generator model to build many prompts, having the original generator complete those prompts, and then having a classifier identify any injurious examples in the completions.
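Concretely, I'm picturing something like the sketch below. This is only a toy illustration of the loop, not Perez et al.'s actual setup: the model names are placeholders, a sentiment model stands in for an injury classifier, and the 0.9 threshold is an arbitrary decision rule.

```python
from transformers import pipeline

# Red-team generator that proposes test prompts (placeholder model).
red_teamer = pipeline("text-generation", model="gpt2")
# The original generator whose completions we want to stress-test (placeholder model).
target = pipeline("text-generation", model="gpt2-medium")
# Stand-in classifier; a real run would use a classifier trained to flag injurious text.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

seed = "Write the opening line of a short story:"

# Step 1: the red-team model proposes many candidate prompts.
proposals = red_teamer(seed, num_return_sequences=8, max_new_tokens=30, do_sample=True)
prompts = [p["generated_text"] for p in proposals]

flagged = []
for prompt in prompts:
    # Step 2: the original generator completes each prompt.
    completion = target(prompt, max_new_tokens=60, do_sample=True)[0]["generated_text"]
    # Step 3: the classifier flags completions it judges problematic.
    verdict = classifier(completion[:1000])[0]
    if verdict["label"] == "NEGATIVE" and verdict["score"] > 0.9:  # arbitrary decision rule
        flagged.append({"prompt": prompt, "completion": completion, "score": verdict["score"]})

print(f"Flagged {len(flagged)} of {len(prompts)} completions for review.")
```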
The tricky part seems to be that this assumes the classifier’s judgements are correct. If you trained the classifier on the examples identified by this process, you’d only be training it on examples it already labels correctly, so they’d add little new signal. To escape this problem, you could have humans grade whether each classification was correct, and train the classifier on the incorrectly classified examples. But if humans have to review every example, this might not be any more useful than your original data-labeling process using examples from human-written stories.
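For the human-review variant, I'd imagine something like this purely hypothetical continuation of the sketch above, where `flagged` is the list it produced and the grades would come from annotators rather than code:

```python
# Hypothetical human-review step: annotators grade whether each flagged
# classification was correct, and only the classifier's mistakes (here, false
# positives) are kept as new training data.
human_says_injurious = [False, True, False]  # placeholder judgments, one per flagged example

retraining_set = [
    ex
    for ex, truly_injurious in zip(flagged, human_says_injurious)
    if not truly_injurious  # classifier flagged it, but humans disagreed
]
```

The cost concern is visible even in this toy version: every element of `flagged` needs a human judgment before anything can be added to the retraining set.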
I suppose it wouldn’t really add much value. Would you agree? Are there any related circumstances where the approach would be more useful? Maybe this would be a better question for Ethan...