For those interested in empirical work on RLHF and building the safety community, Meta AI’s BlenderBot project has produced several good papers on the topic. A few that I liked:
My favorite one is about filtering out trolls from crowdsourced human feedback data. They begin with the observation that when you crowdsource human feedback, you’re going to get some bad feedback. They identify various kinds of archetypal trolls: the “Safe Troll” who marks every response as safe, the “Unsafe Troll” who marks them all unsafe, the “Gaslight Troll” whose only feedback is marking unsafe generations as safe, and the “Lazy Troll” whose feedback is simply incorrect a random percentage of the time. Then they implement several methods for filtering out incorrect feedback examples, and find that the best approach (for their particular hypothesized problem) is to score each user’s trustworthiness based on agreement with other users’ feedback scores and filter feedback from untrustworthy users. This seems like an important problem for ChatGPT or any system accepting public feedback, and I’m very happy to see a benchmark identifying the problem and several proposed methods making progress.
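The trustworthiness idea can be sketched in a few lines. This is my own minimal illustration, not the paper's implementation: score each user by how often their labels agree with the majority of *other* users on the same example, then drop users below a threshold (the function names and the 0.5 cutoff are hypothetical).

```python
from collections import defaultdict

def trust_scores(feedback):
    """feedback: list of (user_id, example_id, label) tuples."""
    # Group all labels by example so we can compare each user
    # against the other annotators of the same example.
    by_example = defaultdict(list)
    for user, example, label in feedback:
        by_example[example].append((user, label))

    agree = defaultdict(int)
    total = defaultdict(int)
    for example, votes in by_example.items():
        for user, label in votes:
            others = [lab for u, lab in votes if u != user]
            if not others:
                continue  # no one else labeled this example
            # Majority label among the *other* users.
            majority = max(set(others), key=others.count)
            total[user] += 1
            agree[user] += int(label == majority)
    # Trust = fraction of a user's labels that match the peer majority.
    return {u: agree[u] / total[u] for u in total}

def filter_feedback(feedback, threshold=0.5):
    """Keep only feedback from users whose trust score clears the bar."""
    scores = trust_scores(feedback)
    return [(u, e, lab) for u, e, lab in feedback
            if scores.get(u, 0.0) >= threshold]
```

A "Gaslight Troll" who systematically flips labels ends up with a trust score near zero and gets filtered, while honest annotators who agree with each other are retained.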
They have a more general paper about online learning from human feedback. They recommend a modular feedback form similar to the one later used by ChatGPT, and they observe that feedback on smaller models can improve the performance of larger models.
DIRECTOR is a method for language model generation aided by a reward model classifier. The method performs better than ranking and filtering beam searches as proposed in FUDGE, but unfortunately they don’t compare to full fine-tuning on a learned preference model as used by OpenAI and Anthropic.
They also have a pair of papers on integrating retrieved knowledge into language model responses, which is relevant for building Truthful AI but advances capabilities more than I think safety researchers should be comfortable with.
Meta and Yann LeCun in particular haven’t been very receptive to arguments about AI risk. But the people working on BlenderBot have a natural overlap with OpenAI, Anthropic, and others working on alignment who’d like to collect and use human feedback to align language models. Meta’s language models aren’t nearly as capable as OpenAI’s and have attracted correspondingly less buzz, but they did release BlenderBot 3 with the intention of gathering lots of human feedback data. They could plausibly be a valuable ally and source of research on how to robustly align language models using human preference data.