By Sophie Bridgers, Rishub Jain, Rory Greig, and Rohin Shah
Based on work by the Google DeepMind Rater Assist Team: Vladimir Mikulik, Sophie Bridgers, Tian Huey Teh, Rishub Jain, Rory Greig, Lili Janzer (randomized order, equal contributions)
Human oversight is critical for ensuring that Artificial Intelligence (AI) models remain safe and aligned to human values. But AI systems are rapidly advancing in capabilities and are being used to complete ever more complex tasks, making it increasingly challenging for humans to verify AI outputs and provide high-quality feedback. How can we ensure that humans can continue to meaningfully evaluate AI performance? An avenue of research to tackle this problem is “Amplified Oversight” (also called “Scalable Oversight”), which aims to develop techniques to use AI to amplify humans’ abilities to oversee increasingly powerful AI systems, even if they eventually surpass human capabilities in particular domains.
With AI this advanced, we could use AI itself to evaluate other AIs (i.e., AI raters), but this approach comes with drawbacks (see Section IV: The Elephant in the Room). Importantly, humans and AIs have complementary strengths and weaknesses. We should thus, in principle, be able to leverage these complementary abilities to generate an oversight signal for model training, evaluation, and monitoring that is stronger than what we could get from human raters or AI raters alone. Two promising mechanisms for harnessing human-AI complementarity to improve oversight are:
Rater Assistance, in which we give human raters access to an AI rating assistant that can critique or point out flaws in an AI output or automate parts of the rating task, and
Hybridization, in which we combine judgments from human raters and AI raters working in isolation, based on predictions of their relative rating ability for each task instance (e.g., each rater's confidence). A minimal sketch of both mechanisms follows below.
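To make these two mechanisms concrete, here is a minimal Python sketch of how they might be wired together. It is illustrative rather than a description of our actual protocols: the Judgment type, the rater and critic callables, and the fixed confidence threshold are assumptions made purely for exposition.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Judgment:
    label: str          # e.g. "response_A_better" or "response_B_better"
    confidence: float   # calibrated confidence in [0.0, 1.0]


def assisted_human_judgment(
    task: str,
    human_rater: Callable[[str, Optional[str]], Judgment],
    ai_critic: Callable[[str], str],
) -> Judgment:
    """Rater Assistance: show the human an AI-generated critique of the output.

    The AI surfaces potential flaws, but the human makes the final call.
    """
    critique = ai_critic(task)
    return human_rater(task, critique)


def hybrid_judgment(
    task: str,
    human_rater: Callable[[str], Judgment],
    ai_rater: Callable[[str], Judgment],
    ai_confidence_threshold: float = 0.9,
) -> Judgment:
    """Hybridization: route each task instance to whichever rater is predicted
    to be more reliable.

    A simple confidence-gated policy: accept the AI judgment when its calibrated
    confidence clears the threshold, otherwise defer to the human. The two raters
    work in isolation, so the human signal is never anchored on the AI's answer.
    """
    ai_judgment = ai_rater(task)
    if ai_judgment.confidence >= ai_confidence_threshold:
        return ai_judgment
    return human_rater(task)
```

In practice, the per-instance routing decision could come from a learned predictor of rater reliability rather than a raw confidence threshold, and the critique interface has to be designed so that raters neither ignore the assistant nor defer to it wholesale, which is exactly the challenge discussed next.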
The design of Rater Assistance and/or Hybridization protocols that enable human-AI complementarity is challenging. It requires grappling with complex questions such as how to pinpoint the unique skills and knowledge that humans or AIs possess, how to identify when AI or human judgment is more reliable, and how to effectively use AI to improve human reasoning and decision-making without leading to under- or over-reliance on the AI. These are fundamentally questions of Human-Computer Interaction (HCI), Cognitive Science, Psychology, Philosophy, and Education. Luckily, these fields have explored these same or related questions, and AI safety can learn from and collaborate with them to address these sociotechnical challenges. On our team, we have worked to expand our interdisciplinary expertise to make progress on Rater Assistance and Hybridization for Amplified Oversight.
Read the rest of the blog post here!
Nice! Purely for my own ease of comprehension I’d have liked a little more translation/analogizing between AI jargon and HCI jargon—e.g. the phrase “active learning” doesn’t appear in the post.
I disagree in several ways.
Humans being definitionally accurate is a convenient assumption on easy problems, but it breaks down on hard problems. The thing is, human responses to questions are not always best thought of as direct access to some underlying truth: we give different answers in different contexts, and those responses have to be interpreted in sometimes quite sophisticated ways to turn them into good choices between actions in the real world. There are even cases where humans will give some answer on the object level, but will disagree with themselves when asked a meta-level question about that object-level answer (perhaps even endorsing some non-human process, including an AI). If humans were always accurate, this would be a paradox.
AI is going to get smart. Eventually, quite smart. Smart enough to predict human behavior in new contexts kind of smart. On the one hand this is good news because it means that if we can reduce moral questions to empirical questions about the behavior of humans in novel contexts (and I mean do it in a philosophically satisfying way, not just try something that sounds good and hope it works), we’re almost done. On the other hand this is bad news because it means that AI ignorance about human behavior cannot be used to ensure properties like corrigibility, and predictions of future AI-human interaction based on assumptions of AI ignorance have to be abandoned.