Debate (AI safety technique)

Tag · Last edit: Feb 6, 2023, 12:35 AM by plex

Debate is a proposed technique for allowing human evaluators to get correct and helpful answers from experts, even if the evaluators are not themselves experts and cannot fully verify the answers.^[1] The technique was suggested as part of an approach to building advanced AI systems that are aligned with human values, and to safely applying machine learning techniques to problems that are high-stakes but not well-defined (such as advancing science or increasing a company’s revenue).^[2]^[3]

Briefly thinking through some analogs of debate

Eli Tyre

Sep 11, 2022, 12:02 PM

20 points

3 comments · 4 min read · LW link

Writeup: Progress on AI Safety via Debate

Beth Barnes and paulfchristiano

Feb 5, 2020, 9:04 PM

103 points

18 comments · 33 min read · LW link

A guide to Iterated Amplification & Debate

Rafael Harth

Nov 15, 2020, 5:14 PM

75 points

12 comments · 15 min read · LW link

Thoughts on AI Safety via Debate

Vaniver

May 9, 2018, 7:46 PM

35 points

12 comments · 6 min read · LW link

Debate update: Obfuscated arguments problem

Beth Barnes

Dec 23, 2020, 3:24 AM

135 points

24 comments · 16 min read · LW link

AI Safety via Debate

ESRogs

May 5, 2018, 2:11 AM

27 points

14 comments · 1 min read · LW link

(blog.openai.com)

[Question] How should AI debate be judged?

abramdemski

Jul 15, 2020, 10:20 PM

49 points

26 comments · 6 min read · LW link

Optimal play in human-judged Debate usually won’t answer your question

Joe Collman

Jan 27, 2021, 7:34 AM

33 points

12 comments · 12 min read · LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda

Dec 15, 2021, 11:44 PM

127 points

9 comments · 15 min read · LW link

[New LW Feature] “Debates”

Ruby, RobertM, GPT-4 and Claude+

Apr 1, 2023, 7:00 AM

121 points

35 comments · 1 min read · LW link

A Small Negative Result on Debate

Sam Bowman

Apr 12, 2022, 6:19 PM

42 points

11 comments · 1 min read · LW link

Deception Chess: Game #2

Zane

Nov 29, 2023, 2:43 AM

29 points

17 comments · 2 min read · LW link

The limits of AI safety via debate

Marius Hobbhahn

May 10, 2022, 1:33 PM

35 points

8 comments · 10 min read · LW link

An overview of 11 proposals for building safe advanced AI

evhub

May 29, 2020, 8:38 PM

220 points

37 comments · 38 min read · LW link · 2 reviews

AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving

DanielFilan

Jul 1, 2022, 10:20 PM

20 points

0 comments · 37 min read · LW link

Inference-Only Debate Experiments Using Math Problems

Arjun Panickssery, Abhimanyu Pallavi Sudhir and JacksonKaunismaa

Aug 6, 2024, 5:44 PM

31 points

0 comments · 2 min read · LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes

Aug 3, 2020, 2:29 PM

55 points

35 comments · 4 min read · LW link

Synthesizing amplification and debate

evhub

Feb 5, 2020, 10:53 PM

33 points

10 comments · 4 min read · LW link

Splitting Debate up into Two Subsystems

Nandi

Jul 3, 2020, 8:11 PM

13 points

5 comments · 4 min read · LW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes

Aug 19, 2020, 3:15 AM

52 points

5 comments · 1 min read · LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner

Dec 12, 2022, 11:51 AM

33 points

13 comments · 2 min read · LW link

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes

Feb 10, 2023, 7:22 PM

71 points

19 comments · 9 min read · LW link

Clarifying Factored Cognition

Rafael Harth

Dec 13, 2020, 8:02 PM

23 points

2 comments · 3 min read · LW link

Thoughts on “AI safety via debate”

Gordon Seidoh Worley

May 10, 2018, 12:44 AM

12 points

4 comments · 5 min read · LW link

Idealized Factored Cognition

Rafael Harth

Nov 30, 2020, 6:49 PM

34 points

6 comments · 11 min read · LW link

Traversing a Cognition Space

Rafael Harth

Dec 7, 2020, 6:32 PM

17 points

5 comments · 12 min read · LW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh Worley

Jun 30, 2020, 7:34 PM

5 points

0 comments · 9 min read · LW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes

Jan 10, 2021, 12:30 AM

107 points

15 comments · 11 min read · LW link · 1 review

Why I’m excited about Debate

Richard_Ngo

Jan 15, 2021, 11:37 PM

75 points

12 comments · 7 min read · LW link

FC final: Can Factored Cognition schemes scale?

Rafael Harth

Jan 24, 2021, 10:18 PM

17 points

0 comments · 17 min read · LW link

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilan

Apr 8, 2021, 9:20 PM

26 points

3 comments · 60 min read · LW link

An AI-in-a-box success model

azsantosk

Apr 11, 2022, 10:28 PM

16 points

1 comment · 10 min read · LW link

Learning the smooth prior

Geoffrey Irving, Rohin Shah and evhub

Apr 29, 2022, 9:10 PM

35 points

0 comments · 12 min read · LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy

May 12, 2022, 8:01 PM

58 points

0 comments · 59 min read · LW link

AI-Written Critiques Help Humans Notice Flaws

paulfchristiano

Jun 25, 2022, 5:22 PM

137 points

5 comments · 3 min read · LW link

(openai.com)

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy

Aug 4, 2022, 2:12 AM

18 points

0 comments · 5 min read · LW link

Rant on Problem Factorization for Alignment

johnswentworth

Aug 5, 2022, 7:23 PM

102 points

53 comments · 6 min read · LW link

Debate AI and the Decision to Release an AI

Chris_Leong

Jan 17, 2019, 2:36 PM

9 points

18 comments · 3 min read · LW link

Alignment via prosocial brain algorithms

Cameron Berg

Sep 12, 2022, 1:48 PM

45 points

30 comments · 6 min read · LW link

The “AI Debate” Debate

michaelcohen

Jul 2, 2020, 10:16 AM

20 points

20 comments · 3 min read · LW link

AI Unsafety via Non-Zero-Sum Debate

VojtaKovarik

Jul 3, 2020, 10:03 PM

25 points

10 comments · 5 min read · LW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. Brown

Nov 16, 2022, 3:33 PM

13 points

2 comments · 12 min read · LW link

(sambrown.eu)

Notes on OpenAI’s alignment plan

Alex Flint

Dec 8, 2022, 7:13 PM

40 points

5 comments · 7 min read · LW link

truth.integrity(): A Recursive Framework for Hallucination Prevention and Alignment

brittneyluong

Apr 2, 2025, 5:52 PM

1 point

0 comments · 2 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad

Dec 13, 2022, 2:17 AM

10 points

5 comments · 45 min read · LW link

Anthropic Fall 2023 Debate Progress Update

Ansh Radhakrishnan

Nov 28, 2023, 5:37 AM

76 points

9 comments · 12 min read · LW link

OpenAI Credit Account (2510$)

Emirhan BULUT

Jan 21, 2024, 2:32 AM

1 point

0 comments · 1 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

Feb 7, 2024, 9:28 PM

89 points

14 comments · 9 min read · LW link

(arxiv.org)

Superposition Checkers: A Game Where AI’s Strengths Become Fatal Flaws

R. A. McCormack

Apr 6, 2025, 12:57 AM

1 point

0 comments · 2 min read · LW link

Alignment Gaps

kcyras

Jun 8, 2024, 3:23 PM

11 points

4 comments · 8 min read · LW link

AI Debate Stability: Addressing Self-Defeating Responses

Annie Sorkin

Jun 11, 2024, 3:03 AM

9 points

0 comments · 3 min read · LW link

NYU Code Debates Update/Postmortem

David Rein

May 24, 2024, 4:08 PM

27 points

4 comments · 10 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner and Rohin Shah

Jul 8, 2024, 8:59 AM

49 points

18 comments · 7 min read · LW link

(arxiv.org)

NYU Debate Training Update: Methods, Baselines, Preliminary Results

samarnesen

Jul 6, 2024, 6:28 PM

9 points

0 comments · 20 min read · LW link

Control Vectors as Dispositional Traits

Gianluca Calcagni

Jun 23, 2024, 9:34 PM

10 points

0 comments · 11 min read · LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir

Sep 16, 2024, 1:04 AM

5 points

1 comment · 5 min read · LW link

Debate, Oracles, and Obfuscated Arguments

Jonah Brown-Cohen and Geoffrey Irving

Jun 20, 2024, 11:14 PM

44 points

4 comments · 21 min read · LW link

Embracing complexity when developing and evaluating AI responsibly

Aliya Amirova

Oct 11, 2024, 5:46 PM

2 points

9 comments · 9 min read · LW link

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee

Apr 14, 2025, 10:27 AM

−3 points

2 comments · 4 min read · LW link

Making LLMs safer is more intuitive than you think: How Common Sense and Diversity Improve AI Alignment

Jeba Sania

Dec 29, 2024, 7:27 PM

−5 points

1 comment · 6 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr

Feb 12, 2021, 7:55 AM

15 points

0 comments · 27 min read · LW link

Topological Debate Framework

lunatic_at_large

Jan 16, 2025, 5:19 PM

10 points

5 comments · 9 min read · LW link

[Question] Enhanced Clarity to Bridge the AI Labeling Gap?

Pathways

Jan 26, 2025, 6:48 AM

1 point

0 comments · 1 min read · LW link

Debate Minus Factored Cognition

abramdemski

Dec 29, 2020, 10:59 PM

37 points

42 comments · 11 min read · LW link

Arguing for the Truth? An Inference-Only Study into AI Debate

denisemester

Feb 11, 2025, 3:04 AM

7 points

0 comments · 16 min read · LW link

Can there be an indescribable hellworld?

Stuart_Armstrong

Jan 29, 2019, 3:00 PM

39 points

19 comments · 2 min read · LW link

Empathy bandaid for immediate AI catastrophe

installgentoo

Apr 5, 2023, 2:12 AM

1 point

2 comments · 1 min read · LW link

Debate helps supervise human experts [Paper]

habryka

Nov 17, 2023, 5:25 AM

29 points

6 comments · 1 min read · LW link

(github.com)

AI debate: test yourself against chess ‘AIs’

Richard Willis

Nov 22, 2023, 2:58 PM

26 points

35 comments · 4 min read · LW link

Parallels Between AI Safety by Debate and Evidence Law

Cullen

Jul 20, 2020, 10:52 PM

10 points

1 comment · 2 min read · LW link

(cullenokeefe.com)

AI Safety Debate and Its Applications

VojtaKovarik

Jul 23, 2019, 10:31 PM

38 points

5 comments · 11 min read · LW link

New paper: (When) is Truth-telling Favored in AI debate?

VojtaKovarik

Dec 26, 2019, 7:59 PM

32 points

7 comments · 5 min read · LW link

(medium.com)

Problems with AI debate

Stuart_Armstrong

Aug 26, 2019, 7:21 PM

21 points

3 comments · 5 min read · LW link

Evaluating Superhuman Models with Consistency Checks

Daniel Paleka and Lukas Fluri

Aug 1, 2023, 7:51 AM

21 points

2 comments · 9 min read · LW link

(arxiv.org)

AI Safety 101 - Chapter 5.1 - Debate

Charbel-Raphaël

Oct 31, 2023, 2:29 PM

15 points

0 comments · 13 min read · LW link

Zach Stein-Perlman Dec 16, 2024, 9:32 PM
2 points
The more important zoom-level is: debate is a proposed technique to provide a good training signal. See e.g. https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform?commentId=DLYDeiumQPWv4pdZ4.
Edit: debate is a technique for iterated amplification—but that tag is terrible too, oh no
