
Debate (AI safety technique)

Last edit: Feb 6, 2023, 12:35 AM by plex

Debate is a proposed technique for allowing human evaluators to get correct and helpful answers from experts, even if the evaluator is not an expert themselves and cannot fully verify the answers.[1] The technique was suggested as part of an approach for building advanced AI systems that are aligned with human values, and for safely applying machine learning techniques to problems that are high-stakes but not well-defined (such as advancing science or increasing a company’s revenue).[2][3]

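The posts below spell out the protocol and its variants in detail. As a rough, hypothetical sketch (the function names here are illustrative, not drawn from any cited implementation), a single debate can be pictured as two agents arguing for opposing answers before a judge who only sees the transcript:

```python
# Minimal, illustrative sketch of a two-player debate round.
# The "debater" and "judge" are stand-in callables; in the proposals
# below they would be trained ML models and a human evaluator.

def run_debate(question, answer_a, answer_b, debater, judge, n_turns=4):
    """Alternate arguments for two candidate answers, then ask the
    judge which side the transcript supports better ("A" or "B")."""
    transcript = [("question", question)]
    for turn in range(n_turns):
        # Debaters take turns; each argues for its own answer and may
        # rebut the opponent's earlier arguments via the transcript.
        side, answer = ("A", answer_a) if turn % 2 == 0 else ("B", answer_b)
        argument = debater(side, answer, transcript)
        transcript.append((side, argument))
    return judge(transcript)

# Toy usage: scripted debaters and a keyword-matching judge.
demo_debater = lambda side, answer, t: f"{side} claims the answer is {answer}"
demo_judge = lambda t: "A" if any("Paris" in msg for _, msg in t) else "B"
winner = run_debate("Capital of France?", "Paris", "Lyon",
                    demo_debater, demo_judge)
print(winner)  # prints "A"
```

The safety-relevant hope, argued over in many of the posts below, is that under the right protocol it is easier to argue for the true answer than for a false one, so the judge can supervise debaters more capable than themselves.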

Writeup: Progress on AI Safety via Debate

Feb 5, 2020, 9:04 PM
102 points
18 comments · 33 min read · LW link

Briefly thinking through some analogs of debate

Eli Tyre · Sep 11, 2022, 12:02 PM
20 points
3 comments · 4 min read · LW link

A guide to Iterated Amplification & Debate

Rafael Harth · Nov 15, 2020, 5:14 PM
75 points
12 comments · 15 min read · LW link

AI Safety via Debate

ESRogs · May 5, 2018, 2:11 AM
27 points
14 comments · 1 min read · LW link
(blog.openai.com)

[Question] How should AI debate be judged?

abramdemski · Jul 15, 2020, 10:20 PM
49 points
26 comments · 6 min read · LW link

Thoughts on AI Safety via Debate

Vaniver · May 9, 2018, 7:46 PM
35 points
12 comments · 6 min read · LW link

Debate update: Obfuscated arguments problem

Beth Barnes · Dec 23, 2020, 3:24 AM
135 points
24 comments · 16 min read · LW link

Optimal play in human-judged Debate usually won’t answer your question

Joe Collman · Jan 27, 2021, 7:34 AM
33 points
12 comments · 12 min read · LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda · Dec 15, 2021, 11:44 PM
127 points
9 comments · 15 min read · LW link

[New LW Feature] “Debates”

Apr 1, 2023, 7:00 AM
121 points
35 comments · 1 min read · LW link

A Small Negative Result on Debate

Sam Bowman · Apr 12, 2022, 6:19 PM
42 points
11 comments · 1 min read · LW link

Deception Chess: Game #2

Zane · Nov 29, 2023, 2:43 AM
29 points
17 comments · 2 min read · LW link

Clarifying Factored Cognition

Rafael Harth · Dec 13, 2020, 8:02 PM
23 points
2 comments · 3 min read · LW link

An overview of 11 proposals for building safe advanced AI

evhub · May 29, 2020, 8:38 PM
220 points
36 comments · 38 min read · LW link · 2 reviews

AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving

DanielFilan · Jul 1, 2022, 10:20 PM
20 points
0 comments · 37 min read · LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes · Aug 3, 2020, 2:29 PM
55 points
35 comments · 4 min read · LW link

Synthesizing amplification and debate

evhub · Feb 5, 2020, 10:53 PM
33 points
10 comments · 4 min read · LW link

Inference-Only Debate Experiments Using Math Problems

Aug 6, 2024, 5:44 PM
31 points
0 comments · 2 min read · LW link

Splitting Debate up into Two Subsystems

Nandi · Jul 3, 2020, 8:11 PM
13 points
5 comments · 4 min read · LW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes · Aug 19, 2020, 3:15 AM
52 points
5 comments · 1 min read · LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner · Dec 12, 2022, 11:51 AM
33 points
13 comments · 2 min read · LW link

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes · Feb 10, 2023, 7:22 PM
71 points
19 comments · 9 min read · LW link

The limits of AI safety via debate

Marius Hobbhahn · May 10, 2022, 1:33 PM
29 points
7 comments · 10 min read · LW link

Thoughts on “AI safety via debate”

Gordon Seidoh Worley · May 10, 2018, 12:44 AM
12 points
4 comments · 5 min read · LW link

Idealized Factored Cognition

Rafael Harth · Nov 30, 2020, 6:49 PM
34 points
6 comments · 11 min read · LW link

Traversing a Cognition Space

Rafael Harth · Dec 7, 2020, 6:32 PM
17 points
5 comments · 12 min read · LW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh Worley · Jun 30, 2020, 7:34 PM
5 points
0 comments · 9 min read · LW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes · Jan 10, 2021, 12:30 AM
107 points
15 comments · 11 min read · LW link · 1 review

Why I’m excited about Debate

Richard_Ngo · Jan 15, 2021, 11:37 PM
75 points
12 comments · 7 min read · LW link

FC final: Can Factored Cognition schemes scale?

Rafael Harth · Jan 24, 2021, 10:18 PM
17 points
0 comments · 17 min read · LW link

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilan · Apr 8, 2021, 9:20 PM
26 points
3 comments · 60 min read · LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy · May 12, 2022, 8:01 PM
58 points
0 comments · 59 min read · LW link

AI-Written Critiques Help Humans Notice Flaws

paulfchristiano · Jun 25, 2022, 5:22 PM
137 points
5 comments · 3 min read · LW link
(openai.com)

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy · Aug 4, 2022, 2:12 AM
18 points
0 comments · 5 min read · LW link

Rant on Problem Factorization for Alignment

johnswentworth · Aug 5, 2022, 7:23 PM
102 points
53 comments · 6 min read · LW link

Debate AI and the Decision to Release an AI

Chris_Leong · Jan 17, 2019, 2:36 PM
9 points
18 comments · 3 min read · LW link

Alignment via prosocial brain algorithms

Cameron Berg · Sep 12, 2022, 1:48 PM
45 points
30 comments · 6 min read · LW link

The “AI Debate” Debate

michaelcohen · Jul 2, 2020, 10:16 AM
20 points
20 comments · 3 min read · LW link

AI Unsafety via Non-Zero-Sum Debate

VojtaKovarik · Jul 3, 2020, 10:03 PM
25 points
10 comments · 5 min read · LW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. Brown · Nov 16, 2022, 3:33 PM
13 points
2 comments · 12 min read · LW link
(sambrown.eu)

Notes on OpenAI’s alignment plan

Alex Flint · Dec 8, 2022, 7:13 PM
40 points
5 comments · 7 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · Dec 13, 2022, 2:17 AM
10 points
5 comments · 45 min read · LW link

Anthropic Fall 2023 Debate Progress Update

Ansh Radhakrishnan · Nov 28, 2023, 5:37 AM
75 points
9 comments · 12 min read · LW link

OpenAI Credit Account (2510$)

Emirhan BULUT · Jan 21, 2024, 2:32 AM
1 point
0 comments · 1 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

Feb 7, 2024, 9:28 PM
88 points
14 comments · 9 min read · LW link
(arxiv.org)

Alignment Gaps

kcyras · Jun 8, 2024, 3:23 PM
11 points
4 comments · 8 min read · LW link

AI Debate Stability: Addressing Self-Defeating Responses

Annie Sorkin · Jun 11, 2024, 3:03 AM
9 points
0 comments · 3 min read · LW link

NYU Code Debates Update/Postmortem

David Rein · May 24, 2024, 4:08 PM
27 points
4 comments · 10 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

Jul 8, 2024, 8:59 AM
49 points
18 comments · 7 min read · LW link
(arxiv.org)

NYU Debate Training Update: Methods, Baselines, Preliminary Results

samarnesen · Jul 6, 2024, 6:28 PM
9 points
0 comments · 20 min read · LW link

Control Vectors as Dispositional Traits

Gianluca Calcagni · Jun 23, 2024, 9:34 PM
10 points
0 comments · 11 min read · LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir · Sep 16, 2024, 1:04 AM
5 points
1 comment · 5 min read · LW link

Debate, Oracles, and Obfuscated Arguments

Jun 20, 2024, 11:14 PM
44 points
4 comments · 21 min read · LW link

Embracing complexity when developing and evaluating AI responsibly

Aliya Amirova · Oct 11, 2024, 5:46 PM
2 points
9 comments · 9 min read · LW link

Making LLMs safer is more intuitive than you think: How Common Sense and Diversity Improve AI Alignment

Jeba Sania · Dec 29, 2024, 7:27 PM
−5 points
1 comment · 6 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr · Feb 12, 2021, 7:55 AM
15 points
0 comments · 27 min read · LW link

Topological Debate Framework

lunatic_at_large · Jan 16, 2025, 5:19 PM
10 points
5 comments · 9 min read · LW link

[Question] Enhanced Clarity to Bridge the AI Labeling Gap?

Pathways · Jan 26, 2025, 6:48 AM
1 point
0 comments · 1 min read · LW link

Debate Minus Factored Cognition

abramdemski · Dec 29, 2020, 10:59 PM
37 points
42 comments · 11 min read · LW link

Arguing for the Truth? An Inference-Only Study into AI Debate

denisemester · Feb 11, 2025, 3:04 AM
6 points
0 comments · 16 min read · LW link

Can there be an indescribable hellworld?

Stuart_Armstrong · Jan 29, 2019, 3:00 PM
39 points
19 comments · 2 min read · LW link

Empathy bandaid for immediate AI catastrophe

installgentoo · Apr 5, 2023, 2:12 AM
1 point
2 comments · 1 min read · LW link

Debate helps supervise human experts [Paper]

habryka · Nov 17, 2023, 5:25 AM
29 points
6 comments · 1 min read · LW link
(github.com)

AI debate: test yourself against chess ‘AIs’

Richard Willis · Nov 22, 2023, 2:58 PM
26 points
35 comments · 4 min read · LW link

Parallels Between AI Safety by Debate and Evidence Law

Cullen · Jul 20, 2020, 10:52 PM
10 points
1 comment · 2 min read · LW link
(cullenokeefe.com)

AI Safety Debate and Its Applications

VojtaKovarik · Jul 23, 2019, 10:31 PM
38 points
5 comments · 11 min read · LW link

New paper: (When) is Truth-telling Favored in AI debate?

VojtaKovarik · Dec 26, 2019, 7:59 PM
32 points
7 comments · 5 min read · LW link
(medium.com)

Problems with AI debate

Stuart_Armstrong · Aug 26, 2019, 7:21 PM
21 points
3 comments · 5 min read · LW link

Evaluating Superhuman Models with Consistency Checks

Aug 1, 2023, 7:51 AM
21 points
2 comments · 9 min read · LW link
(arxiv.org)

AI Safety 101 - Chapter 5.1 - Debate

Charbel-Raphaël · Oct 31, 2023, 2:29 PM
15 points
0 comments · 13 min read · LW link

An AI-in-a-box success model

azsantosk · Apr 11, 2022, 10:28 PM
16 points
1 comment · 10 min read · LW link

Learning the smooth prior

Apr 29, 2022, 9:10 PM
35 points
0 comments · 12 min read · LW link