Debate (AI safety technique)

Tag
Last edit: 6 Feb 2023 0:35 UTC by plex

Debate is a proposed technique for allowing human evaluators to get correct and helpful answers from experts, even if the evaluators are not themselves experts and cannot fully verify the answers.[1] The technique was suggested as part of an approach to building advanced AI systems that are aligned with human values, and to safely applying machine learning techniques to problems that are high-stakes but not well-defined (such as advancing science or increasing a company’s revenue).[2][3]

Writeup: Progress on AI Safety via Debate

5 Feb 2020 21:04 UTC
102 points
18 comments, 33 min read, LW link

Briefly thinking through some analogs of debate

Eli Tyre, 11 Sep 2022 12:02 UTC
20 points
3 comments, 4 min read, LW link

A guide to Iterated Amplification & Debate

Rafael Harth, 15 Nov 2020 17:14 UTC
75 points
12 comments, 15 min read, LW link

Thoughts on AI Safety via Debate

Vaniver, 9 May 2018 19:46 UTC
35 points
12 comments, 6 min read, LW link

Debate update: Obfuscated arguments problem

Beth Barnes, 23 Dec 2020 3:24 UTC
135 points
24 comments, 16 min read, LW link

AI Safety via Debate

ESRogs, 5 May 2018 2:11 UTC
27 points
14 comments, 1 min read, LW link
(blog.openai.com)

[Question] How should AI debate be judged?

abramdemski, 15 Jul 2020 22:20 UTC
49 points
26 comments, 6 min read, LW link

Optimal play in human-judged Debate usually won’t answer your question

Joe Collman, 27 Jan 2021 7:34 UTC
33 points
12 comments, 12 min read, LW link

A Small Negative Result on Debate

Sam Bowman, 12 Apr 2022 18:19 UTC
42 points
11 comments, 1 min read, LW link

Inference-Only Debate Experiments Using Math Problems

6 Aug 2024 17:44 UTC
31 points
0 comments, 2 min read, LW link

[New LW Feature] “Debates”

1 Apr 2023 7:00 UTC
121 points
35 comments, 1 min read, LW link

Splitting Debate up into Two Subsystems

Nandi, 3 Jul 2020 20:11 UTC
13 points
5 comments, 4 min read, LW link

Thoughts on “AI safety via debate”

Gordon Seidoh Worley, 10 May 2018 0:44 UTC
12 points
4 comments, 5 min read, LW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh Worley, 30 Jun 2020 19:34 UTC
5 points
0 comments, 9 min read, LW link

Deception Chess: Game #2

Zane, 29 Nov 2023 2:43 UTC
29 points
17 comments, 2 min read, LW link

An overview of 11 proposals for building safe advanced AI

evhub, 29 May 2020 20:38 UTC
213 points
36 comments, 38 min read, LW link, 2 reviews

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes, 3 Aug 2020 14:29 UTC
55 points
35 comments, 4 min read, LW link

Synthesizing amplification and debate

evhub, 5 Feb 2020 22:53 UTC
33 points
10 comments, 4 min read, LW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes, 19 Aug 2020 3:15 UTC
52 points
5 comments, 1 min read, LW link

Clarifying Factored Cognition

Rafael Harth, 13 Dec 2020 20:02 UTC
23 points
2 comments, 3 min read, LW link

Idealized Factored Cognition

Rafael Harth, 30 Nov 2020 18:49 UTC
34 points
6 comments, 11 min read, LW link

Traversing a Cognition Space

Rafael Harth, 7 Dec 2020 18:32 UTC
17 points
5 comments, 12 min read, LW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes, 10 Jan 2021 0:30 UTC
107 points
15 comments, 11 min read, LW link, 1 review

Why I’m excited about Debate

Richard_Ngo, 15 Jan 2021 23:37 UTC
75 points
12 comments, 7 min read, LW link

FC final: Can Factored Cognition schemes scale?

Rafael Harth, 24 Jan 2021 22:18 UTC
17 points
0 comments, 17 min read, LW link

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilan, 8 Apr 2021 21:20 UTC
26 points
3 comments, 60 min read, LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda, 15 Dec 2021 23:44 UTC
127 points
9 comments, 15 min read, LW link

The limits of AI safety via debate

Marius Hobbhahn, 10 May 2022 13:33 UTC
29 points
7 comments, 10 min read, LW link

AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving

DanielFilan, 1 Jul 2022 22:20 UTC
20 points
0 comments, 37 min read, LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner, 12 Dec 2022 11:51 UTC
33 points
13 comments, 2 min read, LW link

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes, 10 Feb 2023 19:22 UTC
71 points
19 comments, 9 min read, LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy, 12 May 2022 20:01 UTC
58 points
0 comments, 59 min read, LW link

AI-Written Critiques Help Humans Notice Flaws

paulfchristiano, 25 Jun 2022 17:22 UTC
137 points
5 comments, 3 min read, LW link
(openai.com)

AI Safety Debate and Its Applications

VojtaKovarik, 23 Jul 2019 22:31 UTC
38 points
5 comments, 11 min read, LW link

New paper: (When) is Truth-telling Favored in AI debate?

VojtaKovarik, 26 Dec 2019 19:59 UTC
32 points
7 comments, 5 min read, LW link
(medium.com)

NYU Code Debates Update/Postmortem

David Rein, 24 May 2024 16:08 UTC
27 points
4 comments, 10 min read, LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad, 13 Dec 2022 2:17 UTC
10 points
5 comments, 45 min read, LW link

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy, 4 Aug 2022 2:12 UTC
18 points
0 comments, 5 min read, LW link

Problems with AI debate

Stuart_Armstrong, 26 Aug 2019 19:21 UTC
21 points
3 comments, 5 min read, LW link

Rant on Problem Factorization for Alignment

johnswentworth, 5 Aug 2022 19:23 UTC
98 points
53 comments, 6 min read, LW link

Evaluating Superhuman Models with Consistency Checks

1 Aug 2023 7:51 UTC
21 points
2 comments, 9 min read, LW link
(arxiv.org)

AI Debate Stability: Addressing Self-Defeating Responses

Annie Sorkin, 11 Jun 2024 3:03 UTC
9 points
0 comments, 3 min read, LW link

Debate AI and the Decision to Release an AI

Chris_Leong, 17 Jan 2019 14:36 UTC
9 points
18 comments, 3 min read, LW link

OpenAI Credit Account (2510$)

Emirhan BULUT, 21 Jan 2024 2:32 UTC
1 point
0 comments, 1 min read, LW link

AI Safety 101 - Chapter 5.1 - Debate

Charbel-Raphaël, 31 Oct 2023 14:29 UTC
15 points
0 comments, 13 min read, LW link

Alignment Gaps

kcyras, 8 Jun 2024 15:23 UTC
10 points
3 comments, 8 min read, LW link

Alignment via prosocial brain algorithms

Cameron Berg, 12 Sep 2022 13:48 UTC
45 points
30 comments, 6 min read, LW link

The “AI Debate” Debate

michaelcohen, 2 Jul 2020 10:16 UTC
20 points
20 comments, 3 min read, LW link

Debate, Oracles, and Obfuscated Arguments

20 Jun 2024 23:14 UTC
40 points
2 comments, 21 min read, LW link

AI Unsafety via Non-Zero-Sum Debate

VojtaKovarik, 3 Jul 2020 22:03 UTC
25 points
10 comments, 5 min read, LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
88 points
14 comments, 9 min read, LW link
(arxiv.org)

Embracing complexity when developing and evaluating AI responsibly

Aliya Amirova, 11 Oct 2024 17:46 UTC
2 points
9 comments, 9 min read, LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr, 12 Feb 2021 7:55 UTC
15 points
0 comments, 27 min read, LW link

Debate Minus Factored Cognition

abramdemski, 29 Dec 2020 22:59 UTC
37 points
42 comments, 11 min read, LW link

Can there be an indescribable hellworld?

Stuart_Armstrong, 29 Jan 2019 15:00 UTC
39 points
19 comments, 2 min read, LW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. Brown, 16 Nov 2022 15:33 UTC
13 points
2 comments, 12 min read, LW link
(sambrown.eu)

Empathy bandaid for immediate AI catastrophe

installgentoo, 5 Apr 2023 2:12 UTC
1 point
2 comments, 1 min read, LW link

Debate helps supervise human experts [Paper]

habryka, 17 Nov 2023 5:25 UTC
29 points
6 comments, 1 min read, LW link
(github.com)

Notes on OpenAI’s alignment plan

Alex Flint, 8 Dec 2022 19:13 UTC
40 points
5 comments, 7 min read, LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir, 16 Sep 2024 1:04 UTC
5 points
1 comment, 5 min read, LW link

Control Vectors as Dispositional Traits

Gianluca Calcagni, 23 Jun 2024 21:34 UTC
9 points
0 comments, 11 min read, LW link

An AI-in-a-box success model

azsantosk, 11 Apr 2022 22:28 UTC
16 points
1 comment, 10 min read, LW link

Anthropic Fall 2023 Debate Progress Update

Ansh Radhakrishnan, 28 Nov 2023 5:37 UTC
74 points
9 comments, 12 min read, LW link

AI debate: test yourself against chess ‘AIs’

Richard Willis, 22 Nov 2023 14:58 UTC
26 points
35 comments, 4 min read, LW link

Learning the smooth prior

29 Apr 2022 21:10 UTC
35 points
0 comments, 12 min read, LW link

NYU Debate Training Update: Methods, Baselines, Preliminary Results

samarnesen, 6 Jul 2024 18:28 UTC
9 points
0 comments, 20 min read, LW link

Parallels Between AI Safety by Debate and Evidence Law

Cullen, 20 Jul 2020 22:52 UTC
10 points
1 comment, 2 min read, LW link
(cullenokeefe.com)

On scalable oversight with weak LLMs judging strong LLMs

8 Jul 2024 8:59 UTC
49 points
18 comments, 7 min read, LW link
(arxiv.org)