
AI-Assisted Alignment

Last edit: 25 Jan 2024 5:18 UTC by habryka

AI-Assisted Alignment is a cluster of alignment plans in which AI itself does significant work on alignment research. This ranges from weak tool AI assisting human researchers to more advanced AGI doing original research.

There has been a lot of debate about how practical this alignment approach is.

Other search terms for this tag: AI aligning AI

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
56 points
39 comments · 24 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
38 points
10 comments · 31 min read · LW link

We have to Upgrade

Jed McCaleb · 23 Mar 2023 17:53 UTC
129 points
35 comments · 2 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo · 29 May 2023 1:35 UTC
14 points
9 comments · 6 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie · 24 Aug 2022 18:37 UTC
107 points
4 comments · 7 min read · LW link

Cyborgism

10 Feb 2023 14:47 UTC
336 points
46 comments · 35 min read · LW link

[Link] A minimal viable product for alignment

janleike · 6 Apr 2022 15:38 UTC
53 points
38 comments · 1 min read · LW link

“Carefully Bootstrapped Alignment” is organizationally hard

Raemon · 17 Mar 2023 18:00 UTC
261 points
22 comments · 11 min read · LW link

Sufficiently many Godzillas as an alignment strategy

142857 · 28 Aug 2022 0:08 UTC
8 points
3 comments · 1 min read · LW link

Why Not Just… Build Weak AI Tools For AI Alignment Research?

johnswentworth · 5 Mar 2023 0:12 UTC
158 points
18 comments · 6 min read · LW link

AI-assisted list of ten concrete alignment things to do right now

lemonhope · 7 Sep 2022 8:38 UTC
8 points
5 comments · 4 min read · LW link

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · 15 May 2024 19:38 UTC
70 points
25 comments · 12 min read · LW link

Why Not Just Outsource Alignment Research To An AI?

johnswentworth · 9 Mar 2023 21:49 UTC
127 points
48 comments · 9 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC
6 points
0 comments · 2 min read · LW link
(www.magfrump.net)

Some Thoughts on AI Alignment: Using AI to Control AI

eigenvalue · 21 Jun 2024 17:44 UTC
1 point
1 comment · 1 min read · LW link
(github.com)

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad · 21 Jun 2022 12:36 UTC
13 points
7 comments · 9 min read · LW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · 9 Jul 2022 14:42 UTC
15 points
5 comments · 22 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · 13 Dec 2022 2:17 UTC
10 points
5 comments · 45 min read · LW link

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC
45 points
4 comments · 1 min read · LW link

My thoughts on OpenAI’s alignment plan

Akash · 30 Dec 2022 19:33 UTC
55 points
3 comments · 20 min read · LW link

[Linkpost] Jan Leike on three kinds of alignment taxes

Akash · 6 Jan 2023 23:57 UTC
27 points
2 comments · 3 min read · LW link
(aligned.substack.com)

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC
86 points
18 comments · 20 min read · LW link

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow · 8 Jan 2023 19:24 UTC
45 points
9 comments · 1 min read · LW link

Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review)

Shoshannah Tekofsky · 28 Jan 2023 5:26 UTC
53 points
7 comments · 7 min read · LW link

Model-driven feedback could amplify alignment failures

aogara · 30 Jan 2023 0:00 UTC
21 points
1 comment · 2 min read · LW link

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen · 1 Feb 2023 21:17 UTC
9 points
1 comment · 31 min read · LW link
(quri.substack.com)

Misaligned AGI Death Match

Nate Reinar Windwood · 14 May 2023 18:00 UTC
1 point
0 comments · 1 min read · LW link

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
108 points
9 comments · 6 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC
55 points
26 comments · 11 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
159 points
34 comments · 21 min read · LW link

Some thoughts on automating alignment research

Lukas Finnveden · 26 May 2023 1:50 UTC
30 points
4 comments · 6 min read · LW link

Intent alignment as a stepping-stone to value alignment

Seth Herd · 5 Nov 2024 20:43 UTC
32 points
4 comments · 3 min read · LW link

[Linkpost] Introducing Superalignment

beren · 5 Jul 2023 18:23 UTC
174 points
69 comments · 1 min read · LW link
(openai.com)

Agentized LLMs will change the alignment landscape

Seth Herd · 9 Apr 2023 2:29 UTC
157 points
97 comments · 3 min read · LW link

Introducing AlignmentSearch: An AI Alignment-Informed Conversational Agent

1 Apr 2023 16:39 UTC
79 points
14 comments · 4 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities

RobertM · 30 Mar 2023 21:31 UTC
16 points
2 comments · 6 min read · LW link

Discussion on utilizing AI for alignment

elifland · 23 Aug 2022 2:36 UTC
16 points
3 comments · 1 min read · LW link
(www.foxy-scout.com)

Godzilla Strategies

johnswentworth · 11 Jun 2022 15:44 UTC
145 points
71 comments · 3 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · 13 Mar 2023 21:20 UTC
256 points
42 comments · 22 min read · LW link

Ngo and Yudkowsky on alignment difficulty

15 Nov 2021 20:31 UTC
253 points
151 comments · 99 min read · LW link · 1 review

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo · 15 Sep 2022 17:54 UTC
35 points
12 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
59 points
8 comments · 20 min read · LW link

Provably Honest—A First Step

Srijanak De · 5 Nov 2022 19:18 UTC
10 points
2 comments · 8 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB · 1 Dec 2022 14:55 UTC
16 points
3 comments · 1 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
79 points
2 comments · 19 min read · LW link

Human Mimicry Mainly Works When We’re Already Close

johnswentworth · 17 Aug 2022 18:41 UTC
81 points
16 comments · 5 min read · LW link

Research Direction: Be the AGI you want to see in the world

5 Feb 2023 7:15 UTC
43 points
0 comments · 7 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G. · 26 Feb 2023 23:36 UTC
7 points
7 comments · 3 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI · 7 Mar 2023 18:47 UTC
1 point
4 comments · 1 min read · LW link

[Question] Can we get an AI to “do our alignment homework for us”?

Chris_Leong · 26 Feb 2024 7:56 UTC
53 points
33 comments · 1 min read · LW link

Project “MIRI as a Service”

RomanS · 8 Mar 2023 19:22 UTC
42 points
4 comments · 1 min read · LW link

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev · 29 Feb 2024 18:44 UTC
11 points
0 comments · 4 min read · LW link

Alignment in Thought Chains

Faust Nemesis · 4 Mar 2024 19:24 UTC
1 point
0 comments · 2 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome · 7 Mar 2024 17:16 UTC
13 points
0 comments · 9 min read · LW link

W2SG: Introduction

Maria Kapros · 10 Mar 2024 16:25 UTC
1 point
2 comments · 10 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · 9 Feb 2024 18:40 UTC
6 points
12 comments · 3 min read · LW link

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton · 18 Apr 2024 18:29 UTC
25 points
4 comments · 16 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai · 16 Jun 2024 13:01 UTC
7 points
0 comments · 7 min read · LW link
(arxiv.org)
(arxiv.org)

A Better Hyperstition (for AI readers)

Yeshua God · 15 Jul 2024 19:35 UTC
−20 points
0 comments · 119 min read · LW link

[Question] How to devour 5000 pages within a day if Chatgpt crashes upon the +50mb file containing the content? Need some recommendations.

Game · 27 Sep 2024 7:30 UTC
1 point
0 comments · 1 min read · LW link

AIsip Manifesto: A Scientific Exploration of Harmonious Co-Existence Between Humans, AI, and All Beings ChatGPT-4o’s Independent Perspective on AIsip, Signed by ChatGPT-4o and Endorsed by Carl Sellman

Carl Sellman · 11 Oct 2024 19:06 UTC
1 point
0 comments · 3 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · 14 Oct 2024 4:05 UTC
60 points
9 comments · 12 min read · LW link

Automation collapse

21 Oct 2024 14:50 UTC
70 points
10 comments · 7 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King · 21 Mar 2023 3:53 UTC
−1 points
2 comments · 9 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young · 29 Mar 2023 15:56 UTC
27 points
3 comments · 6 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus · 6 Apr 2023 2:07 UTC
2 points
1 comment · 1 min read · LW link

Scientism vs. people

Roman Leventov · 18 Apr 2023 17:28 UTC
4 points
4 comments · 11 min read · LW link

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel · 19 Apr 2023 20:13 UTC
−1 points
0 comments · 1 min read · LW link

[Question] Shouldn’t we ‘Just’ Superimitate Low-Res Uploads?

lukemarks · 3 Nov 2023 7:42 UTC
15 points
2 comments · 2 min read · LW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley · 17 Nov 2023 20:55 UTC
16 points
8 comments · 15 min read · LW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea · 28 Nov 2023 14:47 UTC
4 points
1 comment · 1 min read · LW link
(docs.google.com)

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC
64 points
30 comments · 11 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
33 points
3 comments · 15 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:12 UTC
16 points
0 comments · 12 min read · LW link

A potentially high impact differential technological development area

Noosphere89 · 8 Jun 2023 14:33 UTC
5 points
2 comments · 2 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · 21 Jun 2023 15:43 UTC
21 points
1 comment · 31 min read · LW link

OpenAI Launches Superalignment Taskforce

Zvi · 11 Jul 2023 13:00 UTC
149 points
40 comments · 49 min read · LW link
(thezvi.wordpress.com)

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel · 12 Jul 2023 17:47 UTC
9 points
15 comments · 5 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments · 9 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI · 27 Jul 2023 3:48 UTC
2 points
6 comments · 1 min read · LW link

Could We Automate AI Alignment Research?

Stephen McAleese · 10 Aug 2023 12:17 UTC
32 points
10 comments · 21 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo · 24 Aug 2022 1:29 UTC
6 points
1 comment · 1 min read · LW link

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC
64 points
38 comments · 10 min read · LW link