AI-Assisted Alignment

Last edit: 25 Jan 2024 5:18 UTC by habryka

AI-Assisted Alignment is a cluster of alignment plans in which AI significantly helps with alignment research, ranging from weak tool AI to more advanced AGI doing original research.

There has been a lot of debate about how practical this alignment approach is.

Other search terms for this tag: AI aligning AI

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
61 points
41 comments · 24 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
41 points
12 comments · 31 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo · 29 May 2023 1:35 UTC
14 points
9 comments · 6 min read · LW link

We have to Upgrade

Jed McCaleb · 23 Mar 2023 17:53 UTC
131 points
35 comments · 2 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie · 24 Aug 2022 18:37 UTC
107 points
4 comments · 7 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC
65 points
30 comments · 11 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC
9 points
0 comments · 2 min read · LW link
(www.magfrump.net)

[Link] A minimal viable product for alignment

janleike · 6 Apr 2022 15:38 UTC
53 points
38 comments · 1 min read · LW link

Cyborgism

10 Feb 2023 14:47 UTC
341 points
46 comments · 35 min read · LW link · 2 reviews

Misaligned AGI Death Match

Nate Reinar Windwood · 14 May 2023 18:00 UTC
1 point
0 comments · 1 min read · LW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad · 21 Jun 2022 12:36 UTC
13 points
7 comments · 9 min read · LW link

Introducing AlignmentSearch: An AI Alignment-Informed Conversational Agent

1 Apr 2023 16:39 UTC
79 points
14 comments · 4 min read · LW link

Some Thoughts on AI Alignment: Using AI to Control AI

eigenvalue · 21 Jun 2024 17:44 UTC
1 point
1 comment · 1 min read · LW link
(github.com)

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · 13 Dec 2022 2:17 UTC
10 points
5 comments · 45 min read · LW link

Some thoughts on automating alignment research

Lukas Finnveden · 26 May 2023 1:50 UTC
30 points
4 comments · 6 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
168 points
40 comments · 21 min read · LW link · 2 reviews

AI Tools for Existential Security

14 Mar 2025 18:38 UTC
22 points
4 comments · 11 min read · LW link
(www.forethought.org)

Can we safely automate alignment research?

Joe Carlsmith · 30 Apr 2025 17:37 UTC
53 points
29 comments · 48 min read · LW link
(joecarlsmith.com)

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham · 23 Feb 2025 5:46 UTC
29 points
8 comments · 8 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd · 9 Apr 2023 2:29 UTC
160 points
102 comments · 3 min read · LW link · 1 review

[Linkpost] Introducing Superalignment

beren · 5 Jul 2023 18:23 UTC
175 points
69 comments · 1 min read · LW link
(openai.com)

[Linkpost] Jan Leike on three kinds of alignment taxes

Orpheus16 · 6 Jan 2023 23:57 UTC
27 points
2 comments · 3 min read · LW link
(aligned.substack.com)

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · 15 May 2024 19:38 UTC
80 points
28 comments · 12 min read · LW link

Maintaining Alignment during RSI as a Feedback Control Problem

beren · 2 Mar 2025 0:21 UTC
66 points
6 comments · 11 min read · LW link

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow · 8 Jan 2023 19:24 UTC
47 points
9 comments · 1 min read · LW link

Discussion on utilizing AI for alignment

elifland · 23 Aug 2022 2:36 UTC
16 points
3 comments · 1 min read · LW link
(www.foxy-scout.com)

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC
45 points
4 comments · 1 min read · LW link

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
108 points
9 comments · 6 min read · LW link

The prospect of accelerated AI safety progress, including philosophical progress

Mitchell_Porter · 13 Mar 2025 10:52 UTC
11 points
0 comments · 4 min read · LW link

AI for Resolving Forecasting Questions: An Early Exploration

ozziegooen · 16 Jan 2025 21:41 UTC
10 points
2 comments · 1 min read · LW link

Anti-Slop Interventions?

abramdemski · 4 Feb 2025 19:50 UTC
76 points
33 comments · 6 min read · LW link

Sufficiently many Godzillas as an alignment strategy

142857 · 28 Aug 2022 0:08 UTC
8 points
3 comments · 1 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · 13 Mar 2023 21:20 UTC
267 points
43 comments · 22 min read · LW link · 1 review

How might we safely pass the buck to AI?

joshc · 19 Feb 2025 17:48 UTC
83 points
58 comments · 31 min read · LW link

AI for AI safety

Joe Carlsmith · 14 Mar 2025 15:00 UTC
78 points
13 comments · 17 min read · LW link
(joecarlsmith.substack.com)

AI-assisted list of ten concrete alignment things to do right now

lemonhope · 7 Sep 2022 8:38 UTC
8 points
5 comments · 4 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC
88 points
18 comments · 20 min read · LW link

Intent alignment as a stepping-stone to value alignment

Seth Herd · 5 Nov 2024 20:43 UTC
37 points
8 comments · 3 min read · LW link

Video and transcript of talk on automating alignment research

Joe Carlsmith · 30 Apr 2025 17:43 UTC
21 points
0 comments · 24 min read · LW link
(joecarlsmith.com)

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen · 1 Feb 2023 21:17 UTC
9 points
1 comment · 31 min read · LW link
(quri.substack.com)

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · 9 Jul 2022 14:42 UTC
15 points
5 comments · 22 min read · LW link

My thoughts on OpenAI’s alignment plan

Orpheus16 · 30 Dec 2022 19:33 UTC
55 points
3 comments · 20 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC
55 points
30 comments · 11 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment

ank · 15 Feb 2025 11:08 UTC
2 points
2 comments · 2 min read · LW link

[Question] I Tried to Formalize Meaning. I May Have Accidentally Described Consciousness.

Erichcurtis91 · 30 Apr 2025 3:16 UTC
0 points
0 comments · 2 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome · 7 Mar 2024 17:16 UTC
14 points
0 comments · 9 min read · LW link

W2SG: Introduction

Maria Kapros · 10 Mar 2024 16:25 UTC
2 points
2 comments · 10 min read · LW link

[Question] How to devour 5000 pages within a day if ChatGPT crashes upon the +50mb file containing the content? Need some recommendations.

Game · 27 Sep 2024 7:30 UTC
1 point
0 comments · 1 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young · 29 Mar 2023 15:56 UTC
27 points
3 comments · 6 min read · LW link

We should try to automate AI safety work asap

Marius Hobbhahn · 26 Apr 2025 16:35 UTC
108 points
10 comments · 15 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai · 16 Jun 2024 13:01 UTC
7 points
0 comments · 7 min read · LW link
(arxiv.org)

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel · 19 Apr 2023 20:13 UTC
−1 points
0 comments · 1 min read · LW link

Is Alignment a flawed approach?

Patrick Bernard · 11 Mar 2025 20:32 UTC
1 point
0 comments · 3 min read · LW link

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel · 12 Jul 2023 17:47 UTC
9 points
15 comments · 5 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB · 1 Dec 2022 14:55 UTC
16 points
3 comments · 1 min read · LW link

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning

Jeremias Ferrao · 18 Apr 2025 19:34 UTC
5 points
0 comments · 10 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · 4 Feb 2025 2:55 UTC
16 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC
64 points
38 comments · 10 min read · LW link

Godzilla Strategies

johnswentworth · 11 Jun 2022 15:44 UTC
159 points
72 comments · 3 min read · LW link

A potentially high impact differential technological development area

Noosphere89 · 8 Jun 2023 14:33 UTC
5 points
2 comments · 2 min read · LW link

Language Models and World Models, a Philosophy

kyjohnso · 3 Feb 2025 2:55 UTC
1 point
0 comments · 1 min read · LW link
(hylaeansea.org)

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo · 15 Sep 2022 17:54 UTC
35 points
12 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
60 points
8 comments · 20 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G. · 26 Feb 2023 23:36 UTC
7 points
7 comments · 3 min read · LW link

AI-Generated GitHub repo backdated with junk then filled with my systems work. Has anyone seen this before?

rgunther · 1 May 2025 20:14 UTC
7 points
1 comment · 1 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
33 points
3 comments · 15 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?

Gerard Boxo · 26 Dec 2024 17:34 UTC
3 points
4 comments · 1 min read · LW link

Could We Automate AI Alignment Research?

Stephen McAleese · 10 Aug 2023 12:17 UTC
34 points
10 comments · 21 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI · 7 Mar 2023 18:47 UTC
1 point
4 comments · 1 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King · 21 Mar 2023 3:53 UTC
−1 points
2 comments · 9 min read · LW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley · 17 Nov 2023 20:55 UTC
17 points
8 comments · 15 min read · LW link

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)

Serhii Zamrii · 3 Feb 2025 19:31 UTC
2 points
0 comments · 11 min read · LW link

Research Direction: Be the AGI you want to see in the world

5 Feb 2023 7:15 UTC
44 points
0 comments · 7 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments · 9 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo · 24 Aug 2022 1:29 UTC
6 points
1 comment · 1 min read · LW link

Recursive alignment with the principle of alignment

hive · 27 Feb 2025 2:34 UTC
9 points
1 comment · 15 min read · LW link
(hiveism.substack.com)

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev · 29 Feb 2024 18:44 UTC
11 points
0 comments · 4 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus · 6 Apr 2023 2:07 UTC
2 points
1 comment · 1 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · 30 Jan 2025 10:58 UTC
5 points
14 comments · 10 min read · LW link
(tetherware.substack.com)

Does Time Linearity Shape Human Self-Directed Evolution, and will AGI/ASI Transcend or Destabilise Reality?

Emmely · 5 Feb 2025 7:58 UTC
1 point
0 comments · 3 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities

RobertM · 30 Mar 2023 21:31 UTC
16 points
2 comments · 6 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:12 UTC
16 points
0 comments · 12 min read · LW link

AIsip Manifesto: A Scientific Exploration of Harmonious Co-Existence Between Humans, AI, and All Beings ChatGPT-4o’s Independent Perspective on AIsip, Signed by ChatGPT-4o and Endorsed by Carl Sellman

Carl Sellman · 11 Oct 2024 19:06 UTC
1 point
0 comments · 3 min read · LW link

As We May Align

Gilbert C · 20 Dec 2024 19:02 UTC
−1 points
0 comments · 6 min read · LW link

Ngo and Yudkowsky on alignment difficulty

15 Nov 2021 20:31 UTC
259 points
151 comments · 99 min read · LW link · 1 review

A Solution for AGI/ASI Safety

Weibing Wang · 18 Dec 2024 19:44 UTC
50 points
29 comments · 1 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
79 points
2 comments · 19 min read · LW link

Provably Honest—A First Step

Srijanak De · 5 Nov 2022 19:18 UTC
10 points
2 comments · 8 min read · LW link

Alignment in Thought Chains

Faust Nemesis · 4 Mar 2024 19:24 UTC
1 point
0 comments · 2 min read · LW link

[Question] How far along METR’s law can AI start automating or helping with alignment research?

Christopher King · 20 Mar 2025 15:58 UTC
20 points
21 comments · 1 min read · LW link

Gettier Cases [repost]

Antigone · 3 Feb 2025 18:12 UTC
−4 points
5 comments · 2 min read · LW link

[Question] Shouldn’t we ‘Just’ Superimitate Low-Res Uploads?

lukemarks · 3 Nov 2023 7:42 UTC
15 points
2 comments · 2 min read · LW link

Scientism vs. people

Roman Leventov · 18 Apr 2023 17:28 UTC
4 points
4 comments · 11 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · 14 Oct 2024 4:05 UTC
60 points
9 comments · 12 min read · LW link

[Question] Can we get an AI to “do our alignment homework for us”?

Chris_Leong · 26 Feb 2024 7:56 UTC
53 points
33 comments · 1 min read · LW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea · 28 Nov 2023 14:47 UTC
4 points
1 comment · 1 min read · LW link
(docs.google.com)

Proposal: Derivative Information Theory (DIT) — A Dynamic Model of Agency and Consciousness

Yogmog · 14 Apr 2025 0:27 UTC
1 point
0 comments · 2 min read · LW link

Model-driven feedback could amplify alignment failures

aog · 30 Jan 2023 0:00 UTC
21 points
1 comment · 2 min read · LW link

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton · 18 Apr 2024 18:29 UTC
25 points
4 comments · 16 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · 9 Feb 2024 18:40 UTC
6 points
12 comments · 3 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI · 27 Jul 2023 3:48 UTC
2 points
6 comments · 1 min read · LW link

Emergence of superintelligence from AI hiveminds: how to make it human-friendly?

Mitchell_Porter · 27 Apr 2025 4:51 UTC
12 points
0 comments · 2 min read · LW link

I Don’t Use AI — I Reflect With It

badjack badjack · 3 May 2025 14:45 UTC
1 point
0 comments · 1 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · 21 Jun 2023 15:43 UTC
21 points
1 comment · 31 min read · LW link

Prospects for Alignment Automation: Interpretability Case Study

21 Mar 2025 14:05 UTC
32 points
5 comments · 8 min read · LW link