AI-Assisted Alignment

Last edit: Jan 25, 2024, 5:18 AM by habryka

AI-Assisted Alignment is a cluster of alignment plans in which AI itself plays a significant role in alignment research. This can range from weak tool AI assisting human researchers to more advanced AGI doing original research.

There has been a lot of debate about how practical this alignment approach is.

Other search terms for this tag: AI aligning AI

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · Jul 6, 2024, 1:23 AM
60 points
39 comments · 24 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · Feb 14, 2024, 7:10 AM
40 points
12 comments · 31 min read · LW link

We have to Upgrade

Jed McCaleb · Mar 23, 2023, 5:53 PM
129 points
35 comments · 2 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo · May 29, 2023, 1:35 AM
14 points
9 comments · 6 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · Dec 5, 2022, 10:51 PM
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Infinite Possibility Space and the Shutdown Problem

magfrump · Oct 18, 2022, 5:37 AM
9 points
0 comments · 2 min read · LW link
(www.magfrump.net)

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie · Aug 24, 2022, 6:37 PM
107 points
4 comments · 7 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · Nov 28, 2023, 7:56 PM
64 points
30 comments · 11 min read · LW link

[Link] A minimal viable product for alignment

janleike · Apr 6, 2022, 3:38 PM
53 points
38 comments · 1 min read · LW link

Cyborgism

Feb 10, 2023, 2:47 PM
336 points
46 comments · 35 min read · LW link · 2 reviews

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

Apr 19, 2023, 4:09 PM
168 points
40 comments · 21 min read · LW link · 2 reviews

AI for Resolving Forecasting Questions: An Early Exploration

ozziegooen · Jan 16, 2025, 9:41 PM
10 points
2 comments · 1 min read · LW link

[Linkpost] Introducing Superalignment

beren · Jul 5, 2023, 6:23 PM
175 points
69 comments · 1 min read · LW link
(openai.com)

Sufficiently many Godzillas as an alignment strategy

142857 · Aug 28, 2022, 12:08 AM
8 points
3 comments · 1 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · Mar 13, 2023, 9:20 PM
265 points
43 comments · 22 min read · LW link · 1 review

AI-assisted list of ten concrete alignment things to do right now

lemonhope · Sep 7, 2022, 8:38 AM
8 points
5 comments · 4 min read · LW link

Cyborg Periods: There will be multiple AI transitions

Feb 22, 2023, 4:09 PM
108 points
9 comments · 6 min read · LW link

Some thoughts on automating alignment research

Lukas Finnveden · May 26, 2023, 1:50 AM
30 points
4 comments · 6 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd · Apr 18, 2023, 4:29 PM
86 points
18 comments · 20 min read · LW link

Misaligned AGI Death Match

Nate Reinar Windwood · May 14, 2023, 6:00 PM
1 point
0 comments · 1 min read · LW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad · Jun 21, 2022, 12:36 PM
13 points
7 comments · 9 min read · LW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · Jul 9, 2022, 2:42 PM
15 points
5 comments · 22 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · Dec 13, 2022, 2:17 AM
10 points
5 comments · 45 min read · LW link

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · May 15, 2024, 7:38 PM
79 points
28 comments · 12 min read · LW link

Intent alignment as a stepping-stone to value alignment

Seth Herd · Nov 5, 2024, 8:43 PM
37 points
6 comments · 3 min read · LW link

A survey of tool use and workflows in alignment research

Mar 23, 2022, 11:44 PM
45 points
4 comments · 1 min read · LW link

My thoughts on OpenAI’s alignment plan

Akash · Dec 30, 2022, 7:33 PM
55 points
3 comments · 20 min read · LW link

[Linkpost] Jan Leike on three kinds of alignment taxes

Akash · Jan 6, 2023, 11:57 PM
27 points
2 comments · 3 min read · LW link
(aligned.substack.com)

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow · Jan 8, 2023, 7:24 PM
47 points
9 comments · 1 min read · LW link

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen · Feb 1, 2023, 9:17 PM
9 points
1 comment · 31 min read · LW link
(quri.substack.com)

Internal independent review for language model agent alignment

Seth Herd · Jul 7, 2023, 6:54 AM
55 points
30 comments · 11 min read · LW link

Some Thoughts on AI Alignment: Using AI to Control AI

eigenvalue · Jun 21, 2024, 5:44 PM
1 point
1 comment · 1 min read · LW link
(github.com)

Anti-Slop Interventions?

abramdemski · Feb 4, 2025, 7:50 PM
74 points
33 comments · 6 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd · Apr 9, 2023, 2:29 AM
160 points
102 comments · 3 min read · LW link · 1 review

Introducing AlignmentSearch: An AI Alignment-Informed Conversional Agent

Apr 1, 2023, 4:39 PM
79 points
14 comments · 4 min read · LW link

Discussion on utilizing AI for alignment

elifland · Aug 23, 2022, 2:36 AM
16 points
3 comments · 1 min read · LW link
(www.foxy-scout.com)

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea · Nov 28, 2023, 2:47 PM
4 points
1 comment · 1 min read · LW link
(docs.google.com)

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · May 8, 2023, 9:26 PM
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · May 25, 2023, 9:26 AM
33 points
3 comments · 15 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · May 29, 2023, 11:12 AM
16 points
0 comments · 12 min read · LW link

A potentially high impact differential technological development area

Noosphere89 · Jun 8, 2023, 2:33 PM
5 points
2 comments · 2 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · Jun 21, 2023, 3:43 PM
21 points
1 comment · 31 min read · LW link

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel · Jul 12, 2023, 5:47 PM
9 points
15 comments · 5 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

Jul 15, 2023, 7:12 PM
47 points
5 comments · 9 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI · Jul 27, 2023, 3:48 AM
2 points
6 comments · 1 min read · LW link

Could We Automate AI Alignment Research?

Stephen McAleese · Aug 10, 2023, 12:17 PM
34 points
10 comments · 21 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo · Aug 24, 2022, 1:29 AM
6 points
1 comment · 1 min read · LW link

Prize for Alignment Research Tasks

Apr 29, 2022, 8:57 AM
64 points
38 comments · 10 min read · LW link

Ngo and Yudkowsky on alignment difficulty

Nov 15, 2021, 8:31 PM
254 points
151 comments · 99 min read · LW link · 1 review

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo · Sep 15, 2022, 5:54 PM
35 points
12 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · Jul 18, 2022, 7:11 AM
59 points
8 comments · 20 min read · LW link

Provably Honest—A First Step

Srijanak De · Nov 5, 2022, 7:18 PM
10 points
2 comments · 8 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB · Dec 1, 2022, 2:55 PM
16 points
3 comments · 1 min read · LW link

Results from a survey on tool use and workflows in alignment research

Dec 19, 2022, 3:19 PM
79 points
2 comments · 19 min read · LW link

Research Direction: Be the AGI you want to see in the world

Feb 5, 2023, 7:15 AM
43 points
0 comments · 7 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G. · Feb 26, 2023, 11:36 PM
7 points
7 comments · 3 min read · LW link

[Question] Can we get an AI to “do our alignment homework for us”?

Chris_Leong · Feb 26, 2024, 7:56 AM
53 points
33 comments · 1 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI · Mar 7, 2023, 6:47 PM
1 point
4 comments · 1 min read · LW link

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev · Feb 29, 2024, 6:44 PM
11 points
0 comments · 4 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment

ank · Feb 15, 2025, 11:08 AM
2 points
2 comments · 1 min read · LW link

Alignment in Thought Chains

Faust Nemesis · Mar 4, 2024, 7:24 PM
1 point
0 comments · 2 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome · Mar 7, 2024, 5:16 PM
13 points
0 comments · 9 min read · LW link

W2SG: Introduction

Maria Kapros · Mar 10, 2024, 4:25 PM
2 points
2 comments · 10 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · Feb 9, 2024, 6:40 PM
6 points
12 comments · 3 min read · LW link

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton · Apr 18, 2024, 6:29 PM
25 points
4 comments · 16 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai · Jun 16, 2024, 1:01 PM
7 points
0 comments · 7 min read · LW link
(arxiv.org)

As We May Align

Gilbert C · Dec 20, 2024, 7:02 PM
−1 points
0 comments · 6 min read · LW link

[Question] How to devour 5000 pages within a day if Chatgpt crashes upon the +50mb file containing the content? Need some recommendations.

Game · Sep 27, 2024, 7:30 AM
1 point
0 comments · 1 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?

Gerard Boxo · Dec 26, 2024, 5:34 PM
3 points
4 comments · 1 min read · LW link

AIsip Manifesto: A Scientific Exploration of Harmonious Co-Existence Between Humans, AI, and All Beings ChatGPT-4o’s Independent Perspective on AIsip, Signed by ChatGPT-4o and Endorsed by Carl Sellman

Carl Sellman · Oct 11, 2024, 7:06 PM
1 point
0 comments · 3 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · Oct 14, 2024, 4:05 AM
60 points
9 comments · 12 min read · LW link

A Solution for AGI/ASI Safety

Weibing Wang · Dec 18, 2024, 7:44 PM
50 points
29 comments · 1 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities

RobertM · 30 Mar 2023 21:31 UTC
16 points
2 comments · 6 min read · LW link

Godzilla Strategies

johnswentworth · 11 Jun 2022 15:44 UTC
158 points
71 comments · 3 min read · LW link

Model-driven feedback could amplify alignment failures

aogara · 30 Jan 2023 0:00 UTC
21 points
1 comment · 2 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · 30 Jan 2025 10:58 UTC
5 points
10 comments · 10 min read · LW link
(tetherware.substack.com)

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)

Serhii Zamrii · 3 Feb 2025 19:31 UTC
2 points
0 comments · 7 min read · LW link

Language Models and World Models, a Philosophy

kyjohnso · 3 Feb 2025 2:55 UTC
1 point
0 comments · 1 min read · LW link
(hylaeansea.org)

Gettier Cases [repost]

Antigone · 3 Feb 2025 18:12 UTC
−4 points
4 comments · 2 min read · LW link

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · 4 Feb 2025 2:55 UTC
16 points
0 comments · 1 min read · LW link
(www.anthropic.com)

Does Time Linearity Shape Human Self-Directed Evolution, and will AGI/ASI Transcend or Destabilise Reality?

Emmely · 5 Feb 2025 7:58 UTC
1 point
0 comments · 3 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King · 21 Mar 2023 3:53 UTC
−1 points
2 comments · 9 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young · 29 Mar 2023 15:56 UTC
27 points
3 comments · 6 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus · 6 Apr 2023 2:07 UTC
2 points
1 comment · 1 min read · LW link

Scientism vs. people

Roman Leventov · 18 Apr 2023 17:28 UTC
4 points
4 comments · 11 min read · LW link

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel · 19 Apr 2023 20:13 UTC
−1 points
0 comments · 1 min read · LW link

[Question] Shouldn’t we ‘Just’ Superimitate Low-Res Uploads?

lukemarks · 3 Nov 2023 7:42 UTC
15 points
2 comments · 2 min read · LW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley · 17 Nov 2023 20:55 UTC
16 points
8 comments · 15 min read · LW link