
AI-Assisted Alignment

Last edit: 25 Jan 2024 5:18 UTC by habryka

AI-Assisted Alignment is a cluster of alignment plans in which AI itself does significant work on alignment research. This ranges from weak tool AI assisting human researchers to more advanced AGI doing original research.

There has been a lot of debate about how practical this alignment approach is.

Other search terms for this tag: AI aligning AI

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
56 points
39 comments · 24 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
38 points
10 comments · 31 min read · LW link

We have to Upgrade

Jed McCaleb · 23 Mar 2023 17:53 UTC
129 points
35 comments · 2 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo · 29 May 2023 1:35 UTC
14 points
9 comments · 6 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie · 24 Aug 2022 18:37 UTC
107 points
4 comments · 7 min read · LW link

Cyborgism

10 Feb 2023 14:47 UTC
336 points
46 comments · 35 min read · LW link

[Link] A minimal viable product for alignment

janleike · 6 Apr 2022 15:38 UTC
53 points
38 comments · 1 min read · LW link

“Carefully Bootstrapped Alignment” is organizationally hard

Raemon · 17 Mar 2023 18:00 UTC
261 points
22 comments · 11 min read · LW link

Sufficiently many Godzillas as an alignment strategy

142857 · 28 Aug 2022 0:08 UTC
8 points
3 comments · 1 min read · LW link

Why Not Just… Build Weak AI Tools For AI Alignment Research?

johnswentworth · 5 Mar 2023 0:12 UTC
158 points
18 comments · 6 min read · LW link

AI-assisted list of ten concrete alignment things to do right now

lemonhope · 7 Sep 2022 8:38 UTC
8 points
5 comments · 4 min read · LW link

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · 15 May 2024 19:38 UTC
70 points
25 comments · 12 min read · LW link

Why Not Just Outsource Alignment Research To An AI?

johnswentworth · 9 Mar 2023 21:49 UTC
127 points
48 comments · 9 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC
6 points
0 comments · 2 min read · LW link
(www.magfrump.net)

Some Thoughts on AI Alignment: Using AI to Control AI

eigenvalue · 21 Jun 2024 17:44 UTC
1 point
1 comment · 1 min read · LW link
(github.com)

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad · 21 Jun 2022 12:36 UTC
13 points
7 comments · 9 min read · LW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · 9 Jul 2022 14:42 UTC
15 points
5 comments · 22 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · 13 Dec 2022 2:17 UTC
10 points
5 comments · 45 min read · LW link

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC
45 points
4 comments · 1 min read · LW link

My thoughts on OpenAI’s alignment plan

Akash · 30 Dec 2022 19:33 UTC
55 points
3 comments · 20 min read · LW link

[Linkpost] Jan Leike on three kinds of alignment taxes

Akash · 6 Jan 2023 23:57 UTC
27 points
2 comments · 3 min read · LW link
(aligned.substack.com)

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC
86 points
18 comments · 20 min read · LW link

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow · 8 Jan 2023 19:24 UTC
45 points
9 comments · 1 min read · LW link

Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review)

Shoshannah Tekofsky · 28 Jan 2023 5:26 UTC
53 points
7 comments · 7 min read · LW link

Model-driven feedback could amplify alignment failures

aogara · 30 Jan 2023 0:00 UTC
21 points
1 comment · 2 min read · LW link

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen · 1 Feb 2023 21:17 UTC
9 points
1 comment · 31 min read · LW link
(quri.substack.com)

Misaligned AGI Death Match

Nate Reinar Windwood · 14 May 2023 18:00 UTC
1 point
0 comments · 1 min read · LW link

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
108 points
9 comments · 6 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC
55 points
26 comments · 11 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
159 points
34 comments · 21 min read · LW link

Some thoughts on automating alignment research

Lukas Finnveden · 26 May 2023 1:50 UTC
30 points
4 comments · 6 min read · LW link

Intent alignment as a stepping-stone to value alignment

Seth Herd · 5 Nov 2024 20:43 UTC
32 points
4 comments · 3 min read · LW link

[Linkpost] Introducing Superalignment

beren · 5 Jul 2023 18:23 UTC
174 points
69 comments · 1 min read · LW link
(openai.com)

Agentized LLMs will change the alignment landscape

Seth Herd · 9 Apr 2023 2:29 UTC
157 points
97 comments · 3 min read · LW link

Introducing AlignmentSearch: An AI Alignment-Informed Conversational Agent

1 Apr 2023 16:39 UTC
79 points
14 comments · 4 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities

RobertM · 30 Mar 2023 21:31 UTC
16 points
2 comments · 6 min read · LW link

Discussion on utilizing AI for alignment

elifland · 23 Aug 2022 2:36 UTC
16 points
3 comments · 1 min read · LW link
(www.foxy-scout.com)

Godzilla Strategies

johnswentworth · 11 Jun 2022 15:44 UTC
145 points
71 comments · 3 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · 13 Mar 2023 21:20 UTC
256 points
42 comments · 22 min read · LW link

Ngo and Yudkowsky on alignment difficulty

15 Nov 2021 20:31 UTC
253 points
151 comments · 99 min read · LW link · 1 review

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo · 15 Sep 2022 17:54 UTC
35 points
12 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
59 points
8 comments · 20 min read · LW link

Provably Honest—A First Step

Srijanak De · 5 Nov 2022 19:18 UTC
10 points
2 comments · 8 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB · 1 Dec 2022 14:55 UTC
16 points
3 comments · 1 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
79 points
2 comments · 19 min read · LW link

Human Mimicry Mainly Works When We’re Already Close

johnswentworth · 17 Aug 2022 18:41 UTC
81 points
16 comments · 5 min read · LW link

Research Direction: Be the AGI you want to see in the world

5 Feb 2023 7:15 UTC
43 points
0 comments · 7 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G. · 26 Feb 2023 23:36 UTC
7 points
7 comments · 3 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI · 7 Mar 2023 18:47 UTC
1 point
4 comments · 1 min read · LW link

[Question] Can we get an AI to “do our alignment homework for us”?

Chris_Leong · 26 Feb 2024 7:56 UTC
53 points
33 comments · 1 min read · LW link

Project “MIRI as a Service”

RomanS · 8 Mar 2023 19:22 UTC
42 points
4 comments · 1 min read · LW link

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev · 29 Feb 2024 18:44 UTC
11 points
0 comments · 4 min read · LW link

Alignment in Thought Chains

Faust Nemesis · 4 Mar 2024 19:24 UTC
1 point
0 comments · 2 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome · 7 Mar 2024 17:16 UTC
13 points
0 comments · 9 min read · LW link

W2SG: Introduction

Maria Kapros · 10 Mar 2024 16:25 UTC
1 point
2 comments · 10 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · 9 Feb 2024 18:40 UTC
6 points
12 comments · 3 min read · LW link

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton · 18 Apr 2024 18:29 UTC
25 points
4 comments · 16 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai · 16 Jun 2024 13:01 UTC
7 points
0 comments · 7 min read · LW link
(arxiv.org)
(arxiv.org)

A Better Hyperstition (for AI readers)

Yeshua God · 15 Jul 2024 19:35 UTC
−20 points
0 comments · 119 min read · LW link

[Question] How to devour 5000 pages within a day if Chatgpt crashes upon the +50mb file containing the content? Need some recommendations.

Game · 27 Sep 2024 7:30 UTC
1 point
0 comments · 1 min read · LW link

AIsip Manifesto: A Scientific Exploration of Harmonious Co-Existence Between Humans, AI, and All Beings ChatGPT-4o’s Independent Perspective on AIsip, Signed by ChatGPT-4o and Endorsed by Carl Sellman

Carl Sellman · 11 Oct 2024 19:06 UTC
1 point
0 comments · 3 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · 14 Oct 2024 4:05 UTC
60 points
9 comments · 12 min read · LW link

Automation collapse

21 Oct 2024 14:50 UTC
70 points
10 comments · 7 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King · 21 Mar 2023 3:53 UTC
−1 points
2 comments · 9 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young · 29 Mar 2023 15:56 UTC
27 points
3 comments · 6 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus · 6 Apr 2023 2:07 UTC
2 points
1 comment · 1 min read · LW link

Scientism vs. people

Roman Leventov · 18 Apr 2023 17:28 UTC
4 points
4 comments · 11 min read · LW link

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel · 19 Apr 2023 20:13 UTC
−1 points
0 comments · 1 min read · LW link

[Question] Shouldn’t we ‘Just’ Superimitate Low-Res Uploads?

lukemarks · 3 Nov 2023 7:42 UTC
15 points
2 comments · 2 min read · LW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley · 17 Nov 2023 20:55 UTC
16 points
8 comments · 15 min read · LW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea · 28 Nov 2023 14:47 UTC
4 points
1 comment · 1 min read · LW link
(docs.google.com)

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC
64 points
30 comments · 11 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
33 points
3 comments · 15 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:12 UTC
16 points
0 comments · 12 min read · LW link

A potentially high impact differential technological development area

Noosphere89 · 8 Jun 2023 14:33 UTC
5 points
2 comments · 2 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · 21 Jun 2023 15:43 UTC
21 points
1 comment · 31 min read · LW link

OpenAI Launches Superalignment Taskforce

Zvi · 11 Jul 2023 13:00 UTC
149 points
40 comments · 49 min read · LW link
(thezvi.wordpress.com)

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel · 12 Jul 2023 17:47 UTC
9 points
15 comments · 5 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments · 9 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI · 27 Jul 2023 3:48 UTC
2 points
6 comments · 1 min read · LW link

Could We Automate AI Alignment Research?

Stephen McAleese · 10 Aug 2023 12:17 UTC
32 points
10 comments · 21 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo · 24 Aug 2022 1:29 UTC
6 points
1 comment · 1 min read · LW link

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC
64 points
38 comments · 10 min read · LW link