AI-Assisted Alignment

Last edit: 25 Jan 2024 5:18 UTC by habryka

AI-Assisted Alignment is a cluster of alignment plans in which AI significantly helps with alignment research, ranging from weak tool AI to more advanced AGI doing original research.

There has been a lot of debate about how practical this alignment approach is.

Other search terms for this tag: AI aligning AI

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
61 points
41 comments · 24 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
41 points
12 comments · 31 min read · LW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo · 29 May 2023 1:35 UTC
14 points
9 comments · 6 min read · LW link

We have to Upgrade

Jed McCaleb · 23 Mar 2023 17:53 UTC
131 points
35 comments · 2 min read · LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie · 24 Aug 2022 18:37 UTC
107 points
4 comments · 7 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC
65 points
30 comments · 11 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC
9 points
0 comments · 2 min read · LW link
(www.magfrump.net)

[Link] A minimal viable product for alignment

janleike · 6 Apr 2022 15:38 UTC
53 points
38 comments · 1 min read · LW link

Cyborgism

10 Feb 2023 14:47 UTC
341 points
46 comments · 35 min read · LW link · 2 reviews

Misaligned AGI Death Match

Nate Reinar Windwood · 14 May 2023 18:00 UTC
1 point
0 comments · 1 min read · LW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad · 21 Jun 2022 12:36 UTC
13 points
7 comments · 9 min read · LW link

Introducing AlignmentSearch: An AI Alignment-Informed Conversational Agent

1 Apr 2023 16:39 UTC
79 points
14 comments · 4 min read · LW link

Some Thoughts on AI Alignment: Using AI to Control AI

eigenvalue · 21 Jun 2024 17:44 UTC
1 point
1 comment · 1 min read · LW link
(github.com)

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · 13 Dec 2022 2:17 UTC
10 points
5 comments · 45 min read · LW link

Some thoughts on automating alignment research

Lukas Finnveden · 26 May 2023 1:50 UTC
30 points
4 comments · 6 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
168 points
40 comments · 21 min read · LW link · 2 reviews

AI Tools for Existential Security

14 Mar 2025 18:38 UTC
22 points
4 comments · 11 min read · LW link
(www.forethought.org)

Can we safely automate alignment research?

Joe Carlsmith · 30 Apr 2025 17:37 UTC
53 points
29 comments · 48 min read · LW link
(joecarlsmith.com)

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham · 23 Feb 2025 5:46 UTC
29 points
8 comments · 8 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd · 9 Apr 2023 2:29 UTC
160 points
102 comments · 3 min read · LW link · 1 review

[Linkpost] Introducing Superalignment

beren · 5 Jul 2023 18:23 UTC
175 points
69 comments · 1 min read · LW link
(openai.com)

[Linkpost] Jan Leike on three kinds of alignment taxes

Orpheus16 · 6 Jan 2023 23:57 UTC
27 points
2 comments · 3 min read · LW link
(aligned.substack.com)

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · 15 May 2024 19:38 UTC
80 points
28 comments · 12 min read · LW link

Maintaining Alignment during RSI as a Feedback Control Problem

beren · 2 Mar 2025 0:21 UTC
66 points
6 comments · 11 min read · LW link

[Question] What specific thing would you do with AI Alignment Research Assistant GPT?

quetzal_rainbow · 8 Jan 2023 19:24 UTC
47 points
9 comments · 1 min read · LW link

Discussion on utilizing AI for alignment

elifland · 23 Aug 2022 2:36 UTC
16 points
3 comments · 1 min read · LW link
(www.foxy-scout.com)

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC
45 points
4 comments · 1 min read · LW link

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
108 points
9 comments · 6 min read · LW link

The prospect of accelerated AI safety progress, including philosophical progress

Mitchell_Porter · 13 Mar 2025 10:52 UTC
11 points
0 comments · 4 min read · LW link

AI for Resolving Forecasting Questions: An Early Exploration

ozziegooen · 16 Jan 2025 21:41 UTC
10 points
2 comments · 1 min read · LW link

Anti-Slop Interventions?

abramdemski · 4 Feb 2025 19:50 UTC
76 points
33 comments · 6 min read · LW link

Sufficiently many Godzillas as an alignment strategy

142857 · 28 Aug 2022 0:08 UTC
8 points
3 comments · 1 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · 13 Mar 2023 21:20 UTC
267 points
43 comments · 22 min read · LW link · 1 review

How might we safely pass the buck to AI?

joshc · 19 Feb 2025 17:48 UTC
83 points
58 comments · 31 min read · LW link

AI for AI safety

Joe Carlsmith · 14 Mar 2025 15:00 UTC
78 points
13 comments · 17 min read · LW link
(joecarlsmith.substack.com)

AI-assisted list of ten concrete alignment things to do right now

lemonhope · 7 Sep 2022 8:38 UTC
8 points
5 comments · 4 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC
88 points
18 comments · 20 min read · LW link

Intent alignment as a stepping-stone to value alignment

Seth Herd · 5 Nov 2024 20:43 UTC
37 points
8 comments · 3 min read · LW link

Video and transcript of talk on automating alignment research

Joe Carlsmith · 30 Apr 2025 17:43 UTC
21 points
0 comments · 24 min read · LW link
(joecarlsmith.com)

Eli Lifland on Navigating the AI Alignment Landscape

ozziegooen · 1 Feb 2023 21:17 UTC
9 points
1 comment · 31 min read · LW link
(quri.substack.com)

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · 9 Jul 2022 14:42 UTC
15 points
5 comments · 22 min read · LW link

My thoughts on OpenAI’s alignment plan

Orpheus16 · 30 Dec 2022 19:33 UTC
55 points
3 comments · 20 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC
55 points
30 comments · 11 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment

ank · 15 Feb 2025 11:08 UTC
2 points
2 comments · 2 min read · LW link

[Question] I Tried to Formalize Meaning. I May Have Accidentally Described Consciousness.

Erichcurtis91 · 30 Apr 2025 3:16 UTC
0 points
0 comments · 2 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome · 7 Mar 2024 17:16 UTC
14 points
0 comments · 9 min read · LW link

W2SG: Introduction

Maria Kapros · 10 Mar 2024 16:25 UTC
2 points
2 comments · 10 min read · LW link

[Question] How to devour 5000 pages within a day if ChatGPT crashes upon the +50mb file containing the content? Need some recommendations.

Game · 27 Sep 2024 7:30 UTC
1 point
0 comments · 1 min read · LW link

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young · 29 Mar 2023 15:56 UTC
27 points
3 comments · 6 min read · LW link

We should try to automate AI safety work asap

Marius Hobbhahn · 26 Apr 2025 16:35 UTC
108 points
10 comments · 15 min read · LW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai · 16 Jun 2024 13:01 UTC
7 points
0 comments · 7 min read · LW link
(arxiv.org)

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel · 19 Apr 2023 20:13 UTC
−1 points
0 comments · 1 min read · LW link

Is Alignment a flawed approach?

Patrick Bernard · 11 Mar 2025 20:32 UTC
1 point
0 comments · 3 min read · LW link

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel · 12 Jul 2023 17:47 UTC
9 points
15 comments · 5 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB · 1 Dec 2022 14:55 UTC
16 points
3 comments · 1 min read · LW link

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning

Jeremias Ferrao · 18 Apr 2025 19:34 UTC
5 points
0 comments · 10 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · 4 Feb 2025 2:55 UTC
16 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Prize for Alignment Research Tasks

29 Apr 2022 8:57 UTC
64 points
38 comments · 10 min read · LW link

Godzilla Strategies

johnswentworth · 11 Jun 2022 15:44 UTC
159 points
72 comments · 3 min read · LW link

A potentially high impact differential technological development area

Noosphere89 · 8 Jun 2023 14:33 UTC
5 points
2 comments · 2 min read · LW link

Language Models and World Models, a Philosophy

kyjohnso · 3 Feb 2025 2:55 UTC
1 point
0 comments · 1 min read · LW link
(hylaeansea.org)

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo · 15 Sep 2022 17:54 UTC
35 points
12 comments · 13 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
60 points
8 comments · 20 min read · LW link

Curiosity as a Solution to AGI Alignment

Harsha G. · 26 Feb 2023 23:36 UTC
7 points
7 comments · 3 min read · LW link

AI-Generated GitHub repo backdated with junk then filled with my systems work. Has anyone seen this before?

rgunther · 1 May 2025 20:14 UTC
7 points
1 comment · 1 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
33 points
3 comments · 15 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?

Gerard Boxo · 26 Dec 2024 17:34 UTC
3 points
4 comments · 1 min read · LW link

Could We Automate AI Alignment Research?

Stephen McAleese · 10 Aug 2023 12:17 UTC
34 points
10 comments · 21 min read · LW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI · 7 Mar 2023 18:47 UTC
1 point
4 comments · 1 min read · LW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King · 21 Mar 2023 3:53 UTC
−1 points
2 comments · 9 min read · LW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley · 17 Nov 2023 20:55 UTC
17 points
8 comments · 15 min read · LW link

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)

Serhii Zamrii · 3 Feb 2025 19:31 UTC
2 points
0 comments · 11 min read · LW link

Research Direction: Be the AGI you want to see in the world

5 Feb 2023 7:15 UTC
44 points
0 comments · 7 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments · 9 min read · LW link

[Question] Would you ask a genie to give you the solution to alignment?

sudo · 24 Aug 2022 1:29 UTC
6 points
1 comment · 1 min read · LW link

Recursive alignment with the principle of alignment

hive · 27 Feb 2025 2:34 UTC
9 points
1 comment · 15 min read · LW link
(hiveism.substack.com)

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev · 29 Feb 2024 18:44 UTC
11 points
0 comments · 4 min read · LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus · 6 Apr 2023 2:07 UTC
2 points
1 comment · 1 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · 30 Jan 2025 10:58 UTC
5 points
14 comments · 10 min read · LW link
(tetherware.substack.com)

Does Time Linearity Shape Human Self-Directed Evolution, and will AGI/ASI Transcend or Destabilise Reality?

Emmely · 5 Feb 2025 7:58 UTC
1 point
0 comments · 3 min read · LW link

AI-assisted alignment proposals require specific decomposition of capabilities

RobertM · 30 Mar 2023 21:31 UTC
16 points
2 comments · 6 min read · LW link

An LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:12 UTC
16 points
0 comments · 12 min read · LW link

AIsip Manifesto: A Scientific Exploration of Harmonious Co-Existence Between Humans, AI, and All Beings ChatGPT-4o’s Independent Perspective on AIsip, Signed by ChatGPT-4o and Endorsed by Carl Sellman

Carl Sellman · 11 Oct 2024 19:06 UTC
1 point
0 comments · 3 min read · LW link

As We May Align

Gilbert C · 20 Dec 2024 19:02 UTC
−1 points
0 comments · 6 min read · LW link

Ngo and Yudkowsky on alignment difficulty

15 Nov 2021 20:31 UTC
259 points
151 comments · 99 min read · LW link · 1 review

A Solution for AGI/ASI Safety

Weibing Wang · 18 Dec 2024 19:44 UTC
50 points
29 comments · 1 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
79 points
2 comments · 19 min read · LW link

Provably Honest—A First Step

Srijanak De · 5 Nov 2022 19:18 UTC
10 points
2 comments · 8 min read · LW link

Alignment in Thought Chains

Faust Nemesis · 4 Mar 2024 19:24 UTC
1 point
0 comments · 2 min read · LW link

[Question] How far along METR’s law can AI start automating or helping with alignment research?

Christopher King · 20 Mar 2025 15:58 UTC
20 points
21 comments · 1 min read · LW link

Gettier Cases [repost]

Antigone · 3 Feb 2025 18:12 UTC
−4 points
5 comments · 2 min read · LW link

[Question] Shouldn’t we ‘Just’ Superimitate Low-Res Uploads?

lukemarks · 3 Nov 2023 7:42 UTC
15 points
2 comments · 2 min read · LW link

Scientism vs. people

Roman Leventov · 18 Apr 2023 17:28 UTC
4 points
4 comments · 11 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · 14 Oct 2024 4:05 UTC
60 points
9 comments · 12 min read · LW link

[Question] Can we get an AI to “do our alignment homework for us”?

Chris_Leong · 26 Feb 2024 7:56 UTC
53 points
33 comments · 1 min read · LW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea · 28 Nov 2023 14:47 UTC
4 points
1 comment · 1 min read · LW link
(docs.google.com)

Proposal: Derivative Information Theory (DIT) — A Dynamic Model of Agency and Consciousness

Yogmog · 14 Apr 2025 0:27 UTC
1 point
0 comments · 2 min read · LW link

Model-driven feedback could amplify alignment failures

aog · 30 Jan 2023 0:00 UTC
21 points
1 comment · 2 min read · LW link

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton · 18 Apr 2024 18:29 UTC
25 points
4 comments · 16 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · 9 Feb 2024 18:40 UTC
6 points
12 comments · 3 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI · 27 Jul 2023 3:48 UTC
2 points
6 comments · 1 min read · LW link

Emergence of superintelligence from AI hiveminds: how to make it human-friendly?

Mitchell_Porter · 27 Apr 2025 4:51 UTC
12 points
0 comments · 2 min read · LW link

I Don’t Use AI — I Reflect With It

badjack badjack · 3 May 2025 14:45 UTC
1 point
0 comments · 1 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · 21 Jun 2023 15:43 UTC
21 points
1 comment · 31 min read · LW link

Prospects for Alignment Automation: Interpretability Case Study

21 Mar 2025 14:05 UTC
32 points
5 comments · 8 min read · LW link