
Aligned AI Proposals

Last edit: Jan 4, 2025, 10:05 PM by Dakara

Aligned AI Proposals are proposals for ensuring that artificial intelligence systems behave in accordance with human intentions (intent alignment) or human values (value alignment).

The main goal of these proposals is to ensure that AI systems will, all things considered, benefit humanity.

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley · Jan 23, 2025, 6:44 AM
30 points
3 comments · 4 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · Jul 6, 2024, 1:23 AM
60 points
39 comments · 24 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley · Jan 5, 2024, 8:46 AM
37 points
4 comments · 2 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · Nov 28, 2023, 7:56 PM
64 points
30 comments · 11 min read · LW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley · Jan 11, 2024, 12:56 PM
35 points
4 comments · 39 min read · LW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley · Feb 1, 2024, 9:15 PM
15 points
15 comments · 13 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · Feb 14, 2024, 7:10 AM
41 points
12 comments · 31 min read · LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley · Jan 9, 2024, 8:42 PM
47 points
8 comments · 36 min read · LW link

Interpreting the Learning of Deceit

RogerDearnaley · Dec 18, 2023, 8:12 AM
30 points
14 comments · 9 min read · LW link

Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Chipmonk · Jan 3, 2024, 5:55 PM
48 points
3 comments · 3 min read · LW link

[Linkpost] Building Altruistic and Moral AI Agent with Brain-inspired Affective Empathy Mechanisms

Gunnar_Zarncke · Nov 4, 2024, 10:15 AM
13 points
0 comments · 1 min read · LW link
(arxiv.org)

How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

Joe Carlsmith · Oct 28, 2024, 9:57 PM
54 points
5 comments · 32 min read · LW link

The (partial) fallacy of dumb superintelligence

Seth Herd · Oct 18, 2023, 9:25 PM
38 points
5 comments · 4 min read · LW link

AI Alignment Metastrategy

Vanessa Kosoy · Dec 31, 2023, 12:06 PM
120 points
13 comments · 7 min read · LW link

We have promising alignment plans with low taxes

Seth Herd · Nov 10, 2023, 6:51 PM
44 points
9 comments · 5 min read · LW link

A list of core AI safety problems and how I hope to solve them

davidad · Aug 26, 2023, 3:12 PM
165 points
29 comments · 5 min read · LW link

A Nonconstructive Existence Proof of Aligned Superintelligence

Roko · Sep 12, 2024, 3:20 AM
0 points
80 comments · 1 min read · LW link
(transhumanaxiology.substack.com)

Desiderata for an AI

Nathan Helm-Burger · Jul 19, 2023, 4:18 PM
9 points
0 comments · 4 min read · LW link

Two paths to win the AGI transition

Nathan Helm-Burger · Jul 6, 2023, 9:59 PM
11 points
8 comments · 4 min read · LW link

Speculation on mapping the moral landscape for future Ai Alignment

Sven Heinz (Welwordion) · Apr 16, 2023, 1:43 PM
1 point
0 comments · 1 min read · LW link

A Proposal for AI Alignment: Using Directly Opposing Models

Arne B · Apr 27, 2023, 6:05 PM
0 points
5 comments · 3 min read · LW link

AI Alignment: A Comprehensive Survey

Stephen McAleer · Nov 1, 2023, 5:35 PM
20 points
1 comment · 1 min read · LW link
(arxiv.org)

Is Interpretability All We Need?

RogerDearnaley · Nov 14, 2023, 5:31 AM
1 point
1 comment · 1 min read · LW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley · Dec 7, 2023, 6:14 AM
9 points
0 comments · 11 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · May 8, 2023, 9:26 PM
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

Proposal: Align Systems Earlier In Training

OneManyNone · May 16, 2023, 4:24 PM
18 points
0 comments · 11 min read · LW link

The Goal Misgeneralization Problem

Myspy · May 18, 2023, 11:40 PM
1 point
0 comments · 1 min read · LW link
(drive.google.com)

An LLM-based “exemplary actor”

Roman Leventov · May 29, 2023, 11:12 AM
16 points
0 comments · 12 min read · LW link

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman Leventov · May 29, 2023, 11:08 AM
12 points
10 comments · 30 min read · LW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

Justausername · Jul 23, 2023, 4:08 PM
4 points
1 comment · 3 min read · LW link

Autonomous Alignment Oversight Framework (AAOF)

Justausername · Jul 25, 2023, 10:25 AM
−9 points
0 comments · 4 min read · LW link

Embedding Ethical Priors into AI Systems: A Bayesian Approach

Justausername · Aug 3, 2023, 3:31 PM
−5 points
3 comments · 21 min read · LW link

Reducing the risk of catastrophically misaligned AI by avoiding the Singleton scenario: the Manyton Variant

GravitasGradient · Aug 6, 2023, 2:24 PM
−6 points
0 comments · 3 min read · LW link

[Question] Bostrom’s Solution

James Blackmon · Aug 14, 2023, 5:09 PM
1 point
0 comments · 1 min read · LW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername · Aug 24, 2023, 3:53 AM
1 point
0 comments · 6 min read · LW link

An Open Agency Architecture for Safe Transformative AI

davidad · Dec 20, 2022, 1:04 PM
80 points
22 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Dec 16, 2023, 5:49 AM
76 points
4 comments · 6 min read · LW link · 1 review

Lifelogging for Alignment & Immortality

Dev.Errata · Aug 17, 2024, 11:42 PM
13 points
3 comments · 7 min read · LW link

Update on Developing an Ethics Calculator to Align an AGI to

sweenesm · Mar 12, 2024, 12:33 PM
4 points
2 comments · 8 min read · LW link

Alignment in Thought Chains

Faust Nemesis · Mar 4, 2024, 7:24 PM
1 point
0 comments · 2 min read · LW link

Moral realism and AI alignment

Caspar Oesterheld · Sep 3, 2018, 6:46 PM
13 points
10 comments · 1 min read · LW link
(casparoesterheld.com)

Proposal for an AI Safety Prize

sweenesm · Jan 31, 2024, 6:35 PM
3 points
0 comments · 2 min read · LW link

How to safely use an optimizer

Simon Fischer · Mar 28, 2024, 4:11 PM
47 points
21 comments · 7 min read · LW link

Slowed ASI—a possible technical strategy for alignment

Lester Leong · Jun 14, 2024, 12:57 AM
5 points
2 comments · 3 min read · LW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ank · Feb 22, 2025, 12:12 AM
1 point
0 comments · 6 min read · LW link

aimless ace analyzes active amateur: a micro-aaaaalignment proposal

lemonhope · Jul 21, 2024, 12:37 PM
12 points
0 comments · 1 min read · LW link

Toward a Human Hybrid Language for Enhanced Human-Machine Communication: Addressing the AI Alignment Problem

Andndn Dheudnd · Aug 14, 2024, 10:19 PM
−4 points
2 comments · 4 min read · LW link

Making alignment a law of the universe

juggins · Feb 25, 2025, 10:44 AM
0 points
3 comments · 15 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · Oct 14, 2024, 4:05 AM
60 points
9 comments · 12 min read · LW link

Recursive alignment with the principle of alignment

hive · Feb 27, 2025, 2:34 AM
8 points
0 comments · 15 min read · LW link
(hiveism.substack.com)

[Question] Share AI Safety Ideas: Both Crazy and Not

ank · Mar 1, 2025, 7:08 PM
16 points
28 comments · 1 min read · LW link

Rational Utopia & Narrow Way There: Multiversal AI Alignment, Non-Agentic Static Place AI, New Ethics… (V. 4)

ank · Feb 11, 2025, 3:21 AM
13 points
8 comments · 35 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment

ank · Feb 15, 2025, 11:08 AM
2 points
2 comments · 2 min read · LW link

Give Neo a Chance

ank · Mar 6, 2025, 1:48 AM
3 points
7 comments · 7 min read · LW link

Is Alignment a flawed approach?

Patrick Bernard · Mar 11, 2025, 8:32 PM
1 point
0 comments · 3 min read · LW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

Shivam · Jan 30, 2025, 2:44 AM
1 point
0 comments · 11 min read · LW link