
Aligned AI Proposals

Last edit: 4 Jan 2025 22:05 UTC by Dakara

Aligned AI Proposals are proposals aimed at ensuring that artificial intelligence systems behave in accordance with human intentions (intent alignment) or human values (value alignment).

The main goal of these proposals is to ensure that AI systems will, all things considered, benefit humanity.

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
60 points
39 comments · 24 min read · LW link

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley · 23 Jan 2025 6:44 UTC
30 points
3 comments · 4 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley · 5 Jan 2024 8:46 UTC
37 points
4 comments · 2 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC
64 points
30 comments · 11 min read · LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley · 9 Jan 2024 20:42 UTC
47 points
8 comments · 36 min read · LW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley · 11 Jan 2024 12:56 UTC
35 points
4 comments · 39 min read · LW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley · 1 Feb 2024 21:15 UTC
15 points
15 comments · 13 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
40 points
12 comments · 31 min read · LW link

Interpreting the Learning of Deceit

RogerDearnaley · 18 Dec 2023 8:12 UTC
30 points
14 comments · 9 min read · LW link

A Nonconstructive Existence Proof of Aligned Superintelligence

Roko · 12 Sep 2024 3:20 UTC
0 points
80 comments · 1 min read · LW link
(transhumanaxiology.substack.com)

AI Alignment Metastrategy

Vanessa Kosoy · 31 Dec 2023 12:06 UTC
120 points
13 comments · 7 min read · LW link

A list of core AI safety problems and how I hope to solve them

davidad · 26 Aug 2023 15:12 UTC
165 points
29 comments · 5 min read · LW link

The (partial) fallacy of dumb superintelligence

Seth Herd · 18 Oct 2023 21:25 UTC
38 points
5 comments · 4 min read · LW link

We have promising alignment plans with low taxes

Seth Herd · 10 Nov 2023 18:51 UTC
44 points
9 comments · 5 min read · LW link

Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Chipmonk · 3 Jan 2024 17:55 UTC
48 points
3 comments · 3 min read · LW link

How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

Joe Carlsmith · 28 Oct 2024 21:57 UTC
54 points
5 comments · 32 min read · LW link

[Linkpost] Building Altruistic and Moral AI Agent with Brain-inspired Affective Empathy Mechanisms

Gunnar_Zarncke · 4 Nov 2024 10:15 UTC
13 points
0 comments · 1 min read · LW link
(arxiv.org)

Desiderata for an AI

Nathan Helm-Burger · 19 Jul 2023 16:18 UTC
9 points
0 comments · 4 min read · LW link

Two paths to win the AGI transition

Nathan Helm-Burger · 6 Jul 2023 21:59 UTC
11 points
8 comments · 4 min read · LW link

Proposal: Align Systems Earlier In Training

OneManyNone · 16 May 2023 16:24 UTC
18 points
0 comments · 11 min read · LW link

The Goal Misgeneralization Problem

Myspy · 18 May 2023 23:40 UTC
1 point
0 comments · 1 min read · LW link
(drive.google.com)

An LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:12 UTC
16 points
0 comments · 12 min read · LW link

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman Leventov · 29 May 2023 11:08 UTC
12 points
10 comments · 30 min read · LW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

Justausername · 23 Jul 2023 16:08 UTC
4 points
1 comment · 3 min read · LW link

Autonomous Alignment Oversight Framework (AAOF)

Justausername · 25 Jul 2023 10:25 UTC
−9 points
0 comments · 4 min read · LW link

Embedding Ethical Priors into AI Systems: A Bayesian Approach

Justausername · 3 Aug 2023 15:31 UTC
−5 points
3 comments · 21 min read · LW link

Reducing the risk of catastrophically misaligned AI by avoiding the Singleton scenario: the Manyton Variant

GravitasGradient · 6 Aug 2023 14:24 UTC
−6 points
0 comments · 3 min read · LW link

[Question] Bostrom’s Solution

James Blackmon · 14 Aug 2023 17:09 UTC
1 point
0 comments · 1 min read · LW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername · 24 Aug 2023 3:53 UTC
1 point
0 comments · 6 min read · LW link

An Open Agency Architecture for Safe Transformative AI

davidad · 20 Dec 2022 13:04 UTC
80 points
22 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
76 points
4 comments · 6 min read · LW link · 1 review

Lifelogging for Alignment & Immortality

Dev.Errata · 17 Aug 2024 23:42 UTC
13 points
3 comments · 7 min read · LW link

Update on Developing an Ethics Calculator to Align an AGI to

sweenesm · 12 Mar 2024 12:33 UTC
4 points
2 comments · 8 min read · LW link

Alignment in Thought Chains

Faust Nemesis · 4 Mar 2024 19:24 UTC
1 point
0 comments · 2 min read · LW link

Moral realism and AI alignment

Caspar Oesterheld · 3 Sep 2018 18:46 UTC
13 points
10 comments · 1 min read · LW link
(casparoesterheld.com)

Proposal for an AI Safety Prize

sweenesm · 31 Jan 2024 18:35 UTC
3 points
0 comments · 2 min read · LW link

How to safely use an optimizer

Simon Fischer · 28 Mar 2024 16:11 UTC
47 points
21 comments · 7 min read · LW link

Slowed ASI—a possible technical strategy for alignment

Lester Leong · 14 Jun 2024 0:57 UTC
5 points
2 comments · 3 min read · LW link

aimless ace analyzes active amateur: a micro-aaaaalignment proposal

lemonhope · 21 Jul 2024 12:37 UTC
12 points
0 comments · 1 min read · LW link

Toward a Human Hybrid Language for Enhanced Human-Machine Communication: Addressing the AI Alignment Problem

Andndn Dheudnd · 14 Aug 2024 22:19 UTC
−4 points
2 comments · 4 min read · LW link

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

Lester Leong · 14 Oct 2024 4:05 UTC
60 points
9 comments · 12 min read · LW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

Shivam · 30 Jan 2025 2:44 UTC
1 point
0 comments · 12 min read · LW link

Speculation on mapping the moral landscape for future Ai Alignment

Sven Heinz (Welwordion) · 16 Apr 2023 13:43 UTC
1 point
0 comments · 1 min read · LW link

A Proposal for AI Alignment: Using Directly Opposing Models

Arne B · 27 Apr 2023 18:05 UTC
0 points
5 comments · 3 min read · LW link

AI Alignment: A Comprehensive Survey

Stephen McAleer · 1 Nov 2023 17:35 UTC
20 points
1 comment · 1 min read · LW link
(arxiv.org)

Is Interpretability All We Need?

RogerDearnaley · 14 Nov 2023 5:31 UTC
1 point
1 comment · 1 min read · LW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley · 7 Dec 2023 6:14 UTC
9 points
0 comments · 11 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)