
Research Agendas

Last edit: Sep 16, 2021, 3:08 PM by plex

Research Agendas lay out the areas of research which individuals or groups are working on, or those that they believe would be valuable for others to work on. They help make research more legible and encourage discussion of priorities.

Embedded Agents

Oct 29, 2018, 7:53 PM
233 points
42 comments, 1 min read, LW link, 2 reviews

New safety research agenda: scalable agent alignment via reward modeling

Vika, Nov 20, 2018, 5:29 PM
34 points
12 comments, 1 min read, LW link
(medium.com)

The Learning-Theoretic AI Alignment Research Agenda

Vanessa Kosoy, Jul 4, 2018, 9:53 AM
93 points
37 comments, 32 min read, LW link

On how various plans miss the hard bits of the alignment challenge

So8res, Jul 12, 2022, 2:49 AM
313 points
89 comments, 29 min read, LW link, 3 reviews

Research Agenda v0.9: Synthesising a human’s preferences into a utility function

Stuart_Armstrong, Jun 17, 2019, 5:46 PM
70 points
26 comments, 33 min read, LW link

AI Governance: A Research Agenda

habryka, Sep 5, 2018, 6:00 PM
25 points
3 comments, 1 min read, LW link
(www.fhi.ox.ac.uk)

Paul’s research agenda FAQ

zhukeepa, Jul 1, 2018, 6:25 AM
128 points
74 comments, 19 min read, LW link, 1 review

Our take on CHAI’s research agenda in under 1500 words

Alex Flint, Jun 17, 2020, 12:24 PM
113 points
18 comments, 5 min read, LW link

An overview of 11 proposals for building safe advanced AI

evhub, May 29, 2020, 8:38 PM
220 points
37 comments, 38 min read, LW link, 2 reviews

Research Adenda: Modelling Trajectories of Language Models

NickyP, Nov 13, 2023, 2:33 PM
28 points
0 comments, 12 min read, LW link

The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

Dec 18, 2023, 8:35 PM
175 points
22 comments, 12 min read, LW link, 1 review

Embedded Agency (full-text version)

Nov 15, 2018, 7:49 PM
201 points
17 comments, 54 min read, LW link

Trying to isolate objectives: approaches toward high-level interpretability

Jozdien, Jan 9, 2023, 6:33 PM
49 points
14 comments, 8 min read, LW link

Deconfusing Human Values Research Agenda v1

Gordon Seidoh Worley, Mar 23, 2020, 4:25 PM
28 points
12 comments, 4 min read, LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

Apr 19, 2023, 4:09 PM
168 points
40 comments, 21 min read, LW link, 2 reviews

Thoughts on Human Models

Feb 21, 2019, 9:10 AM
127 points
32 comments, 10 min read, LW link, 1 review

MIRI’s technical research agenda

So8res, Dec 23, 2014, 6:45 PM
55 points
52 comments, 3 min read, LW link

Preface to CLR’s Research Agenda on Cooperation, Conflict, and TAI

JesseClifton, Dec 13, 2019, 9:02 PM
62 points
10 comments, 2 min read, LW link

Research agenda update

Steven Byrnes, Aug 6, 2021, 7:24 PM
55 points
40 comments, 7 min read, LW link

Some conceptual alignment research projects

Richard_Ngo, Aug 25, 2022, 10:51 PM
177 points
15 comments, 3 min read, LW link

The Learning-Theoretic Agenda: Status 2023

Vanessa Kosoy, Apr 19, 2023, 5:21 AM
143 points
21 comments, 56 min read, LW link, 3 reviews

New year, new research agenda post

Charlie Steiner, Jan 12, 2022, 5:58 PM
29 points
4 comments, 16 min read, LW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

Sep 3, 2020, 6:27 PM
68 points
11 comments, 2 min read, LW link

Key Questions for Digital Minds

Jacy Reese Anthis, Mar 22, 2023, 5:13 PM
22 points
0 comments, 7 min read, LW link
(www.sentienceinstitute.org)

The space of systems and the space of maps

Mar 22, 2023, 2:59 PM
38 points
0 comments, 5 min read, LW link

Towards Hodge-podge Alignment

Cleo Nardo, Dec 19, 2022, 8:12 PM
95 points
30 comments, 9 min read, LW link

Theories of impact for Science of Deep Learning

Marius Hobbhahn, Dec 1, 2022, 2:39 PM
24 points
0 comments, 11 min read, LW link

Announcing the Alignment of Complex Systems Research Group

Jun 4, 2022, 4:10 AM
91 points
20 comments, 5 min read, LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey, Apr 3, 2024, 12:34 PM
96 points
23 comments, 22 min read, LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

Apr 27, 2024, 4:04 PM
85 points
13 comments, 13 min read, LW link

The Prop-room and Stage Cognitive Architecture

Robert Kralisch, Apr 29, 2024, 12:48 AM
14 points
4 comments, 14 min read, LW link

Notes on notes on virtues

David Gross, Dec 30, 2020, 5:47 PM
71 points
11 comments, 11 min read, LW link

What and Why: Developmental Interpretability of Reinforcement Learning

Garrett Baker, Jul 9, 2024, 2:09 PM
68 points
4 comments, 6 min read, LW link

Towards the Operationalization of Philosophy & Wisdom

Thane Ruthenis, Oct 28, 2024, 7:45 PM
20 points
2 comments, 33 min read, LW link
(aiimpacts.org)

Self-prediction acts as an emergent regularizer

Oct 23, 2024, 10:27 PM
91 points
9 comments, 4 min read, LW link

Seeking Collaborators

abramdemski, Nov 1, 2024, 5:13 PM
57 points
15 comments, 7 min read, LW link

Shallow review of technical AI safety, 2024

Dec 29, 2024, 12:01 PM
185 points
34 comments, 41 min read, LW link

My AGI safety research—2024 review, ’25 plans

Steven Byrnes, Dec 31, 2024, 9:05 PM
109 points
4 comments, 8 min read, LW link

Ultra-simplified research agenda

Stuart_Armstrong, Nov 22, 2019, 2:29 PM
34 points
4 comments, 1 min read, LW link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov, Jan 18, 2024, 10:05 AM
5 points
2 comments, 4 min read, LW link

Four visions of Transformative AI success

Steven Byrnes, Jan 17, 2024, 8:45 PM
112 points
22 comments, 15 min read, LW link

The Plan − 2023 Version

johnswentworth, Dec 29, 2023, 11:34 PM
152 points
40 comments, 31 min read, LW link, 1 review

Assessment of AI safety agendas: think about the downside risk

Roman Leventov, Dec 19, 2023, 9:00 AM
13 points
1 comment, 1 min read, LW link

Embedded Curiosities

Nov 8, 2018, 2:19 PM
91 points
1 comment, 2 min read, LW link

Subsystem Alignment

Nov 6, 2018, 4:16 PM
102 points
12 comments, 1 min read, LW link

Robust Delegation

Nov 4, 2018, 4:38 PM
116 points
10 comments, 1 min read, LW link

Embedded World-Models

Nov 2, 2018, 4:07 PM
96 points
16 comments, 1 min read, LW link

Decision Theory

Oct 31, 2018, 6:41 PM
121 points
45 comments, 1 min read, LW link

Announcing Human-aligned AI Summer School

May 22, 2024, 8:55 AM
50 points
0 comments, 1 min read, LW link
(humanaligned.ai)

The Shortest Path Between Scylla and Charybdis

Thane Ruthenis, Dec 18, 2023, 8:08 PM
50 points
8 comments, 5 min read, LW link

Research agenda: Supervising AIs improving AIs

Apr 29, 2023, 5:09 PM
76 points
5 comments, 19 min read, LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper, Dec 5, 2023, 4:48 PM
125 points
30 comments, 13 min read, LW link

Sections 1 & 2: Introduction, Strategy and Governance

JesseClifton, Dec 17, 2019, 9:27 PM
35 points
8 comments, 14 min read, LW link

Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms

JesseClifton, Dec 17, 2019, 9:46 PM
20 points
2 comments, 12 min read, LW link

Sections 5 & 6: Contemporary Architectures, Humans in the Loop

JesseClifton, Dec 20, 2019, 3:52 AM
27 points
4 comments, 10 min read, LW link

Section 7: Foundations of Rational Agency

JesseClifton, Dec 22, 2019, 2:05 AM
14 points
4 comments, 8 min read, LW link

Acknowledgements & References

JesseClifton, Dec 14, 2019, 7:04 AM
6 points
0 comments, 14 min read, LW link

Alignment proposals and complexity classes

evhub, Jul 16, 2020, 12:27 AM
40 points
26 comments, 13 min read, LW link

Orthogonal’s Formal-Goal Alignment theory of change

Tamsin Leake, May 5, 2023, 10:36 PM
69 points
13 comments, 4 min read, LW link
(carado.moe)

The Goodhart Game

John_Maxwell, Nov 18, 2019, 11:22 PM
13 points
5 comments, 5 min read, LW link

[Linkpost] Interpretability Dreams

DanielFilan, May 24, 2023, 9:08 PM
39 points
2 comments, 2 min read, LW link
(transformer-circuits.pub)

My AI Alignment Research Agenda and Threat Model, right now (May 2023)

Nicholas / Heather Kross, May 28, 2023, 3:23 AM
25 points
0 comments, 6 min read, LW link
(www.thinkingmuchbetter.com)

Abstraction is Bigger than Natural Abstraction

Nicholas / Heather Kross, May 31, 2023, 12:00 AM
18 points
0 comments, 5 min read, LW link
(www.thinkingmuchbetter.com)

[Question] Does anyone’s full-time job include reading and understanding all the most-promising formal AI alignment work?

Nicholas / Heather Kross, Jun 16, 2023, 2:24 AM
15 points
2 comments, 1 min read, LW link

My research agenda in agent foundations

Alex_Altair, Jun 28, 2023, 6:00 PM
72 points
9 comments, 11 min read, LW link

My Alignment Timeline

Nicholas / Heather Kross, Jul 3, 2023, 1:04 AM
22 points
0 comments, 2 min read, LW link

My Central Alignment Priority (2 July 2023)

Nicholas / Heather Kross, Jul 3, 2023, 1:46 AM
12 points
1 comment, 3 min read, LW link

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

Stuart_Armstrong, Sep 17, 2021, 3:24 PM
32 points
3 comments, 2 min read, LW link

Testing The Natural Abstraction Hypothesis: Project Update

johnswentworth, Sep 20, 2021, 3:44 AM
88 points
17 comments, 8 min read, LW link, 1 review

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Stuart_Armstrong, Sep 20, 2021, 11:56 AM
14 points
4 comments, 3 min read, LW link

The Plan

johnswentworth, Dec 10, 2021, 11:41 PM
260 points
78 comments, 14 min read, LW link, 1 review

Paradigm-building: Introduction

Cameron Berg, Feb 8, 2022, 12:06 AM
28 points
0 comments, 2 min read, LW link

Acceptability Verification: A Research Agenda

Jul 12, 2022, 8:11 PM
50 points
0 comments, 1 min read, LW link
(docs.google.com)

Gaia Network: a practical, incremental pathway to Open Agency Architecture

Dec 20, 2023, 5:11 PM
22 points
8 comments, 16 min read, LW link

Remarks 1–18 on GPT (compressed)

Cleo Nardo, Mar 20, 2023, 10:27 PM
145 points
35 comments, 31 min read, LW link

(My understanding of) What Everyone in Technical Alignment is Doing and Why

Aug 29, 2022, 1:23 AM
413 points
90 comments, 37 min read, LW link, 1 review

Distilled Representations Research Agenda

Oct 18, 2022, 8:59 PM
15 points
2 comments, 8 min read, LW link

My AGI safety research—2022 review, ’23 plans

Steven Byrnes, Dec 14, 2022, 3:15 PM
51 points
10 comments, 7 min read, LW link

An overview of some promising work by junior alignment researchers

Orpheus16, Dec 26, 2022, 5:23 PM
34 points
0 comments, 4 min read, LW link

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper, May 21, 2024, 8:15 PM
157 points
16 comments, 3 min read, LW link

World-Model Interpretability Is All We Need

Thane Ruthenis, Jan 14, 2023, 7:37 PM
36 points
22 comments, 21 min read, LW link

Selection Theorems: A Program For Understanding Agents

johnswentworth, Sep 28, 2021, 5:03 AM
128 points
28 comments, 6 min read, LW link, 2 reviews

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes, Feb 10, 2023, 7:22 PM
71 points
19 comments, 9 min read, LW link

Gradient Descent on the Human Brain

Apr 1, 2024, 10:39 PM
59 points
5 comments, 2 min read, LW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ank, Feb 22, 2025, 12:12 AM
1 point
0 comments, 6 min read, LW link

EIS VII: A Challenge for Mechanists

scasper, Feb 18, 2023, 6:27 PM
36 points
4 comments, 3 min read, LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper, Feb 19, 2023, 3:25 PM
30 points
5 comments, 4 min read, LW link

Resources for AI Alignment Cartography

Gyrodiot, Apr 4, 2020, 2:20 PM
45 points
8 comments, 9 min read, LW link

Introducing the Longevity Research Institute

sarahconstantin, May 8, 2018, 3:30 AM
54 points
20 comments, 1 min read, LW link
(srconstantin.wordpress.com)

Announcement: AI alignment prize round 3 winners and next round

cousin_it, Jul 15, 2018, 7:40 AM
93 points
7 comments, 1 min read, LW link

Machine Learning Projects on IDA

Jun 24, 2019, 6:38 PM
49 points
3 comments, 2 min read, LW link

AI Alignment Research Overview (by Jacob Steinhardt)

Ben Pace, Nov 6, 2019, 7:24 PM
44 points
0 comments, 7 min read, LW link
(docs.google.com)

Creating Welfare Biology: A Research Proposal

ozymandias, Nov 16, 2017, 7:06 PM
20 points
5 comments, 4 min read, LW link

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

Fernando Avalos, Sep 9, 2024, 3:33 AM
6 points
1 comment, 1 min read, LW link
(forum.effectivealtruism.org)

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov, May 8, 2023, 9:26 PM
18 points
2 comments, 7 min read, LW link
(yoshuabengio.org)

H-JEPA might be technically alignable in a modified form

Roman Leventov, May 8, 2023, 11:04 PM
12 points
2 comments, 7 min read, LW link

Roadmap for a collaborative prototype of an Open Agency Architecture

Deger Turan, May 10, 2023, 5:41 PM
31 points
0 comments, 12 min read, LW link

Notes on the importance and implementation of safety-first cognitive architectures for AI

Brendon_Wong, May 11, 2023, 10:03 AM
3 points
0 comments, 3 min read, LW link

EIS IX: Interpretability and Adversaries

scasper, Feb 20, 2023, 6:25 PM
30 points
8 comments, 8 min read, LW link

Research Agenda in reverse: what *would* a solution look like?

Stuart_Armstrong, Jun 25, 2019, 1:52 PM
35 points
25 comments, 1 min read, LW link

Announcing: The Independent AI Safety Registry

Shoshannah Tekofsky, Dec 26, 2022, 9:22 PM
53 points
9 comments, 1 min read, LW link

Forecasting AI Progress: A Research Agenda

rossg, Aug 10, 2020, 1:04 AM
39 points
4 comments, 1 min read, LW link

Technical AGI safety research outside AI

Richard_Ngo, Oct 18, 2019, 3:00 PM
43 points
3 comments, 3 min read, LW link

Why I am not currently working on the AAMLS agenda

jessicata, Jun 1, 2017, 5:57 PM
28 points
3 comments, 5 min read, LW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King, Jun 2, 2023, 9:54 PM
7 points
4 comments, 16 min read, LW link

[Question] Research ideas (AI Interpretability & Neurosciences) for a 2-months project

flux, Jan 8, 2023, 3:36 PM
3 points
1 comment, 1 min read, LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper, Feb 21, 2023, 4:59 PM
14 points
4 comments, 3 min read, LW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev, Jun 19, 2023, 2:32 AM
4 points
2 comments, 7 min read, LW link

Natural abstractions are observer-dependent: a conversation with John Wentworth

Martín Soto, Feb 12, 2024, 5:28 PM
39 points
13 comments, 7 min read, LW link

RFC: a tool to create a ranked list of projects in explainable AI

eamag, Apr 6, 2025, 9:18 PM
2 points
0 comments, 1 min read, LW link
(eamag.me)

Partial Simulation Extrapolation: A Proposal for Building Safer Simulators

lukemarks, Jun 17, 2023, 1:55 PM
16 points
0 comments, 10 min read, LW link

[UPDATE: deadline extended to July 24!] New wind in rationality’s sails: Applications for Epistea Residency 2023 are now open

Jul 11, 2023, 11:02 AM
80 points
7 comments, 3 min read, LW link

The AI Control Problem in a wider intellectual context

philosophybear, Jan 13, 2023, 12:28 AM
11 points
3 comments, 12 min read, LW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
318 points
30 comments, 18 min read, LW link, 1 review

Towards White Box Deep Learning

Maciej Satkiewicz, Mar 27, 2024, 6:20 PM
18 points
5 comments, 1 min read, LW link
(arxiv.org)

Which of these five AI alignment research projects ideas are no good?

rmoehn, Aug 8, 2019, 7:17 AM
25 points
13 comments, 1 min read, LW link

Funding Good Research

lukeprog, May 27, 2012, 6:41 AM
38 points
44 comments, 2 min read, LW link

The Löbian Obstacle, And Why You Should Care

lukemarks, Sep 7, 2023, 11:59 PM
18 points
6 comments, 2 min read, LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

Oct 3, 2023, 7:45 AM
17 points
0 comments, 5 min read, LW link

Thoughts On (Solving) Deep Deception

Jozdien, Oct 21, 2023, 10:40 PM
72 points
6 comments, 6 min read, LW link

Notes on effective-altruism-related research, writing, testing fit, learning, and the EA Forum

MichaelA, Mar 28, 2021, 11:43 PM
14 points
0 comments, 4 min read, LW link

AI Safety in a World of Vulnerable Machine Learning Systems

Mar 8, 2023, 2:40 AM
70 points
28 comments, 29 min read, LW link
(far.ai)

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete Citrini, Sep 16, 2021, 4:13 PM
6 points
0 comments, 8 min read, LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy, May 12, 2022, 8:01 PM
58 points
0 comments, 59 min read, LW link

Research agenda: Formalizing abstractions of computations

Erik Jenner, Feb 2, 2023, 4:29 AM
93 points
10 comments, 31 min read, LW link

A multi-disciplinary view on AI safety research

Roman Leventov, Feb 8, 2023, 4:50 PM
46 points
4 comments, 26 min read, LW link

AI learns betrayal and how to avoid it

Stuart_Armstrong, Sep 30, 2021, 9:39 AM
30 points
4 comments, 2 min read, LW link

A FLI postdoctoral grant application: AI alignment via causal analysis and design of agents

PabloAMC, Nov 13, 2021, 1:44 AM
4 points
0 comments, 7 min read, LW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt, Dec 15, 2021, 7:06 PM
16 points
15 comments, 27 min read, LW link

Human-AI Relationality is Already Here

bridgebot, Feb 20, 2025, 7:08 AM
13 points
0 comments, 15 min read, LW link

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMC, Jan 11, 2022, 11:28 AM
19 points
6 comments, 8 min read, LW link

Why Academia is Mostly Not Truth-Seeking

Zero Contradictions, Oct 16, 2024, 7:14 PM
−7 points
6 comments, 1 min read, LW link
(thewaywardaxolotl.blogspot.com)

Paradigm-building: The hierarchical question framework

Cameron Berg, Feb 9, 2022, 4:47 PM
11 points
15 comments, 3 min read, LW link

Question 1: Predicted architecture of AGI learning algorithm(s)

Cameron Berg, Feb 10, 2022, 5:22 PM
13 points
1 comment, 7 min read, LW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg, Feb 11, 2022, 10:23 PM
5 points
1 comment, 10 min read, LW link

NAO Updates, Fall 2024

jefftk, Oct 18, 2024, 12:00 AM
32 points
2 comments, 1 min read, LW link
(naobservatory.org)

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg, Feb 12, 2022, 7:13 PM
5 points
1 comment, 7 min read, LW link

Question 5: The timeline hyperparameter

Cameron Berg, Feb 14, 2022, 4:38 PM
8 points
3 comments, 7 min read, LW link

Shallow review of live agendas in alignment & safety

Nov 27, 2023, 11:10 AM
348 points
73 comments, 29 min read, LW link, 1 review

Paradigm-building: Conclusion and practical takeaways

Cameron Berg, Feb 15, 2022, 4:11 PM
5 points
1 comment, 2 min read, LW link

Agency overhang as a proxy for Sharp left turn

Nov 7, 2024, 12:14 PM
6 points
0 comments, 5 min read, LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse, Feb 28, 2025, 7:27 PM
16 points
0 comments, 21 min read, LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse, Feb 28, 2025, 7:20 PM
25 points
4 comments, 14 min read, LW link

EIS IV: A Spotlight on Feature Attribution/Saliency

scasper, Feb 15, 2023, 6:46 PM
19 points
1 comment, 4 min read, LW link

Give Neo a Chance

ank, Mar 6, 2025, 1:48 AM
3 points
7 comments, 7 min read, LW link

You should delay engineering-heavy research in light of R&D automation

Daniel Paleka, Jan 7, 2025, 2:11 AM
35 points
3 comments, 5 min read, LW link
(newsletter.danielpaleka.com)

Gaia Network: An Illustrated Primer

Jan 18, 2024, 6:23 PM
3 points
2 comments, 15 min read, LW link

Elicit: Language Models as Research Assistants

Apr 9, 2022, 2:56 PM
71 points
6 comments, 13 min read, LW link

EIS II: What is “Interpretability”?

scasper, Feb 9, 2023, 4:48 PM
28 points
6 comments, 4 min read, LW link

EIS III: Broad Critiques of Interpretability Research

scasper, Feb 14, 2023, 6:24 PM
20 points
2 comments, 11 min read, LW link

Conditioning Generative Models for Alignment

Jozdien, Jul 18, 2022, 7:11 AM
60 points
8 comments, 20 min read, LW link

False Positives in Entity-Level Hallucination Detection: A Technical Challenge

MaxKamachee, Jan 14, 2025, 7:22 PM
1 point
0 comments, 2 min read, LW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

Shivam, Jan 30, 2025, 2:44 AM
1 point
0 comments, 11 min read, LW link

[Question] How far along Metr’s law can AI start automating or helping with alignment research?

Christopher King, Mar 20, 2025, 3:58 PM
20 points
21 comments, 1 min read, LW link

Synthetic Neuroscience

hpcfung, Mar 25, 2025, 5:45 PM
2 points
3 comments, 3 min read, LW link

What should AI safety be trying to achieve?

EuanMcLean, May 23, 2024, 11:17 AM
17 points
1 comment, 13 min read, LW link

Retrospective: PIBBSS Fellowship 2024

Dec 20, 2024, 3:55 PM
64 points
1 comment, 4 min read, LW link

Towards empathy in RL agents and beyond: Insights from cognitive science for AI Alignment

Marc Carauleanu, Apr 3, 2023, 7:59 PM
15 points
6 comments, 1 min read, LW link
(clipchamp.com)

EIS XI: Moving Forward

scasper, Feb 22, 2023, 7:05 PM
19 points
2 comments, 9 min read, LW link

For alignment, we should simultaneously use multiple theories of cognition and value

Roman Leventov, Apr 24, 2023, 10:37 AM
23 points
5 comments, 5 min read, LW link

Unaligned AGI & Brief History of Inequality

ank, Feb 22, 2025, 4:26 PM
−20 points
4 comments, 7 min read, LW link

EIS XII: Summary

scasper, Feb 23, 2023, 5:45 PM
18 points
0 comments, 6 min read, LW link

How I think about alignment

Linda Linsefors, Aug 13, 2022, 10:01 AM
31 points
11 comments, 5 min read, LW link

Labor Participation is a High-Priority AI Alignment Risk

alex, Jun 17, 2024, 6:09 PM
6 points
0 comments, 17 min read, LW link

EIS V: Blind Spots In AI Safety Interpretability Research

scasper, Feb 16, 2023, 7:09 PM
57 points
24 comments, 10 min read, LW link

Shard Theory: An Overview

David Udell, Aug 11, 2022, 5:44 AM
166 points
34 comments, 10 min read, LW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn, Jun 8, 2022, 1:18 PM
69 points
2 comments, 21 min read, LW link

[Question] How can we secure more research positions at our universities for x-risk researchers?

Neil Crawford, Sep 6, 2022, 5:17 PM
11 points
0 comments, 1 min read, LW link

AI Existential Safety Fellowships

mmfli, Oct 28, 2023, 6:07 PM
5 points
0 comments, 1 min read, LW link

Trying to understand John Wentworth’s research agenda

Oct 20, 2023, 12:05 AM
93 points
13 comments, 12 min read, LW link

Alignment Org Cheat Sheet

Sep 20, 2022, 5:36 PM
70 points
8 comments, 4 min read, LW link

AISC project: TinyEvals

Jett Janiak, Nov 22, 2023, 8:47 PM
22 points
0 comments, 4 min read, LW link

Generative, Episodic Objectives for Safe AI

Michael Glass, Oct 5, 2022, 11:18 PM
11 points
3 comments, 8 min read, LW link

AISC 2024 - Project Summaries

NickyP, Nov 27, 2023, 10:32 PM
48 points
3 comments, 18 min read, LW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn, Oct 18, 2022, 2:54 PM
37 points
7 comments, 4 min read, LW link

Reinforcement Learning using Layered Morphology (RLLM)

MiguelDev, Dec 1, 2023, 5:18 AM
7 points
0 comments, 29 min read, LW link

A call for a quantitative report card for AI bioterrorism threat models

Juno, Dec 4, 2023, 6:35 AM
12 points
0 comments, 10 min read, LW link

What’s new at FAR AI

Dec 4, 2023, 9:18 PM
41 points
0 comments, 5 min read, LW link
(far.ai)

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI

WillPetillo, Dec 4, 2023, 10:58 PM
37 points
0 comments, 35 min read, LW link

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

scasper, Feb 17, 2023, 8:48 PM
49 points
9 comments, 12 min read, LW link

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow, Mar 6, 2023, 4:16 PM
103 points
12 comments, 1 min read, LW link

Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics… (Updated)

ank, Feb 11, 2025, 3:21 AM
13 points
8 comments, 35 min read, LW link

AI researchers announce NeuroAI agenda

Cameron Berg, Oct 24, 2022, 12:14 AM
37 points
12 comments, 6 min read, LW link
(arxiv.org)

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

Oct 27, 2022, 1:32 AM
135 points
14 comments, 12 min read, LW link

My summary of “Pragmatic AI Safety”

Eleni Angelou, Nov 5, 2022, 12:54 PM
3 points
0 comments, 5 min read, LW link