
Research Agendas

Last edit: Sep 16, 2021, 3:08 PM by plex

Research Agendas lay out the areas of research which individuals or groups are working on, or those that they believe would be valuable for others to work on. They help make research more legible and encourage discussion of priorities.

Embedded Agents

Oct 29, 2018, 7:53 PM
233 points
42 comments, 1 min read, LW link, 2 reviews

New safety research agenda: scalable agent alignment via reward modeling

Vika, Nov 20, 2018, 5:29 PM
34 points
12 comments, 1 min read, LW link
(medium.com)

The Learning-Theoretic AI Alignment Research Agenda

Vanessa Kosoy, Jul 4, 2018, 9:53 AM
93 points
37 comments, 32 min read, LW link

On how various plans miss the hard bits of the alignment challenge

So8res, Jul 12, 2022, 2:49 AM
313 points
89 comments, 29 min read, LW link, 3 reviews

Research Agenda v0.9: Synthesising a human’s preferences into a utility function

Stuart_Armstrong, Jun 17, 2019, 5:46 PM
70 points
26 comments, 33 min read, LW link

AI Governance: A Research Agenda

habryka, Sep 5, 2018, 6:00 PM
25 points
3 comments, 1 min read, LW link
(www.fhi.ox.ac.uk)

Paul’s research agenda FAQ

zhukeepa, Jul 1, 2018, 6:25 AM
128 points
74 comments, 19 min read, LW link, 1 review

Our take on CHAI’s research agenda in under 1500 words

Alex Flint, Jun 17, 2020, 12:24 PM
113 points
18 comments, 5 min read, LW link

An overview of 11 proposals for building safe advanced AI

evhub, May 29, 2020, 8:38 PM
220 points
37 comments, 38 min read, LW link, 2 reviews

Research Adenda: Modelling Trajectories of Language Models

NickyP, Nov 13, 2023, 2:33 PM
28 points
0 comments, 12 min read, LW link

The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

Dec 18, 2023, 8:35 PM
175 points
22 comments, 12 min read, LW link, 1 review

Embedded Agency (full-text version)

Nov 15, 2018, 7:49 PM
201 points
17 comments, 54 min read, LW link

Trying to isolate objectives: approaches toward high-level interpretability

Jozdien, Jan 9, 2023, 6:33 PM
49 points
14 comments, 8 min read, LW link

Deconfusing Human Values Research Agenda v1

Gordon Seidoh Worley, Mar 23, 2020, 4:25 PM
28 points
12 comments, 4 min read, LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

Apr 19, 2023, 4:09 PM
168 points
40 comments, 21 min read, LW link, 2 reviews

Thoughts on Human Models

Feb 21, 2019, 9:10 AM
127 points
32 comments, 10 min read, LW link, 1 review

MIRI’s technical research agenda

So8res, Dec 23, 2014, 6:45 PM
55 points
52 comments, 3 min read, LW link

Preface to CLR’s Research Agenda on Cooperation, Conflict, and TAI

JesseClifton, Dec 13, 2019, 9:02 PM
62 points
10 comments, 2 min read, LW link

Research agenda update

Steven Byrnes, Aug 6, 2021, 7:24 PM
55 points
40 comments, 7 min read, LW link

Some conceptual alignment research projects

Richard_Ngo, Aug 25, 2022, 10:51 PM
177 points
15 comments, 3 min read, LW link

The Learning-Theoretic Agenda: Status 2023

Vanessa Kosoy, Apr 19, 2023, 5:21 AM
143 points
21 comments, 56 min read, LW link, 3 reviews

New year, new research agenda post

Charlie Steiner, Jan 12, 2022, 5:58 PM
29 points
4 comments, 16 min read, LW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

Sep 3, 2020, 6:27 PM
68 points
11 comments, 2 min read, LW link

Key Questions for Digital Minds

Jacy Reese Anthis, Mar 22, 2023, 5:13 PM
22 points
0 comments, 7 min read, LW link
(www.sentienceinstitute.org)

The space of systems and the space of maps

Mar 22, 2023, 2:59 PM
38 points
0 comments, 5 min read, LW link

Towards Hodge-podge Alignment

Cleo Nardo, Dec 19, 2022, 8:12 PM
95 points
30 comments, 9 min read, LW link

Theories of impact for Science of Deep Learning

Marius Hobbhahn, Dec 1, 2022, 2:39 PM
24 points
0 comments, 11 min read, LW link

Announcing the Alignment of Complex Systems Research Group

Jun 4, 2022, 4:10 AM
91 points
20 comments, 5 min read, LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey, Apr 3, 2024, 12:34 PM
96 points
23 comments, 22 min read, LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

Apr 27, 2024, 4:04 PM
85 points
13 comments, 13 min read, LW link

The Prop-room and Stage Cognitive Architecture

Robert Kralisch, Apr 29, 2024, 12:48 AM
14 points
4 comments, 14 min read, LW link

Notes on notes on virtues

David Gross, Dec 30, 2020, 5:47 PM
71 points
11 comments, 11 min read, LW link

What and Why: Developmental Interpretability of Reinforcement Learning

Garrett Baker, Jul 9, 2024, 2:09 PM
68 points
4 comments, 6 min read, LW link

Towards the Operationalization of Philosophy & Wisdom

Thane Ruthenis, Oct 28, 2024, 7:45 PM
20 points
2 comments, 33 min read, LW link
(aiimpacts.org)

Self-prediction acts as an emergent regularizer

Oct 23, 2024, 10:27 PM
91 points
9 comments, 4 min read, LW link

Seeking Collaborators

abramdemski, Nov 1, 2024, 5:13 PM
57 points
15 comments, 7 min read, LW link

Shallow review of technical AI safety, 2024

Dec 29, 2024, 12:01 PM
185 points
34 comments, 41 min read, LW link

My AGI safety research—2024 review, ’25 plans

Steven Byrnes, Dec 31, 2024, 9:05 PM
109 points
4 comments, 8 min read, LW link

Ultra-simplified research agenda

Stuart_Armstrong, Nov 22, 2019, 2:29 PM
34 points
4 comments, 1 min read, LW link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov, Jan 18, 2024, 10:05 AM
5 points
2 comments, 4 min read, LW link

Four visions of Transformative AI success

Steven Byrnes, Jan 17, 2024, 8:45 PM
112 points
22 comments, 15 min read, LW link

The Plan − 2023 Version

johnswentworth, Dec 29, 2023, 11:34 PM
152 points
40 comments, 31 min read, LW link, 1 review

Assessment of AI safety agendas: think about the downside risk

Roman Leventov, Dec 19, 2023, 9:00 AM
13 points
1 comment, 1 min read, LW link

Embedded Curiosities

Nov 8, 2018, 2:19 PM
91 points
1 comment, 2 min read, LW link

Subsystem Alignment

Nov 6, 2018, 4:16 PM
102 points
12 comments, 1 min read, LW link

Robust Delegation

Nov 4, 2018, 4:38 PM
116 points
10 comments, 1 min read, LW link

Embedded World-Models

Nov 2, 2018, 4:07 PM
96 points
16 comments, 1 min read, LW link

Decision Theory

Oct 31, 2018, 6:41 PM
121 points
45 comments, 1 min read, LW link

Announcing Human-aligned AI Summer School

May 22, 2024, 8:55 AM
50 points
0 comments, 1 min read, LW link
(humanaligned.ai)

The Shortest Path Between Scylla and Charybdis

Thane Ruthenis, Dec 18, 2023, 8:08 PM
50 points
8 comments, 5 min read, LW link

Research agenda: Supervising AIs improving AIs

Apr 29, 2023, 5:09 PM
76 points
5 comments, 19 min read, LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper, Dec 5, 2023, 4:48 PM
125 points
30 comments, 13 min read, LW link

Sections 1 & 2: Introduction, Strategy and Governance

JesseClifton, Dec 17, 2019, 9:27 PM
35 points
8 comments, 14 min read, LW link

Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms

JesseClifton, Dec 17, 2019, 9:46 PM
20 points
2 comments, 12 min read, LW link

Sections 5 & 6: Contemporary Architectures, Humans in the Loop

JesseClifton, Dec 20, 2019, 3:52 AM
27 points
4 comments, 10 min read, LW link

Section 7: Foundations of Rational Agency

JesseClifton, Dec 22, 2019, 2:05 AM
14 points
4 comments, 8 min read, LW link

Acknowledgements & References

JesseClifton, Dec 14, 2019, 7:04 AM
6 points
0 comments, 14 min read, LW link

Alignment proposals and complexity classes

evhub, Jul 16, 2020, 12:27 AM
40 points
26 comments, 13 min read, LW link

Orthogonal’s Formal-Goal Alignment theory of change

Tamsin Leake, May 5, 2023, 10:36 PM
69 points
13 comments, 4 min read, LW link
(carado.moe)

The Goodhart Game

John_Maxwell, Nov 18, 2019, 11:22 PM
13 points
5 comments, 5 min read, LW link

[Linkpost] Interpretability Dreams

DanielFilan, May 24, 2023, 9:08 PM
39 points
2 comments, 2 min read, LW link
(transformer-circuits.pub)

My AI Alignment Research Agenda and Threat Model, right now (May 2023)

Nicholas / Heather Kross, May 28, 2023, 3:23 AM
25 points
0 comments, 6 min read, LW link
(www.thinkingmuchbetter.com)

Abstraction is Bigger than Natural Abstraction

Nicholas / Heather Kross, May 31, 2023, 12:00 AM
18 points
0 comments, 5 min read, LW link
(www.thinkingmuchbetter.com)

[Question] Does anyone’s full-time job include reading and understanding all the most-promising formal AI alignment work?

Nicholas / Heather Kross, Jun 16, 2023, 2:24 AM
15 points
2 comments, 1 min read, LW link

My research agenda in agent foundations

Alex_Altair, Jun 28, 2023, 6:00 PM
72 points
9 comments, 11 min read, LW link

My Alignment Timeline

Nicholas / Heather Kross, Jul 3, 2023, 1:04 AM
22 points
0 comments, 2 min read, LW link

My Central Alignment Priority (2 July 2023)

Nicholas / Heather Kross, Jul 3, 2023, 1:46 AM
12 points
1 comment, 3 min read, LW link

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

Stuart_Armstrong, Sep 17, 2021, 3:24 PM
32 points
3 comments, 2 min read, LW link

Testing The Natural Abstraction Hypothesis: Project Update

johnswentworth, Sep 20, 2021, 3:44 AM
88 points
17 comments, 8 min read, LW link, 1 review

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Stuart_Armstrong, Sep 20, 2021, 11:56 AM
14 points
4 comments, 3 min read, LW link

The Plan

johnswentworth, Dec 10, 2021, 11:41 PM
260 points
78 comments, 14 min read, LW link, 1 review

Paradigm-building: Introduction

Cameron Berg, Feb 8, 2022, 12:06 AM
28 points
0 comments, 2 min read, LW link

Acceptability Verification: A Research Agenda

Jul 12, 2022, 8:11 PM
50 points
0 comments, 1 min read, LW link
(docs.google.com)

Gaia Network: a practical, incremental pathway to Open Agency Architecture

Dec 20, 2023, 5:11 PM
22 points
8 comments, 16 min read, LW link

Remarks 1–18 on GPT (compressed)

Cleo Nardo, Mar 20, 2023, 10:27 PM
145 points
35 comments, 31 min read, LW link

(My understanding of) What Everyone in Technical Alignment is Doing and Why

Aug 29, 2022, 1:23 AM
413 points
90 comments, 37 min read, LW link, 1 review

Distilled Representations Research Agenda

Oct 18, 2022, 8:59 PM
15 points
2 comments, 8 min read, LW link

My AGI safety research—2022 review, ’23 plans

Steven Byrnes, Dec 14, 2022, 3:15 PM
51 points
10 comments, 7 min read, LW link

An overview of some promising work by junior alignment researchers

Orpheus16, Dec 26, 2022, 5:23 PM
34 points
0 comments, 4 min read, LW link

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper, May 21, 2024, 8:15 PM
157 points
16 comments, 3 min read, LW link

World-Model Interpretability Is All We Need

Thane Ruthenis, Jan 14, 2023, 7:37 PM
36 points
22 comments, 21 min read, LW link

Selection Theorems: A Program For Understanding Agents

johnswentworth, Sep 28, 2021, 5:03 AM
128 points
28 comments, 6 min read, LW link, 2 reviews

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes, Feb 10, 2023, 7:22 PM
71 points
19 comments, 9 min read, LW link

Gradient Descent on the Human Brain

Apr 1, 2024, 10:39 PM
59 points
5 comments, 2 min read, LW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ank, Feb 22, 2025, 12:12 AM
1 point
0 comments, 6 min read, LW link

EIS VII: A Challenge for Mechanists

scasper, Feb 18, 2023, 6:27 PM
36 points
4 comments, 3 min read, LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper, Feb 19, 2023, 3:25 PM
30 points
5 comments, 4 min read, LW link

Resources for AI Alignment Cartography

Gyrodiot, Apr 4, 2020, 2:20 PM
45 points
8 comments, 9 min read, LW link

Introducing the Longevity Research Institute

sarahconstantin, May 8, 2018, 3:30 AM
54 points
20 comments, 1 min read, LW link
(srconstantin.wordpress.com)

Announcement: AI alignment prize round 3 winners and next round

cousin_it, Jul 15, 2018, 7:40 AM
93 points
7 comments, 1 min read, LW link

Machine Learning Projects on IDA

Jun 24, 2019, 6:38 PM
49 points
3 comments, 2 min read, LW link

AI Alignment Research Overview (by Jacob Steinhardt)

Ben Pace, Nov 6, 2019, 7:24 PM
44 points
0 comments, 7 min read, LW link
(docs.google.com)

Creating Welfare Biology: A Research Proposal

ozymandias, Nov 16, 2017, 7:06 PM
20 points
5 comments, 4 min read, LW link

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

Fernando Avalos, Sep 9, 2024, 3:33 AM
6 points
1 comment, 1 min read, LW link
(forum.effectivealtruism.org)

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov, May 8, 2023, 9:26 PM
18 points
2 comments, 7 min read, LW link
(yoshuabengio.org)

H-JEPA might be technically alignable in a modified form

Roman Leventov, May 8, 2023, 11:04 PM
12 points
2 comments, 7 min read, LW link

Roadmap for a collaborative prototype of an Open Agency Architecture

Deger Turan, May 10, 2023, 5:41 PM
31 points
0 comments, 12 min read, LW link

Notes on the importance and implementation of safety-first cognitive architectures for AI

Brendon_Wong, May 11, 2023, 10:03 AM
3 points
0 comments, 3 min read, LW link

EIS IX: Interpretability and Adversaries

scasper, Feb 20, 2023, 6:25 PM
30 points
8 comments, 8 min read, LW link

Research Agenda in reverse: what *would* a solution look like?

Stuart_Armstrong, Jun 25, 2019, 1:52 PM
35 points
25 comments, 1 min read, LW link

Announcing: The Independent AI Safety Registry

Shoshannah Tekofsky, Dec 26, 2022, 9:22 PM
53 points
9 comments, 1 min read, LW link

Forecasting AI Progress: A Research Agenda

rossg, Aug 10, 2020, 1:04 AM
39 points
4 comments, 1 min read, LW link

Technical AGI safety research outside AI

Richard_Ngo, Oct 18, 2019, 3:00 PM
43 points
3 comments, 3 min read, LW link

Why I am not currently working on the AAMLS agenda

jessicata, Jun 1, 2017, 5:57 PM
28 points
3 comments, 5 min read, LW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King, Jun 2, 2023, 9:54 PM
7 points
4 comments, 16 min read, LW link

[Question] Research ideas (AI Interpretability & Neurosciences) for a 2-months project

flux, Jan 8, 2023, 3:36 PM
3 points
1 comment, 1 min read, LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper, Feb 21, 2023, 4:59 PM
14 points
4 comments, 3 min read, LW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev, Jun 19, 2023, 2:32 AM
4 points
2 comments, 7 min read, LW link

Natural abstractions are observer-dependent: a conversation with John Wentworth

Martín Soto, Feb 12, 2024, 5:28 PM
39 points
13 comments, 7 min read, LW link

RFC: a tool to create a ranked list of projects in explainable AI

eamag, Apr 6, 2025, 9:18 PM
2 points
0 comments, 1 min read, LW link
(eamag.me)

Partial Simulation Extrapolation: A Proposal for Building Safer Simulators

lukemarks, Jun 17, 2023, 1:55 PM
16 points
0 comments, 10 min read, LW link

[UPDATE: deadline extended to July 24!] New wind in rationality’s sails: Applications for Epistea Residency 2023 are now open

Jul 11, 2023, 11:02 AM
80 points
7 comments, 3 min read, LW link

The AI Control Problem in a wider intellectual context

philosophybear, Jan 13, 2023, 12:28 AM
11 points
3 comments, 12 min read, LW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
318 points
30 comments, 18 min read, LW link, 1 review

Towards White Box Deep Learning

Maciej Satkiewicz, Mar 27, 2024, 6:20 PM
18 points
5 comments, 1 min read, LW link
(arxiv.org)

Which of these five AI alignment research projects ideas are no good?

rmoehn, Aug 8, 2019, 7:17 AM
25 points
13 comments, 1 min read, LW link

Funding Good Research

lukeprog, May 27, 2012, 6:41 AM
38 points
44 comments, 2 min read, LW link

The Löbian Obstacle, And Why You Should Care

lukemarks, Sep 7, 2023, 11:59 PM
18 points
6 comments, 2 min read, LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

Oct 3, 2023, 7:45 AM
17 points
0 comments, 5 min read, LW link

Thoughts On (Solving) Deep Deception

Jozdien, Oct 21, 2023, 10:40 PM
72 points
6 comments, 6 min read, LW link

Notes on effective-altruism-related research, writing, testing fit, learning, and the EA Forum

MichaelA, Mar 28, 2021, 11:43 PM
14 points
0 comments, 4 min read, LW link

AI Safety in a World of Vulnerable Machine Learning Systems

Mar 8, 2023, 2:40 AM
70 points
28 comments, 29 min read, LW link
(far.ai)

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete Citrini, Sep 16, 2021, 4:13 PM
6 points
0 comments, 8 min read, LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy, May 12, 2022, 8:01 PM
58 points
0 comments, 59 min read, LW link

Research agenda: Formalizing abstractions of computations

Erik Jenner, Feb 2, 2023, 4:29 AM
93 points
10 comments, 31 min read, LW link

A multi-disciplinary view on AI safety research

Roman Leventov, Feb 8, 2023, 4:50 PM
46 points
4 comments, 26 min read, LW link

AI learns betrayal and how to avoid it

Stuart_Armstrong, Sep 30, 2021, 9:39 AM
30 points
4 comments, 2 min read, LW link

A FLI postdoctoral grant application: AI alignment via causal analysis and design of agents

PabloAMC, Nov 13, 2021, 1:44 AM
4 points
0 comments, 7 min read, LW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt, Dec 15, 2021, 7:06 PM
16 points
15 comments, 27 min read, LW link

Human-AI Relationality is Already Here

bridgebot, Feb 20, 2025, 7:08 AM
13 points
0 comments, 15 min read, LW link

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMC, Jan 11, 2022, 11:28 AM
19 points
6 comments, 8 min read, LW link

Why Academia is Mostly Not Truth-Seeking

Zero Contradictions, Oct 16, 2024, 7:14 PM
−7 points
6 comments, 1 min read, LW link
(thewaywardaxolotl.blogspot.com)

Paradigm-building: The hierarchical question framework

Cameron Berg, Feb 9, 2022, 4:47 PM
11 points
15 comments, 3 min read, LW link

Question 1: Predicted architecture of AGI learning algorithm(s)

Cameron Berg, Feb 10, 2022, 5:22 PM
13 points
1 comment, 7 min read, LW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg, Feb 11, 2022, 10:23 PM
5 points
1 comment, 10 min read, LW link

NAO Updates, Fall 2024

jefftk, Oct 18, 2024, 12:00 AM
32 points
2 comments, 1 min read, LW link
(naobservatory.org)

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg, Feb 12, 2022, 7:13 PM
5 points
1 comment, 7 min read, LW link

Question 5: The timeline hyperparameter

Cameron Berg, Feb 14, 2022, 4:38 PM
8 points
3 comments, 7 min read, LW link

Shallow review of live agendas in alignment & safety

Nov 27, 2023, 11:10 AM
348 points
73 comments, 29 min read, LW link, 1 review

Paradigm-building: Conclusion and practical takeaways

Cameron Berg, Feb 15, 2022, 4:11 PM
5 points
1 comment, 2 min read, LW link

Agency overhang as a proxy for Sharp left turn

Nov 7, 2024, 12:14 PM
6 points
0 comments, 5 min read, LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse, Feb 28, 2025, 7:27 PM
16 points
0 comments, 21 min read, LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse, Feb 28, 2025, 7:20 PM
25 points
4 comments, 14 min read, LW link

EIS IV: A Spotlight on Feature Attribution/Saliency

scasper, Feb 15, 2023, 6:46 PM
19 points
1 comment, 4 min read, LW link

Give Neo a Chance

ank, Mar 6, 2025, 1:48 AM
3 points
7 comments, 7 min read, LW link

You should delay engineering-heavy research in light of R&D automation

Daniel Paleka, Jan 7, 2025, 2:11 AM
35 points
3 comments, 5 min read, LW link
(newsletter.danielpaleka.com)

Gaia Network: An Illustrated Primer

Jan 18, 2024, 6:23 PM
3 points
2 comments, 15 min read, LW link

Elicit: Language Models as Research Assistants

Apr 9, 2022, 2:56 PM
71 points
6 comments, 13 min read, LW link

EIS II: What is “Interpretability”?

scasper, Feb 9, 2023, 4:48 PM
28 points
6 comments, 4 min read, LW link

EIS III: Broad Critiques of Interpretability Research

scasper, Feb 14, 2023, 6:24 PM
20 points
2 comments, 11 min read, LW link

Conditioning Generative Models for Alignment

Jozdien, Jul 18, 2022, 7:11 AM
60 points
8 comments, 20 min read, LW link

False Positives in Entity-Level Hallucination Detection: A Technical Challenge

MaxKamachee, Jan 14, 2025, 7:22 PM
1 point
0 comments, 2 min read, LW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

Shivam, Jan 30, 2025, 2:44 AM
1 point
0 comments, 11 min read, LW link

[Question] How far along Metr’s law can AI start automating or helping with alignment research?

Christopher King, Mar 20, 2025, 3:58 PM
20 points
21 comments, 1 min read, LW link

Synthetic Neuroscience

hpcfung, Mar 25, 2025, 5:45 PM
2 points
3 comments, 3 min read, LW link

What should AI safety be trying to achieve?

EuanMcLean, May 23, 2024, 11:17 AM
17 points
1 comment, 13 min read, LW link

Retrospective: PIBBSS Fellowship 2024

Dec 20, 2024, 3:55 PM
64 points
1 comment, 4 min read, LW link

Towards empathy in RL agents and beyond: Insights from cognitive science for AI Alignment

Marc Carauleanu, Apr 3, 2023, 7:59 PM
15 points
6 comments, 1 min read, LW link
(clipchamp.com)

EIS XI: Moving Forward

scasper, Feb 22, 2023, 7:05 PM
19 points
2 comments, 9 min read, LW link

For alignment, we should simultaneously use multiple theories of cognition and value

Roman Leventov, Apr 24, 2023, 10:37 AM
23 points
5 comments, 5 min read, LW link

Unaligned AGI & Brief History of Inequality

ank, Feb 22, 2025, 4:26 PM
−20 points
4 comments, 7 min read, LW link

EIS XII: Summary

scasper, Feb 23, 2023, 5:45 PM
18 points
0 comments, 6 min read, LW link

How I think about alignment

Linda Linsefors, Aug 13, 2022, 10:01 AM
31 points
11 comments, 5 min read, LW link

Labor Participation is a High-Priority AI Alignment Risk

alex, Jun 17, 2024, 6:09 PM
6 points
0 comments, 17 min read, LW link

EIS V: Blind Spots In AI Safety Interpretability Research

scasper, Feb 16, 2023, 7:09 PM
57 points
24 comments, 10 min read, LW link

Shard Theory: An Overview

David Udell, Aug 11, 2022, 5:44 AM
166 points
34 comments, 10 min read, LW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn, Jun 8, 2022, 1:18 PM
69 points
2 comments, 21 min read, LW link

[Question] How can we secure more research positions at our universities for x-risk researchers?

Neil Crawford, Sep 6, 2022, 5:17 PM
11 points
0 comments, 1 min read, LW link

AI Existential Safety Fellowships

mmfli, Oct 28, 2023, 6:07 PM
5 points
0 comments, 1 min read, LW link

Trying to understand John Wentworth’s research agenda

Oct 20, 2023, 12:05 AM
93 points
13 comments, 12 min read, LW link

Alignment Org Cheat Sheet

Sep 20, 2022, 5:36 PM
70 points
8 comments, 4 min read, LW link

AISC project: TinyEvals

Jett Janiak, Nov 22, 2023, 8:47 PM
22 points
0 comments, 4 min read, LW link

Generative, Episodic Objectives for Safe AI

Michael Glass, Oct 5, 2022, 11:18 PM
11 points
3 comments, 8 min read, LW link

AISC 2024 - Project Summaries

NickyP, Nov 27, 2023, 10:32 PM
48 points
3 comments, 18 min read, LW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn, Oct 18, 2022, 2:54 PM
37 points
7 comments, 4 min read, LW link

Reinforcement Learning using Layered Morphology (RLLM)

MiguelDev, Dec 1, 2023, 5:18 AM
7 points
0 comments, 29 min read, LW link

A call for a quantitative report card for AI bioterrorism threat models

Juno, Dec 4, 2023, 6:35 AM
12 points
0 comments, 10 min read, LW link

What’s new at FAR AI

Dec 4, 2023, 9:18 PM
41 points
0 comments, 5 min read, LW link
(far.ai)

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI

WillPetillo, Dec 4, 2023, 10:58 PM
37 points
0 comments, 35 min read, LW link

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

scasper, Feb 17, 2023, 8:48 PM
49 points
9 comments, 12 min read, LW link

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow, Mar 6, 2023, 4:16 PM
103 points
12 comments, 1 min read, LW link

Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics… (Updated)

ank, Feb 11, 2025, 3:21 AM
13 points
8 comments, 35 min read, LW link

AI researchers announce NeuroAI agenda

Cameron Berg, Oct 24, 2022, 12:14 AM
37 points
12 comments, 6 min read, LW link
(arxiv.org)

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

Oct 27, 2022, 1:32 AM
135 points
14 comments, 12 min read, LW link

My summary of “Pragmatic AI Safety”

Eleni Angelou, Nov 5, 2022, 12:54 PM
3 points
0 comments, 5 min read, LW link