Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · 5 Oct 2023 21:01 UTC
288 points
22 comments · 2 min read · LW link · 1 review
(transformer-circuits.pub)

Alignment Implications of LLM Successes: a Debate in One Act

Zack_M_Davis · 21 Oct 2023 15:22 UTC
247 points
51 comments · 13 min read · LW link · 1 review

Book Review: Going Infinite

Zvi · 24 Oct 2023 15:00 UTC
242 points
113 comments · 97 min read · LW link · 1 review
(thezvi.wordpress.com)

Announcing MIRI’s new CEO and leadership team

Gretta Duleba · 10 Oct 2023 19:22 UTC
222 points
52 comments · 3 min read · LW link

Thoughts on responsible scaling policies and regulation

paulfchristiano · 24 Oct 2023 22:21 UTC
220 points
33 comments · 6 min read · LW link

We’re Not Ready: thoughts on “pausing” and responsible scaling policies

HoldenKarnofsky · 27 Oct 2023 15:19 UTC
200 points
33 comments · 8 min read · LW link

Labs should be explicit about why they are building AGI

peterbarnett · 17 Oct 2023 21:09 UTC
196 points
17 comments · 1 min read · LW link

Announcing Timaeus

22 Oct 2023 11:59 UTC
187 points
15 comments · 4 min read · LW link

AI as a science, and three obstacles to alignment strategies

So8res · 25 Oct 2023 21:00 UTC
185 points
80 comments · 11 min read · LW link

Architects of Our Own Demise: We Should Stop Developing AI Carelessly

Roko · 26 Oct 2023 0:36 UTC
176 points
75 comments · 3 min read · LW link

Evaluating the historical value misspecification argument

Matthew Barnett · 5 Oct 2023 18:34 UTC
173 points
151 comments · 7 min read · LW link · 2 reviews

Thomas Kwa’s MIRI research experience

2 Oct 2023 16:42 UTC
172 points
53 comments · 1 min read · LW link

President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence

Tristan Williams · 30 Oct 2023 11:15 UTC
171 points
39 comments · 1 min read · LW link
(www.whitehouse.gov)

RSPs are pauses done right

evhub · 14 Oct 2023 4:06 UTC
164 points
70 comments · 7 min read · LW link

Holly Elmore and Rob Miles dialogue on AI Safety Advocacy

20 Oct 2023 21:04 UTC
162 points
30 comments · 27 min read · LW link

Announcing Dialogues

Ben Pace · 7 Oct 2023 2:57 UTC
155 points
52 comments · 4 min read · LW link

Will no one rid me of this turbulent pest?

Metacelsus · 14 Oct 2023 15:27 UTC
154 points
23 comments · 10 min read · LW link
(denovo.substack.com)

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)

sudo · 29 Oct 2023 23:09 UTC
154 points
22 comments · 10 min read · LW link
(nitter.net)

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
151 points
29 comments · 14 min read · LW link

At 87, Pearl is still able to change his mind

rotatingpaguro · 18 Oct 2023 4:46 UTC
148 points
15 comments · 5 min read · LW link

Graphical tensor notation for interpretability

Jordan Taylor · 4 Oct 2023 8:04 UTC
140 points
11 comments · 19 min read · LW link

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · 7 Oct 2023 23:30 UTC
137 points
8 comments · 4 min read · LW link

The 99% principle for personal problems

Kaj_Sotala · 2 Oct 2023 8:20 UTC
135 points
20 comments · 2 min read · LW link
(kajsotala.fi)

Don’t Dismiss Simple Alignment Approaches

Chris_Leong · 7 Oct 2023 0:35 UTC
134 points
9 comments · 4 min read · LW link

Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn

Zvi · 5 Oct 2023 11:39 UTC
128 points
29 comments · 9 min read · LW link

Goodhart’s Law in Reinforcement Learning

16 Oct 2023 0:54 UTC
126 points
22 comments · 7 min read · LW link

Responsible Scaling Policies Are Risk Management Done Wrong

simeon_c · 25 Oct 2023 23:46 UTC
122 points
35 comments · 22 min read · LW link · 1 review
(www.navigatingrisks.ai)

Stampy’s AI Safety Info soft launch

5 Oct 2023 22:13 UTC
120 points
9 comments · 2 min read · LW link

I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines

307th · 20 Oct 2023 16:37 UTC
119 points
33 comments · 9 min read · LW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp · 20 Oct 2023 7:32 UTC
119 points
15 comments · 22 min read · LW link

unRLHF—Efficiently undoing LLM safeguards

12 Oct 2023 19:58 UTC
117 points
15 comments · 20 min read · LW link

The Witching Hour

Richard_Ngo · 10 Oct 2023 0:19 UTC
113 points
1 comment · 9 min read · LW link
(www.narrativeark.xyz)

A new intro to Quantum Physics, with the math fixed

titotal · 29 Oct 2023 15:11 UTC
113 points
23 comments · 17 min read · LW link
(titotal.substack.com)

Charbel-Raphaël and Lucius discuss interpretability

30 Oct 2023 5:50 UTC
107 points
7 comments · 21 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
107 points
3 comments · 8 min read · LW link

TOMORROW: the largest AI Safety protest ever!

Holly_Elmore · 20 Oct 2023 18:15 UTC
105 points
26 comments · 2 min read · LW link

Apply for MATS Winter 2023-24!

21 Oct 2023 2:27 UTC
104 points
6 comments · 5 min read · LW link

Value systematization: how values become coherent (and misaligned)

Richard_Ngo · 27 Oct 2023 19:06 UTC
102 points
48 comments · 13 min read · LW link

What’s up with “Responsible Scaling Policies”?

29 Oct 2023 4:17 UTC
99 points
8 comments · 20 min read · LW link

Sam Altman’s sister, Annie Altman, claims Sam has severely abused her

pythagoras5015 · 7 Oct 2023 21:06 UTC
98 points
107 comments · 192 min read · LW link

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · 30 Oct 2023 14:51 UTC
98 points
5 comments · 20 min read · LW link

Truthseeking when your disagreements lie in moral philosophy

10 Oct 2023 0:00 UTC
98 points
4 comments · 4 min read · LW link
(acesounderglass.com)

What’s Hard About The Shutdown Problem

johnswentworth · 20 Oct 2023 21:13 UTC
98 points
33 comments · 4 min read · LW link

I don’t find the lie detection results that surprising (by an author of the paper)

JanB · 4 Oct 2023 17:10 UTC
97 points
8 comments · 3 min read · LW link

[Question] Lying to chess players for alignment

Zane · 25 Oct 2023 17:47 UTC
96 points
54 comments · 1 min read · LW link

Investigating the learning coefficient of modular addition: hackathon project

17 Oct 2023 19:51 UTC
94 points
5 comments · 12 min read · LW link

Symbol/Referent Confusions in Language Model Alignment Experiments

johnswentworth · 26 Oct 2023 19:49 UTC
94 points
44 comments · 6 min read · LW link

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · 23 Oct 2023 22:38 UTC
93 points
12 comments · 9 min read · LW link

Trying to understand John Wentworth’s research agenda

20 Oct 2023 0:05 UTC
92 points
13 comments · 12 min read · LW link

Linkpost: They Studied Dishonesty. Was Their Work a Lie?

Linch · 2 Oct 2023 8:10 UTC
91 points
12 comments · 2 min read · LW link
(www.newyorker.com)