Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
412 points
51 comments · 10 min read · LW link

Biological risk from the mirror world

jasoncrawford · 12 Dec 2024 19:07 UTC
283 points
24 comments · 7 min read · LW link
(newsletter.rootsofprogress.org)

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
201 points
24 comments · 7 min read · LW link

Understanding Shapley Values with Venn Diagrams

Carson L · 6 Dec 2024 21:56 UTC
190 points
29 comments · 1 min read · LW link
(medium.com)

Communications in Hard Mode (My new job at MIRI)

tanagrabeast · 13 Dec 2024 20:13 UTC
185 points
23 comments · 5 min read · LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
151 points
11 comments · 11 min read · LW link
(arxiv.org)

o1: A Technical Primer

Jesse Hoogland · 9 Dec 2024 19:09 UTC
140 points
17 comments · 9 min read · LW link
(www.youtube.com)

Subskills of “Listening to Wisdom”

Raemon · 9 Dec 2024 3:01 UTC
133 points
16 comments · 42 min read · LW link

“Alignment Faking” frame is somewhat fake

Jan_Kulveit · 20 Dec 2024 9:51 UTC
122 points
4 comments · 6 min read · LW link

The Dangers of Mirrored Life

12 Dec 2024 20:58 UTC
118 points
7 comments · 29 min read · LW link
(www.asimov.press)

The Dream Machine

sarahconstantin · 5 Dec 2024 0:00 UTC
116 points
6 comments · 12 min read · LW link
(sarahconstantin.substack.com)

The o1 System Card Is Not About o1

Zvi · 13 Dec 2024 20:30 UTC
116 points
5 comments · 16 min read · LW link
(thezvi.wordpress.com)

o3

Zach Stein-Perlman · 20 Dec 2024 18:30 UTC
115 points
57 comments · 1 min read · LW link

Sorry for the downtime, looks like we got DDosd

habryka · 2 Dec 2024 4:14 UTC
109 points
13 comments · 1 min read · LW link

Ablations for “Frontier Models are Capable of In-context Scheming”

17 Dec 2024 23:58 UTC
107 points
1 comment · 2 min read · LW link

A shortcoming of concrete demonstrations as AGI risk advocacy

Steven Byrnes · 11 Dec 2024 16:48 UTC
103 points
27 comments · 2 min read · LW link

What Goes Without Saying

sarahconstantin · 20 Dec 2024 18:00 UTC
102 points
2 comments · 5 min read · LW link
(sarahconstantin.substack.com)

When Is Insurance Worth It?

kqr · 19 Dec 2024 19:07 UTC
99 points
12 comments · 4 min read · LW link
(entropicthoughts.com)

MIRI’s 2024 End-of-Year Update

Rob Bensinger · 3 Dec 2024 4:33 UTC
98 points
2 comments · 4 min read · LW link

How to replicate and extend our alignment faking demo

Fabien Roger · 19 Dec 2024 21:44 UTC
97 points
0 comments · 2 min read · LW link
(alignment.anthropic.com)

The “Think It Faster” Exercise

Raemon · 11 Dec 2024 19:14 UTC
95 points
13 comments · 14 min read · LW link

AIs Will Increasingly Attempt Shenanigans

Zvi · 16 Dec 2024 15:20 UTC
94 points
4 comments · 26 min read · LW link
(thezvi.wordpress.com)

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith · 18 Dec 2024 18:22 UTC
92 points
8 comments · 62 min read · LW link

2024 Unofficial LessWrong Census/Survey

Screwtape · 2 Dec 2024 5:30 UTC
91 points
42 comments · 1 min read · LW link

Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)

Mati_Roy · 8 Dec 2024 6:57 UTC
84 points
20 comments · 3 min read · LW link

Circling as practice for “just be yourself”

Kaj_Sotala · 16 Dec 2024 7:40 UTC
83 points
4 comments · 4 min read · LW link
(kajsotala.fi)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

3 Dec 2024 21:19 UTC
83 points
7 comments · 41 min read · LW link

Remap your caps lock key

bilalchughtai · 15 Dec 2024 14:03 UTC
80 points
15 comments · 1 min read · LW link

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej · 16 Dec 2024 13:48 UTC
79 points
9 comments · 4 min read · LW link

Should you be worried about H5N1?

gw · 5 Dec 2024 21:11 UTC
79 points
2 comments · 5 min read · LW link
(www.georgeyw.com)

Should there be just one western AGI project?

3 Dec 2024 10:11 UTC
78 points
72 comments · 15 min read · LW link

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
78 points
6 comments · 2 min read · LW link
(arxiv.org)

Effective Evil’s AI Misalignment Plan

lsusr · 15 Dec 2024 7:39 UTC
77 points
9 comments · 3 min read · LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima · 14 Dec 2024 2:52 UTC
75 points
10 comments · 11 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
71 points
1 comment · 2 min read · LW link
(www.neuronpedia.org)

The 2023 LessWrong Review: The Basic Ask

Raemon · 4 Dec 2024 19:52 UTC
71 points
25 comments · 9 min read · LW link

Drexler’s Nanotech Software

PeterMcCluskey · 2 Dec 2024 4:55 UTC
65 points
9 comments · 4 min read · LW link
(bayesianinvestor.com)

A Qualitative Case for LTFF: Filling Critical Ecosystem Gaps

Linch · 3 Dec 2024 21:57 UTC
64 points
2 comments · 1 min read · LW link

RL, but don’t do anything I wouldn’t do

Gunnar_Zarncke · 7 Dec 2024 22:54 UTC
63 points
5 comments · 1 min read · LW link
(arxiv.org)

A case for donating to AI risk reduction (including if you work in AI)

tlevin · 2 Dec 2024 19:05 UTC
61 points
2 comments · 1 min read · LW link

Zen and The Art of Semiconductor Manufacturing

Recurrented · 9 Dec 2024 17:19 UTC
61 points
2 comments · 9 min read · LW link
(futuring.substack.com)

Cognitive Work and AI Safety: A Thermodynamic Perspective

Daniel Murfet · 8 Dec 2024 21:42 UTC
61 points
7 comments · 4 min read · LW link

Intricacies of Feature Geometry in Large Language Models

7 Dec 2024 18:10 UTC
59 points
0 comments · 12 min read · LW link

An Illustrated Summary of “Robust Agents Learn Causal World Model”

Dalcy · 14 Dec 2024 15:02 UTC
57 points
2 comments · 10 min read · LW link

Retrospective: PIBBSS Fellowship 2024

20 Dec 2024 15:55 UTC
54 points
1 comment · 4 min read · LW link

o1 Turns Pro

Zvi · 10 Dec 2024 17:00 UTC
53 points
3 comments · 14 min read · LW link
(thezvi.wordpress.com)

Luck Based Medicine: No Good Very Bad Winter Cured My Hypothyroidism

Elizabeth · 8 Dec 2024 20:10 UTC
53 points
3 comments · 2 min read · LW link
(acesounderglass.com)

Anthropic leadership conversation

Zach Stein-Perlman · 20 Dec 2024 22:00 UTC
52 points
10 comments · 6 min read · LW link
(www.youtube.com)

I Finally Worked Through Bayes’ Theorem (Personal Achievement)

keltan · 5 Dec 2024 2:04 UTC
51 points
6 comments · 9 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger · 9 Dec 2024 17:43 UTC
50 points
0 comments · 9 min read · LW link
(alignment.anthropic.com)