Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments
10 min read
LW link

Review: Planecrash

L Rudolf L
Dec 27, 2024, 2:18 PM
358 points
45 comments
21 min read
LW link
(nosetgauge.substack.com)

Biological risk from the mirror world

jasoncrawford
Dec 12, 2024, 7:07 PM
333 points
38 comments
7 min read
LW link
(newsletter.rootsofprogress.org)

What Goes Without Saying

sarahconstantin
Dec 20, 2024, 6:00 PM
331 points
28 comments
5 min read
LW link
(sarahconstantin.substack.com)

The Field of AI Alignment: A Postmortem, and What To Do About It

johnswentworth
Dec 26, 2024, 6:48 PM
295 points
160 comments
8 min read
LW link

By default, capital will matter more than ever after AGI

L Rudolf L
Dec 28, 2024, 5:52 PM
288 points
100 comments
16 min read
LW link
(nosetgauge.substack.com)

Orienting to 3 year AGI timelines

Nikola Jurkovic
Dec 22, 2024, 1:15 AM
277 points
51 comments
8 min read
LW link

A Three-Layer Model of LLM Psychology

Jan_Kulveit
Dec 26, 2024, 4:49 PM
217 points
13 comments
8 min read
LW link

Understanding Shapley Values with Venn Diagrams

Carson L
Dec 6, 2024, 9:56 PM
214 points
34 comments
LW link
(medium.com)

Communications in Hard Mode (My new job at MIRI)

tanagrabeast
Dec 13, 2024, 8:13 PM
204 points
25 comments
5 min read
LW link

Frontier Models are Capable of In-context Scheming

Dec 5, 2024, 10:11 PM
203 points
24 comments
7 min read
LW link

Shallow review of technical AI safety, 2024

Dec 29, 2024, 12:01 PM
185 points
34 comments
41 min read
LW link

When Is Insurance Worth It?

kqr
Dec 19, 2024, 7:07 PM
173 points
71 comments
4 min read
LW link
(entropicthoughts.com)

o1: A Technical Primer

Jesse Hoogland
Dec 9, 2024, 7:09 PM
170 points
19 comments
9 min read
LW link
(www.youtube.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Dec 6, 2024, 10:19 PM
165 points
12 comments
11 min read
LW link
(arxiv.org)

Subskills of “Listening to Wisdom”

Raemon
Dec 9, 2024, 3:01 AM
154 points
29 comments
42 min read
LW link

o3

Zach Stein-Perlman
Dec 20, 2024, 6:30 PM
154 points
164 comments
1 min read
LW link

“Alignment Faking” frame is somewhat fake

Jan_Kulveit
Dec 20, 2024, 9:51 AM
151 points
13 comments
6 min read
LW link

What o3 Becomes by 2028

Vladimir_Nesov
Dec 22, 2024, 12:37 PM
147 points
15 comments
5 min read
LW link

The “Think It Faster” Exercise

Raemon
Dec 11, 2024, 7:14 PM
144 points
35 comments
13 min read
LW link

Hire (or Become) a Thinking Assistant

Raemon
Dec 23, 2024, 3:58 AM
137 points
49 comments
8 min read
LW link

The Dangers of Mirrored Life

Dec 12, 2024, 8:58 PM
119 points
9 comments
29 min read
LW link
(www.asimov.press)

The Dream Machine

sarahconstantin
Dec 5, 2024, 12:00 AM
117 points
6 comments
12 min read
LW link
(sarahconstantin.substack.com)

The o1 System Card Is Not About o1

Zvi
Dec 13, 2024, 8:30 PM
116 points
5 comments
16 min read
LW link
(thezvi.wordpress.com)

Ablations for “Frontier Models are Capable of In-context Scheming”

Dec 17, 2024, 11:58 PM
115 points
1 comment
2 min read
LW link

AIs Will Increasingly Attempt Shenanigans

Zvi
Dec 16, 2024, 3:20 PM
114 points
2 comments
26 min read
LW link
(thezvi.wordpress.com)

How to replicate and extend our alignment faking demo

Fabien Roger
Dec 19, 2024, 9:44 PM
113 points
5 comments
2 min read
LW link
(alignment.anthropic.com)

Why I’m Moving from Mechanistic to Prosaic Interpretability

Daniel Tan
Dec 30, 2024, 6:35 AM
113 points
34 comments
5 min read
LW link

Sorry for the downtime, looks like we got DDosd

habryka
Dec 2, 2024, 4:14 AM
112 points
13 comments
1 min read
LW link

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith
Dec 18, 2024, 6:22 PM
105 points
7 comments
62 min read
LW link

A shortcoming of concrete demonstrations as AGI risk advocacy

Steven Byrnes
Dec 11, 2024, 4:48 PM
105 points
27 comments
2 min read
LW link

A breakdown of AI capability levels focused on AI R&D labor acceleration

ryan_greenblatt
Dec 22, 2024, 8:56 PM
104 points
5 comments
6 min read
LW link

[Question] What are the strongest arguments for very short timelines?

Kaj_Sotala
Dec 23, 2024, 9:38 AM
101 points
79 comments
1 min read
LW link

2024 Unofficial LessWrong Census/Survey

Screwtape
Dec 2, 2024, 5:30 AM
101 points
49 comments
1 min read
LW link

The nihilism of NeurIPS

charlieoneill
Dec 20, 2024, 11:58 PM
100 points
7 comments
4 min read
LW link

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Dec 3, 2024, 9:19 PM
100 points
7 comments
41 min read
LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima
Dec 14, 2024, 2:52 AM
98 points
15 comments
11 min read
LW link

MIRI’s 2024 End-of-Year Update

Rob Bensinger
Dec 3, 2024, 4:33 AM
98 points
2 comments
4 min read
LW link

Should you be worried about H5N1?

gw
Dec 5, 2024, 9:11 PM
89 points
2 comments
5 min read
LW link
(www.georgeyw.com)

AIs Will Increasingly Fake Alignment

Zvi
Dec 24, 2024, 1:00 PM
89 points
0 comments
52 min read
LW link
(thezvi.wordpress.com)

Is “VNM-agent” one of several options, for what minds can grow up into?

AnnaSalamon
Dec 30, 2024, 6:36 AM
89 points
55 comments
2 min read
LW link

Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)

Mati_Roy
Dec 8, 2024, 6:57 AM
89 points
21 comments
3 min read
LW link

🇫🇷 Announcing CeSIA: The French Center for AI Safety

Charbel-Raphaël
Dec 20, 2024, 2:17 PM
88 points
2 comments
8 min read
LW link

Circling as practice for “just be yourself”

Kaj_Sotala
Dec 16, 2024, 7:40 AM
86 points
5 comments
4 min read
LW link
(kajsotala.fi)

Some arguments against a land value tax

Matthew Barnett
Dec 29, 2024, 3:17 PM
83 points
40 comments
15 min read
LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Dec 11, 2024, 6:30 AM
82 points
6 comments
2 min read
LW link
(www.neuronpedia.org)

Effective Evil’s AI Misalignment Plan

lsusr
Dec 15, 2024, 7:39 AM
82 points
9 comments
3 min read
LW link

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej
Dec 16, 2024, 1:48 PM
81 points
9 comments
4 min read
LW link

Remap your caps lock key

bilalchughtai
Dec 15, 2024, 2:03 PM
80 points
18 comments
1 min read
LW link

Best-of-N Jailbreaking

Dec 14, 2024, 4:58 AM
78 points
5 comments
2 min read
LW link
(arxiv.org)