Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
412 points
51 comments · 10 min read · LW link

Biological risk from the mirror world

jasoncrawford · 12 Dec 2024 19:07 UTC
283 points
24 comments · 7 min read · LW link
(newsletter.rootsofprogress.org)

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
201 points
24 comments · 7 min read · LW link

Understanding Shapley Values with Venn Diagrams

Carson L · 6 Dec 2024 21:56 UTC
190 points
29 comments · 1 min read · LW link
(medium.com)

Communications in Hard Mode (My new job at MIRI)

tanagrabeast · 13 Dec 2024 20:13 UTC
185 points
23 comments · 5 min read · LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
151 points
11 comments · 11 min read · LW link
(arxiv.org)

o1: A Technical Primer

Jesse Hoogland · 9 Dec 2024 19:09 UTC
140 points
17 comments · 9 min read · LW link
(www.youtube.com)

Subskills of “Listening to Wisdom”

Raemon · 9 Dec 2024 3:01 UTC
133 points
16 comments · 42 min read · LW link

“Alignment Faking” frame is somewhat fake

Jan_Kulveit · 20 Dec 2024 9:51 UTC
122 points
4 comments · 6 min read · LW link

The Dangers of Mirrored Life

12 Dec 2024 20:58 UTC
118 points
7 comments · 29 min read · LW link
(www.asimov.press)

The Dream Machine

sarahconstantin · 5 Dec 2024 0:00 UTC
116 points
6 comments · 12 min read · LW link
(sarahconstantin.substack.com)

The o1 System Card Is Not About o1

Zvi · 13 Dec 2024 20:30 UTC
116 points
5 comments · 16 min read · LW link
(thezvi.wordpress.com)

o3

Zach Stein-Perlman · 20 Dec 2024 18:30 UTC
115 points
57 comments · 1 min read · LW link

Sorry for the downtime, looks like we got DDosd

habryka · 2 Dec 2024 4:14 UTC
109 points
13 comments · 1 min read · LW link

Ablations for “Frontier Models are Capable of In-context Scheming”

17 Dec 2024 23:58 UTC
107 points
1 comment · 2 min read · LW link

A shortcoming of concrete demonstrations as AGI risk advocacy

Steven Byrnes · 11 Dec 2024 16:48 UTC
103 points
27 comments · 2 min read · LW link

What Goes Without Saying

sarahconstantin · 20 Dec 2024 18:00 UTC
102 points
2 comments · 5 min read · LW link
(sarahconstantin.substack.com)

When Is Insurance Worth It?

kqr · 19 Dec 2024 19:07 UTC
99 points
12 comments · 4 min read · LW link
(entropicthoughts.com)

MIRI’s 2024 End-of-Year Update

Rob Bensinger · 3 Dec 2024 4:33 UTC
98 points
2 comments · 4 min read · LW link

How to replicate and extend our alignment faking demo

Fabien Roger · 19 Dec 2024 21:44 UTC
97 points
0 comments · 2 min read · LW link
(alignment.anthropic.com)

The “Think It Faster” Exercise

Raemon · 11 Dec 2024 19:14 UTC
95 points
13 comments · 14 min read · LW link

AIs Will Increasingly Attempt Shenanigans

Zvi · 16 Dec 2024 15:20 UTC
94 points
4 comments · 26 min read · LW link
(thezvi.wordpress.com)

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith · 18 Dec 2024 18:22 UTC
92 points
8 comments · 62 min read · LW link

2024 Unofficial LessWrong Census/Survey

Screwtape · 2 Dec 2024 5:30 UTC
91 points
42 comments · 1 min read · LW link

Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)

Mati_Roy · 8 Dec 2024 6:57 UTC
84 points
20 comments · 3 min read · LW link

Circling as practice for “just be yourself”

Kaj_Sotala · 16 Dec 2024 7:40 UTC
83 points
4 comments · 4 min read · LW link
(kajsotala.fi)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

3 Dec 2024 21:19 UTC
83 points
7 comments · 41 min read · LW link

Remap your caps lock key

bilalchughtai · 15 Dec 2024 14:03 UTC
80 points
15 comments · 1 min read · LW link

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej · 16 Dec 2024 13:48 UTC
79 points
9 comments · 4 min read · LW link

Should you be worried about H5N1?

gw · 5 Dec 2024 21:11 UTC
79 points
2 comments · 5 min read · LW link
(www.georgeyw.com)

Should there be just one western AGI project?

3 Dec 2024 10:11 UTC
78 points
72 comments · 15 min read · LW link

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
78 points
6 comments · 2 min read · LW link
(arxiv.org)

Effective Evil’s AI Misalignment Plan

lsusr · 15 Dec 2024 7:39 UTC
77 points
9 comments · 3 min read · LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima · 14 Dec 2024 2:52 UTC
75 points
10 comments · 11 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
71 points
1 comment · 2 min read · LW link
(www.neuronpedia.org)

The 2023 LessWrong Review: The Basic Ask

Raemon · 4 Dec 2024 19:52 UTC
71 points
25 comments · 9 min read · LW link

Drexler’s Nanotech Software

PeterMcCluskey · 2 Dec 2024 4:55 UTC
65 points
9 comments · 4 min read · LW link
(bayesianinvestor.com)

A Qualitative Case for LTFF: Filling Critical Ecosystem Gaps

Linch · 3 Dec 2024 21:57 UTC
64 points
2 comments · 1 min read · LW link

RL, but don’t do anything I wouldn’t do

Gunnar_Zarncke · 7 Dec 2024 22:54 UTC
63 points
5 comments · 1 min read · LW link
(arxiv.org)

A case for donating to AI risk reduction (including if you work in AI)

tlevin · 2 Dec 2024 19:05 UTC
61 points
2 comments · 1 min read · LW link

Zen and The Art of Semiconductor Manufacturing

Recurrented · 9 Dec 2024 17:19 UTC
61 points
2 comments · 9 min read · LW link
(futuring.substack.com)

Cognitive Work and AI Safety: A Thermodynamic Perspective

Daniel Murfet · 8 Dec 2024 21:42 UTC
61 points
7 comments · 4 min read · LW link

Intricacies of Feature Geometry in Large Language Models

7 Dec 2024 18:10 UTC
59 points
0 comments · 12 min read · LW link

An Illustrated Summary of “Robust Agents Learn Causal World Model”

Dalcy · 14 Dec 2024 15:02 UTC
57 points
2 comments · 10 min read · LW link

Retrospective: PIBBSS Fellowship 2024

20 Dec 2024 15:55 UTC
54 points
1 comment · 4 min read · LW link

o1 Turns Pro

Zvi · 10 Dec 2024 17:00 UTC
53 points
3 comments · 14 min read · LW link
(thezvi.wordpress.com)

Luck Based Medicine: No Good Very Bad Winter Cured My Hypothyroidism

Elizabeth · 8 Dec 2024 20:10 UTC
53 points
3 comments · 2 min read · LW link
(acesounderglass.com)

Anthropic leadership conversation

Zach Stein-Perlman · 20 Dec 2024 22:00 UTC
52 points
10 comments · 6 min read · LW link
(www.youtube.com)

I Finally Worked Through Bayes’ Theorem (Personal Achievement)

keltan · 5 Dec 2024 2:04 UTC
51 points
6 comments · 9 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger · 9 Dec 2024 17:43 UTC
50 points
0 comments · 9 min read · LW link
(alignment.anthropic.com)