LessWrong’s (first) album: I Have Been A Good Bing

1 Apr 2024 7:33 UTC
564 points
174 comments · 11 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
411 points
100 comments · 12 min read · LW link

Thoughts on seed oil

dynomight · 20 Apr 2024 12:29 UTC
347 points
128 comments · 17 min read · LW link
(dynomight.net)

[April Fools’ Day] Introducing Open Asteroid Impact

Linch · 1 Apr 2024 8:14 UTC
334 points
29 comments · 1 min read · LW link
(openasteroidimpact.org)

Express interest in an “FHI of the West”

habryka · 18 Apr 2024 3:32 UTC
268 points
41 comments · 3 min read · LW link

Paul Christiano named as US AI Safety Institute Head of AI Safety

Joel Burget · 16 Apr 2024 16:22 UTC
256 points
58 comments · 1 min read · LW link
(www.commerce.gov)

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
228 points
93 comments · 10 min read · LW link

Funny Anecdote of Eliezer From His Sister

Noah Birnbaum · 22 Apr 2024 22:05 UTC
198 points
6 comments · 2 min read · LW link

[Question] Examples of Highly Counterfactual Discoveries?

johnswentworth · 23 Apr 2024 22:19 UTC
194 points
100 comments · 1 min read · LW link

On Not Pulling The Ladder Up Behind You

Screwtape · 26 Apr 2024 21:58 UTC
188 points
21 comments · 9 min read · LW link

OMMC Announces RIP

1 Apr 2024 23:20 UTC
188 points
5 comments · 2 min read · LW link

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

18 Apr 2024 0:27 UTC
184 points
21 comments · 7 min read · LW link

FHI (Future of Humanity Institute) has shut down (2005–2024)

gwern · 17 Apr 2024 13:54 UTC
176 points
22 comments · 1 min read · LW link
(www.futureofhumanityinstitute.org)

Reconsider the anti-cavity bacteria if you are Asian

Lao Mein · 15 Apr 2024 7:02 UTC
168 points
43 comments · 4 min read · LW link

Ironing Out the Squiggles

Zack_M_Davis · 29 Apr 2024 16:13 UTC
153 points
36 comments · 11 min read · LW link

Daniel Dennett has died (1942–2024)

kave · 19 Apr 2024 16:17 UTC
150 points
5 comments · 1 min read · LW link
(dailynous.com)

Priors and Prejudice

MathiasKB · 22 Apr 2024 15:00 UTC
149 points
31 comments · 7 min read · LW link

LLMs for Alignment Research: a safety priority?

abramdemski · 4 Apr 2024 20:03 UTC
145 points
24 comments · 11 min read · LW link

My experience using financial commitments to overcome akrasia

William Howard · 15 Apr 2024 22:57 UTC
137 points
31 comments · 18 min read · LW link

When is a mind me?

Rob Bensinger · 17 Apr 2024 5:56 UTC
135 points
125 comments · 15 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

A Dozen Ways to Get More Dakka

Davidmanheim · 8 Apr 2024 4:45 UTC
130 points
11 comments · 3 min read · LW link

RTFB: On the New Proposed CAIP AI Bill

Zvi · 10 Apr 2024 18:30 UTC
119 points
14 comments · 34 min read · LW link
(thezvi.wordpress.com)

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments · 4 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · 18 Apr 2024 16:17 UTC
107 points
10 comments · 12 min read · LW link

The first future and the best future

KatjaGrace · 25 Apr 2024 6:40 UTC
106 points
12 comments · 1 min read · LW link
(worldspiritsockpuppet.com)

[Question] What convincing warning shot could help prevent extinction from AI?

13 Apr 2024 18:09 UTC
105 points
18 comments · 2 min read · LW link

Carl Sagan, nuking the moon, and not nuking the moon

eukaryote · 13 Apr 2024 4:08 UTC
103 points
8 comments · 6 min read · LW link
(eukaryotewritesblog.com)

MIRI’s April 2024 Newsletter

Harlan · 12 Apr 2024 23:38 UTC
95 points
0 comments · 3 min read · LW link
(intelligence.org)

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · 3 Apr 2024 12:34 UTC
94 points
22 comments · 22 min read · LW link

Partial value takeover without world takeover

KatjaGrace · 5 Apr 2024 6:20 UTC
89 points
23 comments · 3 min read · LW link
(worldspiritsockpuppet.com)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry · 29 Apr 2024 20:57 UTC
89 points
8 comments · 11 min read · LW link

Rejecting Television

Declan Molony · 23 Apr 2024 4:59 UTC
85 points
10 comments · 6 min read · LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

27 Apr 2024 16:04 UTC
82 points
13 comments · 13 min read · LW link

Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes

16 Apr 2024 10:10 UTC
82 points
12 comments · 8 min read · LW link
(blog.aiimpacts.org)

A couple productivity tips for overthinkers

Steven Byrnes · 20 Apr 2024 16:05 UTC
78 points
13 comments · 4 min read · LW link

Creating unrestricted AI Agents with Command R+

Simon Lermen · 16 Apr 2024 14:52 UTC
77 points
13 comments · 5 min read · LW link

Mid-conditional love

KatjaGrace · 17 Apr 2024 4:00 UTC
76 points
21 comments · 2 min read · LW link
(worldspiritsockpuppet.com)

Coherence of Caches and Agents

johnswentworth · 1 Apr 2024 23:04 UTC
76 points
9 comments · 11 min read · LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors · 29 Apr 2024 10:53 UTC
75 points
4 comments · 2 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
73 points
10 comments · 8 min read · LW link

A Gentle Introduction to Risk Frameworks Beyond Forecasting

pendingsurvival · 11 Apr 2024 18:03 UTC
73 points
10 comments · 27 min read · LW link

Announcing Suffering For Good

Garrett Baker · 1 Apr 2024 17:08 UTC
72 points
5 comments · 1 min read · LW link

Prompts for Big-Picture Planning

Raemon · 13 Apr 2024 3:04 UTC
72 points
1 comment · 3 min read · LW link

LW Frontpage Experiments! (aka “Take the wheel, Shoggoth!”)

23 Apr 2024 3:58 UTC
71 points
27 comments · 5 min read · LW link

Text Posts from the Kids Group: 2020

jefftk · 13 Apr 2024 22:30 UTC
69 points
3 comments · 19 min read · LW link
(www.jefftk.com)

Motivation gaps: Why so much EA criticism is hostile and lazy

titotal · 22 Apr 2024 11:49 UTC
69 points
5 comments · 1 min read · LW link
(titotal.substack.com)

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan · 11 Apr 2024 21:30 UTC
69 points
10 comments · 107 min read · LW link

The Inner Ring by C. S. Lewis

Saul Munn · 24 Apr 2024 22:48 UTC
69 points
6 comments · 13 min read · LW link
(www.lewissociety.org)

How We Picture Bayesian Agents

8 Apr 2024 18:12 UTC
69 points
14 comments · 7 min read · LW link