There is way too much serendipity

Malmesbury · 19 Jan 2024 19:37 UTC
365 points
56 comments · 7 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Gentleness and the artificial Other

Joe Carlsmith · 2 Jan 2024 18:21 UTC
291 points
33 comments · 11 min read · LW link

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
258 points
66 comments · 28 min read · LW link

MIRI 2024 Mission and Strategy Update

Malo · 5 Jan 2024 0:20 UTC
219 points
44 comments · 8 min read · LW link

Toward A Mathematical Framework for Computation in Superposition

18 Jan 2024 21:06 UTC
203 points
18 comments · 63 min read · LW link

The impossible problem of due process

mingyuan · 16 Jan 2024 5:18 UTC
195 points
64 comments · 14 min read · LW link

This might be the last AI Safety Camp

24 Jan 2024 9:33 UTC
195 points
34 comments · 1 min read · LW link

Introducing Alignment Stress-Testing at Anthropic

evhub · 12 Jan 2024 23:51 UTC
182 points
23 comments · 2 min read · LW link

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

26 Jan 2024 7:22 UTC
160 points
60 comments · 57 min read · LW link

Making every researcher seek grants is a broken model

jasoncrawford · 26 Jan 2024 16:06 UTC
158 points
41 comments · 4 min read · LW link
(rootsofprogress.org)

What’s up with LLMs representing XORs of arbitrary features?

Sam Marks · 3 Jan 2024 19:44 UTC
157 points
61 comments · 16 min read · LW link

Apologizing is a Core Rationalist Skill

johnswentworth · 2 Jan 2024 17:47 UTC
152 points
42 comments · 5 min read · LW link

Deep atheism and AI risk

Joe Carlsmith · 4 Jan 2024 18:58 UTC
146 points
22 comments · 27 min read · LW link

What good is G-factor if you’re dumped in the woods? A field report from a camp counselor.

Hastings · 12 Jan 2024 13:17 UTC
137 points
22 comments · 1 min read · LW link

Processor clock speeds are not how fast AIs think

Ege Erdil · 29 Jan 2024 14:39 UTC
132 points
55 comments · 2 min read · LW link

The case for training frontier AIs on Sumerian-only corpus

15 Jan 2024 16:40 UTC
130 points
15 comments · 3 min read · LW link

Notice When People Are Directionally Correct

Chris_Leong · 14 Jan 2024 14:12 UTC
129 points
8 comments · 2 min read · LW link

An even deeper atheism

Joe Carlsmith · 11 Jan 2024 17:28 UTC
125 points
47 comments · 15 min read · LW link

A Shutdown Problem Proposal

21 Jan 2024 18:12 UTC
125 points
61 comments · 6 min read · LW link

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
123 points
29 comments · 8 min read · LW link
(arxiv.org)

Why I take short timelines seriously

NicholasKees · 28 Jan 2024 22:27 UTC
121 points
29 comments · 4 min read · LW link

Gender Exploration

sapphire · 14 Jan 2024 18:57 UTC
113 points
25 comments · 5 min read · LW link
(open.substack.com)

Four visions of Transformative AI success

Steven Byrnes · 17 Jan 2024 20:45 UTC
112 points
22 comments · 15 min read · LW link

Practically A Book Review: Appendix to “Nonlinear’s Evidence: Debunking False and Misleading Claims” (ThingOfThings)

tailcalled · 3 Jan 2024 17:07 UTC
111 points
25 comments · 2 min read · LW link
(thingofthings.substack.com)

The case for more ambitious language model evals

Jozdien · 30 Jan 2024 0:01 UTC
110 points
30 comments · 5 min read · LW link

‘ petertodd’’s last stand: The final days of open GPT-3 research

mwatkins · 22 Jan 2024 18:47 UTC
109 points
16 comments · 45 min read · LW link

Being nicer than Clippy

Joe Carlsmith · 16 Jan 2024 19:44 UTC
109 points
32 comments · 27 min read · LW link

2023 in AI predictions

jessicata · 1 Jan 2024 5:23 UTC
107 points
35 comments · 5 min read · LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
104 points
22 comments · 17 min read · LW link

Deceptive AI ≠ Deceptively-aligned AI

Steven Byrnes · 7 Jan 2024 16:55 UTC
96 points
19 comments · 6 min read · LW link

Almost everyone I’ve met would be well-served thinking more about what to focus on

Henrik Karlsson · 5 Jan 2024 21:01 UTC
95 points
8 comments · 11 min read · LW link
(www.henrikkarlsson.xyz)

RAND report finds no effect of current LLMs on viability of bioterrorism attacks

StellaAthena · 25 Jan 2024 19:17 UTC
94 points
14 comments · 1 min read · LW link
(www.rand.org)

On the abolition of man

Joe Carlsmith · 18 Jan 2024 18:17 UTC
88 points
18 comments · 41 min read · LW link

The Aspiring Rationalist Congregation

maia · 10 Jan 2024 22:52 UTC
86 points
23 comments · 10 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
83 points
9 comments · 18 min read · LW link

Some Vacation Photos

johnswentworth · 4 Jan 2024 17:15 UTC
82 points
0 comments · 1 min read · LW link

An Introduction To The Mandelbrot Set That Doesn’t Mention Complex Numbers

Yitz · 17 Jan 2024 9:48 UTC
82 points
11 comments · 9 min read · LW link

Palworld development blog post

bhauth · 28 Jan 2024 5:56 UTC
81 points
12 comments · 1 min read · LW link
(note.com)

Survey of 2,778 AI authors: six parts in pictures

KatjaGrace · 6 Jan 2024 4:43 UTC
80 points
1 comment · 2 min read · LW link

Universal Love Integration Test: Hitler

Raemon · 10 Jan 2024 23:55 UTC
76 points
65 comments · 9 min read · LW link

When “yang” goes wrong

Joe Carlsmith · 8 Jan 2024 16:35 UTC
72 points
6 comments · 13 min read · LW link

We need a Science of Evals

22 Jan 2024 20:30 UTC
71 points
13 comments · 9 min read · LW link

The True Story of How GPT-2 Became Maximally Lewd

18 Jan 2024 21:03 UTC
70 points
7 comments · 6 min read · LW link
(youtu.be)

[Repost] The Copenhagen Interpretation of Ethics

mesaoptimizer · 25 Jan 2024 15:20 UTC
70 points
4 comments · 5 min read · LW link
(web.archive.org)

Epistemic Hell

rogersbacon · 27 Jan 2024 17:13 UTC
70 points
20 comments · 14 min read · LW link

InterLab – a toolkit for experiments with multi-agent interactions

22 Jan 2024 18:23 UTC
69 points
0 comments · 8 min read · LW link
(acsresearch.org)

[Question] Will quantum randomness affect the 2028 election?

24 Jan 2024 22:54 UTC
66 points
52 comments · 1 min read · LW link

OpenAI’s Preparedness Framework: Praise & Recommendations

Akash · 2 Jan 2024 16:20 UTC
66 points
1 comment · 7 min read · LW link

The Perceptron Controversy

Yuxi_Liu · 10 Jan 2024 23:07 UTC
65 points
18 comments · 1 min read · LW link
(yuxi-liu-wired.github.io)