Investigating the consequences of accidentally grading CoT during RL

papetoast 8 May 2026 6:17 UTC
7 points
0 comments · 1 min read · LW link
(alignment.openai.com)

The Frictionless Double

zw5 7 May 2026 23:11 UTC
6 points
4 comments · 8 min read · LW link

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

7 May 2026 20:21 UTC
142 points
13 comments · 8 min read · LW link

Axes of Planning in LLMs + Partial Lit Review

NickyP 7 May 2026 19:53 UTC
12 points
0 comments · 9 min read · LW link
(blog.sus.cat)

A review of “Investigating the consequences of accidentally grading CoT during RL”

Buck 7 May 2026 18:06 UTC
62 points
0 comments · 8 min read · LW link

Try, even if they have you cold

WalterL 7 May 2026 17:19 UTC
75 points
4 comments · 2 min read · LW link

Mechanistic estimation for wide random MLPs

Jacob_Hilton 7 May 2026 16:20 UTC
51 points
1 comment · 5 min read · LW link
(www.alignment.org)

How to get better at chess (and everything else)

Sean Herrington 7 May 2026 11:17 UTC
10 points
0 comments · 3 min read · LW link
(www.chess.com)

Multipolar Civilisation Depends on Maintaining an Attacker’s Dilemma

Naci Cankaya 7 May 2026 11:13 UTC
19 points
1 comment · 5 min read · LW link
(nacicankaya.substack.com)

Sculpted Interaction: a Design-First Approach to AI Alignment

magfrump 6 May 2026 23:47 UTC
14 points
0 comments · 7 min read · LW link

Psychopathy: The Choice

Dawn Drescher 6 May 2026 22:23 UTC
11 points
0 comments · 17 min read · LW link
(impartial-priorities.org)

Many individual CEVs are probably quite bad

Viliam 6 May 2026 20:18 UTC
89 points
27 comments · 3 min read · LW link

Blind deep-deployment evals for control & sabotage

Dylan Bowman 6 May 2026 19:54 UTC
23 points
0 comments · 2 min read · LW link

A draft honesty policy for credible communication with AI systems

6 May 2026 18:50 UTC
3 points
0 comments · 13 min read · LW link
(www.forethought.org)

x-risk-themed

kave 6 May 2026 15:16 UTC
117 points
9 comments · 3 min read · LW link
(kaverennedy.substack.com)

Monday AI Radar #24

Against Moloch 6 May 2026 15:05 UTC
10 points
3 comments · 8 min read · LW link
(againstmoloch.substack.com)

AI Safety at the Frontier: Paper Highlights of April 2026

gasteigerjo 6 May 2026 13:58 UTC
16 points
1 comment · 10 min read · LW link

There is no evidence you should reapply sunscreen every 2 hours.

Hide 6 May 2026 9:19 UTC
47 points
10 comments · 9 min read · LW link
(hidefromit.substack.com)

Building An Ancestor Simulation #2

Mira Kennard 6 May 2026 8:21 UTC
5 points
0 comments · 5 min read · LW link

Psychopathy: The Types

Dawn Drescher 6 May 2026 7:35 UTC
1 point
0 comments · 10 min read · LW link
(impartial-priorities.org)