
Political sycophancy as a model organism of scheming

12 May 2025, 5:49 PM
26 points
0 comments · 14 min read · LW link

AIs at the current capability level may be important for future safety work

ryan_greenblatt · 12 May 2025, 2:06 PM
57 points
1 comment · 4 min read · LW link

Highly Opinionated Advice on How to Write ML Papers

Neel Nanda · 12 May 2025, 1:59 AM
35 points
2 comments · 32 min read · LW link

Absolute Zero: Alpha Zero for LLM

alapmi · 11 May 2025 20:42 UTC
13 points
2 comments · 1 min read · LW link

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejer · 9 May 2025 21:29 UTC
4 points
0 comments · 6 min read · LW link

Slow corporations as an intuition pump for AI R&D automation

9 May 2025 14:49 UTC
88 points
21 comments · 9 min read · LW link

Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI

Steven Byrnes · 8 May 2025 21:11 UTC
24 points
0 comments · 18 min read · LW link

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

8 May 2025 19:06 UTC
75 points
1 comment · 15 min read · LW link

An alignment safety case sketch based on debate

8 May 2025 15:02 UTC
55 points
13 comments · 25 min read · LW link
(arxiv.org)

UK AISI’s Alignment Team: Research Agenda

7 May 2025 16:33 UTC
107 points
2 comments · 10 min read · LW link

The Sweet Lesson: AI Safety Should Scale With Compute

Jesse Hoogland · 5 May 2025 19:03 UTC
86 points
1 comment · 3 min read · LW link

Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda · 4 May 2025 16:32 UTC
257 points
30 comments · 7 min read · LW link

SimpleStories: A Better Synthetic Dataset and Tiny Models for Interpretability

Lennart Finke · 3 May 2025 14:04 UTC
11 points
0 comments · 1 min read · LW link

Interim Research Report: Mechanisms of Awareness

2 May 2025 20:29 UTC
38 points
5 comments · 8 min read · LW link

What’s going on with AI progress and trends? (As of 5/2025)

ryan_greenblatt · 2 May 2025 19:00 UTC
70 points
7 comments · 8 min read · LW link

My Research Process: Understanding and Cultivating Research Taste

Neel Nanda · 1 May 2025 23:08 UTC
26 points
1 comment · 9 min read · LW link

What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism

Brittany Gelb · 1 May 2025 19:06 UTC
17 points
0 comments · 7 min read · LW link

How can we solve diffuse threats like research sabotage with AI control?

Vivek Hebbar · 30 Apr 2025 19:23 UTC
43 points
0 comments · 8 min read · LW link

Video and transcript of talk on automating alignment research

Joe Carlsmith · 30 Apr 2025 17:43 UTC
21 points
0 comments · 24 min read · LW link
(joecarlsmith.com)

Can we safely automate alignment research?

Joe Carlsmith · 30 Apr 2025 17:37 UTC
53 points
29 comments · 48 min read · LW link
(joecarlsmith.com)