
[Paper] Stringological sequence prediction I

Vanessa Kosoy · 7 Apr 2026 9:11 UTC
11 points
0 comments · 2 min read · LW link
(arxiv.org)

AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines

ryan_greenblatt · 6 Apr 2026 16:01 UTC
148 points
7 comments · 13 min read · LW link

There should be $100M grants to automate AI safety

Marius Hobbhahn · 3 Apr 2026 18:44 UTC
56 points
4 comments · 8 min read · LW link

My most common research advice: do quick sanity checks

LawrenceC · 2 Apr 2026 2:41 UTC
36 points
2 comments · 3 min read · LW link

Predicting When RL Training Breaks Chain-of-Thought Monitorability

David Lindner · 1 Apr 2026 10:23 UTC
27 points
0 comments · 5 min read · LW link

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

30 Mar 2026 10:56 UTC
110 points
3 comments · 17 min read · LW link

ControlAI 2025 Impact Report: our progress toward an international ban on ASI

27 Mar 2026 18:10 UTC
79 points
3 comments · 4 min read · LW link
(controlai.com)

Test your best methods on our hard CoT interp tasks

26 Mar 2026 19:24 UTC
55 points
2 comments · 19 min read · LW link

A Toy Environment For Exploring Reasoning About Reward

25 Mar 2026 20:29 UTC
55 points
7 comments · 3 min read · LW link

Metagaming matters for training, evaluation, and oversight

18 Mar 2026 21:26 UTC
68 points
5 comments · 1 min read · LW link
(alignment.openai.com)

“Act-based approval-directed agents”, for IDA skeptics

Steven Byrnes · 18 Mar 2026 18:47 UTC
62 points
8 comments · 5 min read · LW link

New RFP on Interpretability from Schmidt Sciences

Peter Hase · 17 Mar 2026 16:08 UTC
15 points
0 comments · 6 min read · LW link
(schmidtsciences.smapply.io)

Power Steering: Behavior Steering via Layer-to-Layer Jacobian Singular Vectors

Omar Ayyub · 13 Mar 2026 3:55 UTC
20 points
0 comments · 17 min read · LW link

Operationalizing FDT

Vivek Hebbar · 13 Mar 2026 0:12 UTC
90 points
11 comments · 6 min read · LW link

How well do models follow their constitutions?

12 Mar 2026 0:07 UTC
97 points
5 comments · 26 min read · LW link

The Refined Counterfactual Prisoner’s Dilemma: An Attempt to Explode Decision-Theoretic Consequentialism

Chris_Leong · 11 Mar 2026 12:32 UTC
18 points
20 comments · 2 min read · LW link

AIs will be used in “un­hinged” configurations

Arthur Conmy · 11 Mar 2026 11:19 UTC
58 points
3 comments · 4 min read · LW link

The case for satiating cheaply-satisfied AI preferences

Alex Mallen · 10 Mar 2026 18:09 UTC
103 points
7 comments · 23 min read · LW link

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
30 points
2 comments · 5 min read · LW link

Payorian cooperation is easy with Kripke frames

transhumanist_atom_understander · 9 Mar 2026 0:29 UTC
70 points
7 comments · 8 min read · LW link