The case for countermeasures to memetic spread of misaligned values

Alex Mallen, May 28, 2025, 9:12 PM
22 points
1 comment, 7 min read, LW link

Formalizing Embeddedness Failures in Universal Artificial Intelligence

Cole Wyeth, May 26, 2025, 12:36 PM
39 points
0 comments, 1 min read, LW link
(arxiv.org)

Reward button alignment

Steven Byrnes, May 22, 2025, 5:36 PM
50 points
15 comments, 12 min read, LW link

Unexploitable search: blocking malicious use of free parameters

May 21, 2025, 5:23 PM
34 points
16 comments, 6 min read, LW link

Modeling versus Implementation

Cole Wyeth, May 18, 2025, 1:38 PM
27 points
10 comments, 3 min read, LW link

Problems with instruction-following as an alignment target

Seth Herd, May 15, 2025, 3:41 PM
48 points
14 comments, 10 min read, LW link

Dodging systematic human errors in scalable oversight

Geoffrey Irving, May 14, 2025, 3:19 PM
33 points
3 comments, 4 min read, LW link

Working through a small tiling result

James Payor, May 13, 2025, 8:28 PM
66 points
9 comments, 5 min read, LW link

Measuring Schelling Coordination—Reflections on Subversion Strategy Eval

Graeme Ford, May 12, 2025, 7:06 PM
5 points
0 comments, 8 min read, LW link

Political sycophancy as a model organism of scheming

May 12, 2025, 5:49 PM
39 points
0 comments, 14 min read, LW link

AIs at the current capability level may be important for future safety work

ryan_greenblatt, May 12, 2025, 2:06 PM
81 points
2 comments, 4 min read, LW link

Highly Opinionated Advice on How to Write ML Papers

Neel Nanda, May 12, 2025, 1:59 AM
58 points
4 comments, 32 min read, LW link

Absolute Zero: Alpha Zero for LLM

alapmi, May 11, 2025, 8:42 PM
21 points
13 comments, 1 min read, LW link

Glass box learners want to be black box

Cole Wyeth, May 10, 2025, 11:05 AM
46 points
10 comments, 4 min read, LW link

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejer, May 9, 2025, 9:29 PM
4 points
1 comment, 6 min read, LW link

Slow corporations as an intuition pump for AI R&D automation

May 9, 2025, 2:49 PM
91 points
23 comments, 9 min read, LW link

Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI

Steven Byrnes, May 8, 2025, 9:11 PM
24 points
0 comments, 18 min read, LW link

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

May 8, 2025, 7:06 PM
75 points
1 comment, 15 min read, LW link

An alignment safety case sketch based on debate

May 8, 2025, 3:02 PM
55 points
19 comments, 25 min read, LW link
(arxiv.org)

UK AISI’s Alignment Team: Research Agenda

May 7, 2025, 4:33 PM
109 points
2 comments, 11 min read, LW link