Prioritizing Work

jefftk · May 1, 2025, 2:00 AM
38 points
1 comment · 1 min read · LW link
(www.jefftk.com)

Don’t rely on a “race to the top”

sjadler · May 1, 2025, 12:33 AM
4 points
0 comments · 1 min read · LW link

Meta-Technicalities: Safeguarding Values in Formal Systems

LTM · Apr 30, 2025, 11:43 PM
2 points
0 comments · 3 min read · LW link
(routecause.substack.com)

Obstacles in ARC’s agenda: Finding explanations

David Matolcsi · Apr 30, 2025, 11:03 PM
66 points
1 comment · 17 min read · LW link

State of play of AI progress (and related brakes on an intelligence explosion) [Linkpost]

Noosphere89 · Apr 30, 2025, 7:58 PM
7 points
0 comments · 5 min read · LW link
(www.interconnects.ai)

Don’t accuse your interlocutor of being insufficiently truth-seeking

TFD · Apr 30, 2025, 7:38 PM
18 points
9 comments · 2 min read · LW link
(www.thefloatingdroid.com)

How can we solve diffuse threats like research sabotage with AI control?

Vivek Hebbar · Apr 30, 2025, 7:23 PM
34 points
0 comments · 8 min read · LW link

[Question] Can Narrowing One’s Reference Class Undermine the Doomsday Argument?

Iannoose n. · Apr 30, 2025, 6:24 PM
2 points
0 comments · 1 min read · LW link

[Question] Does there exist an interactive reasoning map tool that lets users visually lay out claims, assign probabilities and confidence levels, and dynamically adjust their beliefs based on weighted influences between connected assertions?

Zack Friedman · Apr 30, 2025, 6:22 PM
3 points
0 comments · 1 min read · LW link

Distilling the Internal Model Principle part II

JoseFaustino · Apr 30, 2025, 5:56 PM
13 points
0 comments · 19 min read · LW link

Research Priorities for Hardware-Enabled Mechanisms (HEMs)

aog · Apr 30, 2025, 5:43 PM
16 points
2 comments · 15 min read · LW link
(www.longview.org)

Video and transcript of talk on automating alignment research

Joe Carlsmith · Apr 30, 2025, 5:43 PM
21 points
0 comments · 24 min read · LW link
(joecarlsmith.com)

Can we safely automate alignment research?

Joe Carlsmith · Apr 30, 2025, 5:37 PM
32 points
6 comments · 48 min read · LW link
(joecarlsmith.com)

Investigating task-specific prompts and sparse autoencoders for activation monitoring

Henk Tillman · Apr 30, 2025, 5:09 PM
16 points
0 comments · 1 min read · LW link
(arxiv.org)

Scaling Laws for Scalable Oversight

Apr 30, 2025, 12:13 PM
19 points
0 comments · 9 min read · LW link

Early Chinese Language Media Coverage of the AI 2027 Report: A Qualitative Analysis

Apr 30, 2025, 11:06 AM
137 points
4 comments · 11 min read · LW link

[Paper] Automated Feature Labeling with Token-Space Gradient Descent

Wuschel Schulz · Apr 30, 2025, 10:22 AM
4 points
0 comments · 4 min read · LW link

A single principle related to many Alignment subproblems?

Q Home · Apr 30, 2025, 9:49 AM
25 points
2 comments · 16 min read · LW link

Interpreting the METR Time Horizons Post

snewman · Apr 30, 2025, 3:03 AM
65 points
12 comments · 10 min read · LW link
(amistrongeryet.substack.com)

Should we expect the future to be good?

Neil Crawford · Apr 30, 2025, 12:36 AM
15 points
0 comments · 14 min read · LW link