
Towards mutually assured cooperation

mikko 12 Jun 2025 15:15 UTC
5 points
0 comments 2 min read LW link

Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)

LawrenceC 11 Jun 2025 19:27 UTC
149 points
2 comments 16 min read LW link

Religion for Rationalists

Gordon Seidoh Worley 11 Jun 2025 19:05 UTC
13 points
37 comments 4 min read LW link

How to think with images

Dinkar Juyal 11 Jun 2025 15:49 UTC
5 points
2 comments 15 min read LW link
(dinkarjuyal.github.io)

Difficulties of Eschatological policy making [Linkpost]

Noosphere89 11 Jun 2025 14:12 UTC
11 points
3 comments 3 min read LW link
(jack-clark.net)

Hydra

Matrice Jacobine 11 Jun 2025 14:07 UTC
24 points
0 comments 1 min read LW link
(philosophybear.substack.com)

SafeRLHub: An Interactive Resource for RL Safety and Interpretability

11 Jun 2025 5:47 UTC
3 points
0 comments 7 min read LW link

More on policy arguments and the AB problem

Sniffnoy 11 Jun 2025 4:42 UTC
10 points
0 comments 4 min read LW link

the void

nostalgebraist 11 Jun 2025 3:19 UTC
147 points
28 comments 1 min read LW link
(nostalgebraist.tumblr.com)

Mech interp is not pre-paradigmatic

Lee Sharkey 10 Jun 2025 13:39 UTC
158 points
2 comments 12 min read LW link

Research Without Permission

Priyanka Bharadwaj 10 Jun 2025 7:33 UTC
26 points
1 comment 3 min read LW link

Some Human That I Used to Know (Filk)

Gordon Seidoh Worley 10 Jun 2025 4:29 UTC
11 points
3 comments 1 min read LW link

A quick list of reward hacking interventions

Alex Mallen 10 Jun 2025 0:58 UTC
6 points
0 comments 2 min read LW link

Ghiblification for Privacy

jefftk 10 Jun 2025 0:30 UTC
67 points
30 comments 1 min read LW link
(www.jefftk.com)

Personal Agents: AIs as trusted advisors, caretakers, and user proxies

JWJohnston 9 Jun 2025 21:26 UTC
2 points
0 comments 2 min read LW link

Causation, Correlation, and Confounding: A Graphical Explainer

Tim Hua 9 Jun 2025 20:46 UTC
9 points
2 comments 9 min read LW link

When is it important that open-weight models aren’t released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.

ryan_greenblatt 9 Jun 2025 19:19 UTC
63 points
10 comments 9 min read LW link

METR’s Observations of Reward Hacking in Recent Frontier Models

Daniel Kokotajlo 9 Jun 2025 18:03 UTC
97 points
6 comments 11 min read LW link
(metr.org)

Expectation = intention = setpoint

jimmy 9 Jun 2025 17:33 UTC
31 points
12 comments 13 min read LW link

Identifying “Deception Vectors” In Models

Stephen Martin 9 Jun 2025 17:30 UTC
5 points
0 comments 1 min read LW link
(arxiv.org)