Towards mutually assured cooperation

mikko 12 Jun 2025 15:15 UTC

5 points

0 comments · 2 min read · LW link

Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)

LawrenceC 11 Jun 2025 19:27 UTC

149 points

2 comments · 16 min read · LW link

Religion for Rationalists

Gordon Seidoh Worley 11 Jun 2025 19:05 UTC

13 points

37 comments · 4 min read · LW link

How to think with images

Dinkar Juyal 11 Jun 2025 15:49 UTC

5 points

2 comments · 15 min read · LW link

(dinkarjuyal.github.io)

Difficulties of Eschatological policy making [Linkpost]

Noosphere89 11 Jun 2025 14:12 UTC

11 points

3 comments · 3 min read · LW link

(jack-clark.net)

Hydra

Matrice Jacobine 11 Jun 2025 14:07 UTC

24 points

0 comments · 1 min read · LW link

(philosophybear.substack.com)

SafeRLHub: An Interactive Resource for RL Safety and Interpretability

Siya and deneille 11 Jun 2025 5:47 UTC

3 points

0 comments · 7 min read · LW link

More on policy arguments and the AB problem

Sniffnoy 11 Jun 2025 4:42 UTC

10 points

0 comments · 4 min read · LW link

the void

nostalgebraist 11 Jun 2025 3:19 UTC

147 points

28 comments · 1 min read · LW link

(nostalgebraist.tumblr.com)

Mech interp is not pre-paradigmatic

Lee Sharkey 10 Jun 2025 13:39 UTC

158 points

2 comments · 12 min read · LW link

Research Without Permission

Priyanka Bharadwaj 10 Jun 2025 7:33 UTC

26 points

1 comment · 3 min read · LW link

Some Human That I Used to Know (Filk)

Gordon Seidoh Worley 10 Jun 2025 4:29 UTC

11 points

3 comments · 1 min read · LW link

A quick list of reward hacking interventions

Alex Mallen 10 Jun 2025 0:58 UTC

6 points

0 comments · 2 min read · LW link

Ghiblification for Privacy

jefftk 10 Jun 2025 0:30 UTC

67 points

30 comments · 1 min read · LW link

(www.jefftk.com)

Personal Agents: AIs as trusted advisors, caretakers, and user proxies

JWJohnston 9 Jun 2025 21:26 UTC

2 points

0 comments · 2 min read · LW link

Causation, Correlation, and Confounding: A Graphical Explainer

Tim Hua 9 Jun 2025 20:46 UTC

9 points

2 comments · 9 min read · LW link

When is it important that open-weight models aren’t released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.

ryan_greenblatt 9 Jun 2025 19:19 UTC

63 points

10 comments · 9 min read · LW link

METR’s Observations of Reward Hacking in Recent Frontier Models

Daniel Kokotajlo 9 Jun 2025 18:03 UTC

97 points

6 comments · 11 min read · LW link

(metr.org)

Expectation = intention = setpoint

jimmy 9 Jun 2025 17:33 UTC

31 points

12 comments · 13 min read · LW link

Identifying “Deception Vectors” In Models

Stephen Martin 9 Jun 2025 17:30 UTC

5 points

0 comments · 1 min read · LW link

(arxiv.org)