AIs will greatly change engineering in AI companies well before AGI

ryan_greenblatt

9 Sep 2025 16:58 UTC

38 points

3 comments, 11 min read, LW link

Large Language Models and the Critical Brain Hypothesis

David Africa

9 Sep 2025 15:45 UTC

28 points

0 comments, 6 min read, LW link

Decision Theory Guarding is Sufficient for Scheming

james.lucassen

9 Sep 2025 14:49 UTC

31 points

3 comments, 2 min read, LW link

Safety cases for Pessimism

michaelcohen

8 Sep 2025 13:26 UTC

16 points

1 comment, 4 min read, LW link

How Can You Tell if You’ve Instilled a False Belief in Your LLM?

james.lucassen

6 Sep 2025 16:45 UTC

14 points

1 comment, 10 min read, LW link

(jlucassen.com)

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder, Clément Dumas, Stewy Slocum and Neel Nanda

5 Sep 2025 12:11 UTC

28 points

1 comment, 7 min read, LW link

Natural Latents: Latent Variables Stable Across Ontologies

johnswentworth and David Lorell

4 Sep 2025 0:33 UTC

110 points

13 comments, 20 min read, LW link

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

ryan_greenblatt

3 Sep 2025 13:21 UTC

150 points

25 comments, 8 min read, LW link

How To Become A Mechanistic Interpretability Researcher

Neel Nanda

2 Sep 2025 23:38 UTC

99 points

12 comments, 55 min read, LW link

Sleeping Experts in the (reflective) Solomonoff Prior

Daniel C and Cole Wyeth

31 Aug 2025 4:55 UTC

16 points

0 comments, 3 min read, LW link

Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)

ryan_greenblatt

27 Aug 2025 17:04 UTC

98 points

2 comments, 3 min read, LW link

AI companies have started saying safeguards are load-bearing

Zach Stein-Perlman

27 Aug 2025 13:00 UTC

51 points

2 comments, 5 min read, LW link

AI Induced Psychosis: A shallow investigation

Tim Hua

26 Aug 2025 20:03 UTC

320 points

38 comments, 26 min read, LW link

Harmless reward hacks can generalize to misalignment in LLMs

Mia Taylor and Owain_Evans

26 Aug 2025 17:32 UTC

46 points

6 comments, 7 min read, LW link

Do-Divergence: A Bound for Maxwell’s Demon

johnswentworth and David Lorell

26 Aug 2025 17:07 UTC

66 points

4 comments, 3 min read, LW link

New Paper on Reflective Oracles & Grain of Truth Problem

Cole Wyeth

26 Aug 2025 0:18 UTC

53 points

0 comments, 1 min read, LW link

Hidden Reasoning in LLMs: A Taxonomy

Rauno Arike, RohanS and Shubhorup Biswas

25 Aug 2025 22:43 UTC

60 points

8 comments, 12 min read, LW link

Notes on cooperating with unaligned AIs

Lukas Finnveden

24 Aug 2025 4:19 UTC

45 points

8 comments, 21 min read, LW link

(blog.redwoodresearch.org)

(∃ Stochastic Natural Latent) Implies (∃ Deterministic Natural Latent)

johnswentworth and David Lorell

22 Aug 2025 21:46 UTC

120 points

8 comments, 9 min read, LW link

One more reason for AI capable of independent moral reasoning: alignment itself and cause prioritisation

Michele Campolo

22 Aug 2025 15:53 UTC

−3 points

0 comments, 3 min read, LW link