
RogerDearnaley

Karma: 1,967

I’m a staff artificial intelligence engineer working with AI and LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 15 years. I did research in this area during SERI MATS in summer 2025. I’m now looking for work on this topic in the London/Cambridge area of the UK.

Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV

RogerDearnaley · 23 Dec 2025 3:40 UTC
40 points
25 comments · 20 min read · LW link

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaley · 28 May 2025 6:21 UTC
36 points
34 comments · 9 min read · LW link

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley · 23 Jan 2025 6:44 UTC
36 points
3 comments · 4 min read · LW link

[Question] What Other Lines of Work are Safe from AI Automation?

RogerDearnaley · 11 Jul 2024 10:01 UTC
40 points
35 comments · 5 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
64 points
41 comments · 24 min read · LW link

7. Evolution and Ethics

RogerDearnaley · 15 Feb 2024 23:38 UTC
7 points
8 comments · 6 min read · LW link · 1 review

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
46 points
12 comments · 31 min read · LW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley · 1 Feb 2024 21:15 UTC
14 points
15 comments · 13 min read · LW link

Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect

RogerDearnaley · 26 Jan 2024 3:58 UTC
16 points
2 comments · 11 min read · LW link

A Chinese Room Containing a Stack of Stochastic Parrots

RogerDearnaley · 12 Jan 2024 6:29 UTC
20 points
4 comments · 5 min read · LW link · 1 review

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley · 11 Jan 2024 12:56 UTC
36 points
4 comments · 39 min read · LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley · 9 Jan 2024 20:42 UTC
48 points
8 comments · 36 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley · 5 Jan 2024 8:46 UTC
37 points
4 comments · 2 min read · LW link

5. Moral Value for Sentient Animals? Alas, Not Yet

RogerDearnaley · 27 Dec 2023 6:42 UTC
34 points
41 comments · 23 min read · LW link

Interpreting the Learning of Deceit

RogerDearnaley · 18 Dec 2023 8:12 UTC
30 points
14 comments · 9 min read · LW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley · 7 Dec 2023 6:14 UTC
9 points
0 comments · 11 min read · LW link

6. The Mutable Values Problem in Value Learning and CEV

RogerDearnaley · 4 Dec 2023 18:31 UTC
12 points
0 comments · 49 min read · LW link

After Alignment — Dialogue between RogerDearnaley and Seth Herd

2 Dec 2023 6:03 UTC
15 points
2 comments · 25 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC
65 points
30 comments · 11 min read · LW link

4. A Moral Case for Evolved-Sapience-Chauvinism

RogerDearnaley · 24 Nov 2023 4:56 UTC
10 points
0 comments · 4 min read · LW link