mattmacdermott

Karma: 1,261

Is instrumental convergence a thing for virtue-driven agents?

mattmacdermott, 2 Apr 2025 3:59 UTC

33 points

37 comments, 2 min read

Validating against a misalignment detector is very different to training against one

mattmacdermott, 4 Mar 2025 15:41 UTC

39 points

4 comments, 4 min read

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

Yoshua Bengio, Jesse Richardson, dwk and mattmacdermott

24 Feb 2025 18:31 UTC

44 points

15 comments, 11 min read

Context-dependent consequentialism

Jeremy Gillen and mattmacdermott

4 Nov 2024 9:29 UTC

31 points

6 comments, 27 min read

Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)

mattmacdermott, 1 Sep 2024 7:46 UTC

26 points

0 comments, 5 min read

(yoshuabengio.org)

Bengio’s Alignment Proposal: “Towards a Cautious Scientist AI with Convergent Safety Bounds”

mattmacdermott, 29 Feb 2024 13:59 UTC

76 points

19 comments, 14 min read

(yoshuabengio.org)

mattmacdermott’s Shortform

mattmacdermott, 3 Jan 2024 9:08 UTC

4 points

36 comments

What’s next for the field of Agent Foundations?

Nora_Ammann, Alexander Gietelink Oldenziel and mattmacdermott

30 Nov 2023 17:55 UTC

59 points

23 comments, 10 min read

Optimisation Measures: Desiderata, Impossibility, Proposals

mattmacdermott and Alexander Gietelink Oldenziel

7 Aug 2023 15:52 UTC

36 points

9 comments, 1 min read

Reward Hacking from a Causal Perspective

tom4everitt, Francis Rhys Ward, sbenthall, James Fox, mattmacdermott and RyanCarey

21 Jul 2023 18:27 UTC

29 points

6 comments, 7 min read

Incentives from a causal perspective

tom4everitt, James Fox, RyanCarey, mattmacdermott, sbenthall and Jonathan Richens

10 Jul 2023 17:16 UTC

27 points

0 comments, 6 min read

Agency from a causal perspective

tom4everitt, mattmacdermott, James Fox, Francis Rhys Ward and Jonathan Richens

30 Jun 2023 17:37 UTC

40 points

5 comments, 6 min read

Introduction to Towards Causal Foundations of Safe AGI

tom4everitt, Lewis Hammond, Francis Rhys Ward, RyanCarey, James Fox, mattmacdermott and sbenthall

12 Jun 2023 17:55 UTC

70 points

6 comments, 4 min read

Some Summaries of Agent Foundations Work

mattmacdermott, 15 May 2023 16:09 UTC

62 points

1 comment, 13 min read

Towards Measures of Optimisation

mattmacdermott and Alexander Gietelink Oldenziel

12 May 2023 15:29 UTC

53 points

37 comments, 4 min read

Normative vs Descriptive Models of Agency

mattmacdermott, 2 Feb 2023 20:28 UTC

26 points

5 comments, 4 min read