Embedded Agents
(A longer text-based version of this post is also available on MIRI’s blog here, and the bibliography for the whole sequence can be found here.)
Mentioned in:
- Introduction to Cartesian Frames (22 Oct 2020 13:00 UTC; 155 points)
- The Plan - 2023 Version (29 Dec 2023 23:34 UTC; 151 points)
- 2018 Review: Voting Results! (24 Jan 2020 2:00 UTC; 135 points)
- Forecasting Thread: AI Timelines (22 Aug 2020 2:33 UTC; 133 points)
- Philosophy in the Darkest Timeline: Basics of the Evolution of Meaning (7 Jun 2020 7:52 UTC; 132 points)
- Selection Theorems: A Program For Understanding Agents (28 Sep 2021 5:03 UTC; 127 points)
- A Shutdown Problem Proposal (21 Jan 2024 18:12 UTC; 125 points)
- Welcome & FAQ! (24 Aug 2021 20:14 UTC; 114 points)
- Humans Are Embedded Agents Too (23 Dec 2019 19:21 UTC; 82 points)
- Prizes for Last Year’s 2018 Review (2 Dec 2020 11:21 UTC; 72 points)
- Agents Over Cartesian World Models (27 Apr 2021 2:06 UTC; 67 points)
- Against Time in Agent Models (13 May 2022 19:55 UTC; 62 points)
- «Boundaries/Membranes» and AI safety compilation (3 May 2023 21:41 UTC; 57 points)
- What Decision Theory is Implied By Predictive Processing? (28 Sep 2020 17:20 UTC; 56 points)
- [AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment (2 Dec 2020 18:20 UTC; 53 points)
- Sunday October 25, 12:00PM (PT) — Scott Garrabrant on “Cartesian Frames” (21 Oct 2020 3:27 UTC; 48 points)
- Comment on Alignment: “Do what I would have wanted you to do” (13 Jul 2024 1:07 UTC; 45 points)
- Gödel’s Legacy: A game without end (28 Jun 2020 18:50 UTC; 44 points)
- Embedded Agency via Abstraction (26 Aug 2019 23:03 UTC; 42 points)
- [AN #163]: Using finite factored sets for causal and temporal inference (8 Sep 2021 17:20 UTC; 41 points)
- Understanding Selection Theorems (28 May 2022 1:49 UTC; 41 points)
- Simulators, constraints, and goal agnosticism: porbynotes vol. 1 (23 Nov 2022 4:22 UTC; 37 points)
- If brains are computers, what kind of computers are they? (Dennett transcript) (30 Jan 2020 5:07 UTC; 37 points)
- What are the most plausible “AI Safety warning shot” scenarios? (26 Mar 2020 20:59 UTC; 35 points)
- What’s your big idea? (18 Oct 2019 15:47 UTC; 30 points)
- International Relations; States, Rational Actors, and Other Approaches (Policy and International Relations Primer Part 4) (22 Jan 2020 8:29 UTC; 27 points) (EA Forum)
- how has this forum changed your life? (30 Jan 2020 21:54 UTC; 26 points)
- What is abstraction? (15 Dec 2018 8:36 UTC; 25 points)
- [AN #148]: Analyzing generalization across more axes than just accuracy or loss (28 Apr 2021 18:30 UTC; 24 points)
- [AN #105]: The economic trajectory of humanity, and what we might mean by optimization (24 Jun 2020 17:30 UTC; 24 points)
- Comment on The Solomonoff Prior is Malign (28 Dec 2021 2:45 UTC; 23 points)
- Clarifying Factored Cognition (13 Dec 2020 20:02 UTC; 23 points)
- My decomposition of the alignment problem (2 Sep 2024 0:21 UTC; 22 points)
- Theory of Ideal Agents, or of Existing Agents? (13 Sep 2019 17:38 UTC; 20 points)
- Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong (8 Jul 2020 0:27 UTC; 19 points)
- [AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI (5 May 2019 2:20 UTC; 17 points)
- Alignment Newsletter #31 (5 Nov 2018 23:50 UTC; 17 points)
- Comment on The Standard Analogy (5 Jun 2024 12:17 UTC; 16 points)
- [AN #83]: Sample-efficient deep learning with ReMixMatch (22 Jan 2020 18:10 UTC; 15 points)
- [AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL (1 Jan 2020 18:00 UTC; 13 points)
- Comment on Book Review: Design Principles of Biological Circuits (6 Nov 2019 18:03 UTC; 13 points)
- Comment on Maybe Lying Doesn’t Exist (28 Oct 2019 4:27 UTC; 13 points)
- [AN #143]: How to make embedded agents that reason probabilistically about their environments (24 Mar 2021 17:20 UTC; 13 points)
- Comments on Allan Dafoe on AI Governance (29 Nov 2021 16:16 UTC; 13 points)
- Comment on My research methodology (25 Mar 2021 0:47 UTC; 12 points)
- [AN #66]: Decomposing robustness into capability robustness and alignment robustness (30 Sep 2019 18:00 UTC; 12 points)
- Comment on New safety research agenda: scalable agent alignment via reward modeling (31 Dec 2018 23:54 UTC; 12 points)
- Towards deconfusing values (29 Jan 2020 19:28 UTC; 12 points)
- Comment on The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables (16 Dec 2021 22:08 UTC; 12 points)
- Beyond Rewards and Values: A Non-dualistic Approach to Universal Intelligence (30 Dec 2022 19:05 UTC; 10 points)
- What are brains? (10 Jun 2023 14:46 UTC; 10 points)
- Comment on What is it to solve the alignment problem? (25 Aug 2024 10:00 UTC; 8 points)
- Comment on Reframing Impact (20 Sep 2019 22:57 UTC; 8 points)
- Comment on Attainable Utility Preservation: Empirical Results (21 Dec 2021 14:54 UTC; 7 points)
- Comment on Instruction-following AGI is easier and more likely than value aligned AGI (12 Jul 2024 15:34 UTC; 6 points)
- ACI#8: Value as a Function of Possible Worlds (3 Jun 2024 21:49 UTC; 6 points)
- Choice := Anthropics uncertainty? And potential implications for agency (21 Apr 2022 16:38 UTC; 6 points)
- Comment on shminux’s Shortform (15 Aug 2024 9:08 UTC; 6 points)
- Comment on An Undergraduate Reading Of: Semantic information, autonomous agency and non-equilibrium statistical physics (30 Oct 2018 18:35 UTC; 5 points)
- Comment on Dialogue on Appeals to Consequences (1 Dec 2019 18:57 UTC; 4 points)
- Comment on What are the actual arguments in favor of computationalism as a theory of identity? (18 Jul 2024 22:50 UTC; 4 points)
- Comment on When is a mind me? (9 Jul 2024 23:32 UTC; 4 points)
- SlateStarCodex Fika (2 Jan 2021 2:03 UTC; 3 points)
- Decisions with Non-Logical Counterfactuals: request for input (24 Oct 2019 17:23 UTC; 3 points)
- Comment on (A → B) → A in Causal DAGs (23 Jan 2020 17:21 UTC; 2 points)
- Comment on What are the open problems in Human Rationality? (12 Jan 2021 5:58 UTC; 2 points)
- Comment on Stupidity and Dishonesty Explain Each Other Away (30 Dec 2019 5:41 UTC; 2 points)
- Comment on Stephen Fowler’s Shortform (4 Jun 2023 2:13 UTC; 1 point)
- Comment on awg’s Shortform (7 May 2023 15:41 UTC; 1 point)
- Comment on All AGI safety questions welcome (especially basic ones) [July 2022] (19 Jul 2022 2:09 UTC; 1 point)
- Comment on Siebe’s Shortform (13 Feb 2025 7:41 UTC; 1 point)
I actually have some understanding of what MIRI’s Agent Foundations work is about.
This post (and the rest of the sequence) was the first time I had ever read something about AI alignment and thought that it was actually asking the right questions. It is not about a sub-problem; it is not about marginal improvements. Its goal is a gears-level understanding of agents, and it directly explains why that’s hard. It’s a list of everything that needs to be figured out in order to remove all the black boxes and Cartesian boundaries, and understand agents as well as we understand refrigerators.
I nominate this post for two reasons.
One, it is an excellent example of providing supplemental writing about basic intuitions and thought processes, which is extremely helpful to me because I do not have a good enough command of the formal work to arrive at those intuitions myself.
Two, it is one of the few examples of experimenting with different kinds of presentation. I feel like this is underappreciated and under-utilized; better ways of communicating seem like a strong baseline requirement of the rationality project, and this post pushes in that direction.
This post has significantly changed my mental model of how to understand key challenges in AI safety, and it has also given me a clearer understanding of, and language for describing, why complex game-theoretic challenges are poorly specified or understood. The terms and concepts in this series of posts have become a key part of my basic intellectual toolkit.
This sequence was the first time I felt I understood MIRI’s research.
(Though I might prefer to nominate the text version that has the whole sequence in one post.)
I read this sequence as research for my EA/rationality novel. It was really good, and also pretty easy to follow despite my not having any technical background.