Embedded Agents
(A longer text-based version of this post is also available on MIRI’s blog here, and the bibliography for the whole sequence can be found here)
Thanks to this sequence, I actually have some understanding of what MIRI's Agent Foundations work is about.
This post (and the rest of the sequence) was the first time I had ever read something about AI alignment and thought it was actually asking the right questions. It is not about a sub-problem, nor about marginal improvements. Its goal is a gears-level understanding of agents, and it directly explains why that's hard. It's a list of everything that needs to be figured out in order to remove all the black boxes and Cartesian boundaries, and to understand agents as well as we understand refrigerators.
I nominate this post for two reasons.
One, it is an excellent example of supplemental writing about the basic intuitions and thought processes behind formal work, which is extremely helpful to me because I do not have a good enough command of the formal work to arrive at those intuitions on my own.
Two, it is one of the few examples of experimenting with different kinds of presentation. I feel this is underappreciated and under-utilized; better ways of communicating seem like a strong baseline requirement of the rationality project, and this post pushes in that direction.
This post has significantly changed my mental model of how to understand key challenges in AI safety, and it has given me a clearer understanding of, and language for describing, why complex game-theoretic challenges are poorly specified or understood. The terms and concepts in this series of posts have become a key part of my basic intellectual toolkit.
This sequence was the first time I felt I understood MIRI’s research.
(Though I might prefer to nominate the text-version that has the whole sequence in one post.)
I read this sequence as research for my EA/rationality novel. It was really good, and also pretty easy to follow despite my not having any technical background.