Robustness to Scale
I want to quickly draw attention to a concept in AI alignment: Robustness to Scale. Briefly, you want your proposal for an AI to be robust to (or at least fail gracefully under) changes in its level of capabilities. I discuss three different types of robustness to scale: robustness to scaling up, robustness to scaling down, and robustness to relative scale.
The purpose of this post is to communicate, not to persuade. It may be that we want to bite the bullet and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that.
Robustness to scaling up means that your AI system does not depend on not being too powerful. One way to check for this is to think about what would happen if the thing the AI is optimizing for were actually maximized. One example of a failure of robustness to scaling up is when you expect an AI to accomplish a task in a specific way, but it becomes smart enough to find new, creative ways to accomplish the task that you did not think of, and those new ways are disastrous. Another example is when you build an AI that is incentivized to do one thing, but you add restrictions so that the best way to accomplish that thing has a side effect you like; when you scale the AI up, it finds a way around your restrictions.
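To make this concrete, here is a minimal toy sketch (my own illustration; the proxy objective and numbers are invented, not from the post): a proxy reward that agrees with the true goal over the range of plans a weak optimizer can reach, but comes apart from it once a stronger optimizer can reach more extreme plans.

```python
# Toy sketch (invented for illustration): the designers' true goal and the
# proxy the AI optimizes agree on modest plans, but diverge on extreme ones.

def true_value(plan):
    # What the designers actually want: more is better up to a point, then disastrous.
    return plan if plan <= 10 else 20 - plan

def proxy_reward(plan):
    # What the AI is actually optimizing: "more is better", with no cap.
    return plan

def best_plan(capability):
    # Capability bounds how extreme a plan the system can find and execute.
    candidates = [i * capability / 100 for i in range(101)]
    return max(candidates, key=proxy_reward)

for capability in [5, 10, 1000]:
    plan = best_plan(capability)
    print(f"capability={capability:5}: plan={plan:7.1f}, "
          f"proxy={proxy_reward(plan):7.1f}, true value={true_value(plan):7.1f}")
```

At low capability the proxy and the true goal coincide; at high capability the optimizer pushes into the regime where they diverge, which is the failure of robustness to scaling up described above.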
Robustness to scaling down means that your AI system does not depend on being sufficiently powerful. You can’t really make your system still work when it scales down, but you can maybe make sure it fails gracefully. For example, imagine you had a system that tries to predict humans and uses those predictions to figure out what to do. When scaled up all the way, the predictions of humans are completely accurate, and the system will only take actions that the predicted humans would approve of. If you scale down its capabilities, the system may predict the humans incorrectly. These errors may compound as you stack many predicted humans together, and the system can end up optimizing for some seemingly random goal.
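As a toy sketch of how such errors can compound (again my own illustration, under the assumption that each layer of prediction adds independent noise, which the post does not argue for):

```python
import random

def simulate_drift(levels, per_level_error, trials=10_000, seed=0):
    """Average absolute drift of a stack of `levels` imperfect predictions
    from the real human's judgment (taken as 0), with each level adding
    independent Gaussian noise of size `per_level_error`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimate = 0.0
        for _ in range(levels):
            estimate += rng.gauss(0.0, per_level_error)
        total += abs(estimate)
    return total / trials

for levels in [1, 5, 25, 100]:
    print(f"{levels:3} stacked predictions -> mean drift {simulate_drift(levels, 0.1):.2f}")
```

The point is only qualitative: the more layers of imperfect prediction you stack, the further the stack can drift from what the real human would endorse.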
Robustness to relative scale means that your AI system does not depend on its subsystems being similarly powerful to each other. This is easiest to see in systems that depend on adversarial subsystems. If one part of your AI system suggests plans and another part tries to find problems with those plans, then when you scale up the suggester relative to the verifier, the suggester may find plans that are optimized to exploit the verifier’s weaknesses.
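Here is a minimal sketch of that dynamic (my own toy, assuming a verifier with a fixed blind spot): a suggester searching more candidate plans becomes more likely to find one whose problems sit exactly where the verifier never looks.

```python
import random

N_ASPECTS = 10        # each plan has 10 aspects that could hide a problem
VERIFIER_CHECKS = 5   # the verifier only ever inspects aspects 0..4

def shortcut_reward(plan):
    # The suggester is rewarded for problematic "shortcuts" (aspects set to 1).
    return sum(plan)

def verifier_accepts(plan):
    # The verifier rejects any plan with a visible problem -- but it has a blind spot.
    return all(plan[i] == 0 for i in range(VERIFIER_CHECKS))

def best_accepted_plan(suggester_budget, rng):
    """Search `suggester_budget` random plans; return the highest-reward one
    that slips past the verifier, or None if none do."""
    best = None
    for _ in range(suggester_budget):
        plan = [rng.randint(0, 1) for _ in range(N_ASPECTS)]
        if verifier_accepts(plan) and (best is None or
                                       shortcut_reward(plan) > shortcut_reward(best)):
            best = plan
    return best

rng = random.Random(0)
for budget in [2, 20, 2000]:
    plan = best_accepted_plan(budget, rng)
    score = shortcut_reward(plan) if plan else 0
    print(f"suggester budget {budget:5}: shortcuts slipped past the verifier = {score}")
```

With a small search budget the suggester rarely gets anything past the verifier; with a large one it reliably finds plans whose flaws are concentrated in the aspects the verifier never checks.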
My current state is that when I hear proposals for AI alignment that do not feel strongly robust to scale, I become very worried about the plan. Part of this comes from feeling like we are actually very early on a logistic capabilities curve, so I expect that as we scale up capabilities, we can eventually get large differences very quickly. Thus, I expect the scaled-up (and partially scaled-up) versions to actually happen. However, robustness to scale is very difficult to achieve, so it may be that we have to depend on systems that are not very robust, and be careful not to push them too far.
Robustness to scale is still one of my primary explanations for why MIRI-style alignment research is useful, and why alignment work in general should be front-loaded. I am less sure about this specific post as an introduction to the concept (since I had the concept before the post, and don’t know if anyone got it from this post), but I think that distilling concepts floating around meatspace into clear reference works is one of the important functions of LW.
(5 upvotes from a few AF users suggest this post should probably be nominated by an additional AF person, but I’m unsure. I do apologize again for not having a better nomination-endorsement UI.
I think this post may have been relevant to my own thinking, but I’m particularly interested in how relevant the concept has been to other people who think professionally about alignment.)
I think that the terms introduced by this post are great and I use them all the time.
At the time I began writing the previous comment, I felt like I hadn’t directly gotten that much use out of this post. But then, after reflecting a bit on Beyond Astronomical Waste, I realized this had actually been a fairly important concept in some of my other thinking.
I’ve used the concepts in this post a lot when discussing various things related to AI Alignment. I think asking “how robust is this AI design to various ways of scaling up?” has become one of my go-to hammers for evaluating a lot of AI Alignment proposals, and I’ve gotten a lot of mileage out of that.