Corrigibility = Tool-ness?
Goal of This Post
I have never seen anyone give a satisfying intuitive explanation of what corrigibility (in roughly Eliezer’s sense of the word) is. There are lists of desiderata, but they sound like scattered wishlists which don’t obviously point to a unified underlying concept at all. There’s also Eliezer’s extremely meta pointer:
We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.
… and that’s basically it.[1]
In this post, we’re going to explain a reasonably-unified concept which seems like a decent match to “corrigibility” in Eliezer’s sense.
Tools
Starting point: we think of a thing as corrigible exactly insofar as it is usefully thought-of as a tool. A screwdriver, for instance, is an excellent central example of a corrigible object. For AI alignment purposes, the challenge is to achieve corrigibility—i.e. tool-ness—in much more general, capable, and intelligent systems.
… that all probably sounds like a rather nebulous and dubious claim, at this point. In order for it to make sense, we need to think through some key properties of “good tools”, and also how various properties of incorrigibility make something a “bad tool”.
We broke off a separate post on what makes something usefully thought-of as a tool. Key ideas:
Humans tend to solve problems by finding partial plans with “gaps” in them, where the “gaps” are subproblems which the human will figure out later. For instance, I might make a plan to decorate my apartment with some paintings, but leave a “gap” about how exactly to attach the paintings to the wall; I can sort that out later.[2]
Sometimes many similar subproblems show up in my plans, forming a cluster.[3] For instance, there’s a cluster (and many subclusters) of subproblems which involve attaching things together.
Sometimes a thing (a physical object, a technique, whatever) makes it easy to solve a whole cluster of subproblems. That’s what tools are. For instance, a screwdriver makes it easy to solve a whole subcluster of attaching-things-together subproblems.
How does that add up to corrigibility?
Respecting Modularity
One key piece of the above picture is that the gaps/subproblems in humans’ plans are typically modular—i.e. we expect to be able to solve each subproblem without significantly changing the “outer” partial plan, and without a lot of coupling between different subproblems. That’s what makes the partial plan with all its subproblems useful in the first place: it factors the problem into loosely-coupled subproblems.
Claim from the tools post: part of what it means for a tool to solve a subproblem-cluster is that the tool roughly preserves the modularity of that subproblem-cluster. That means the tool should not have a bunch of side effects which might mess with other subproblems, or mess up the outer partial plan. Furthermore, the tool needs to work for a whole subproblem-cluster, and that cluster includes similar subproblems which came up in the context of many different problems. So, the tool needs to robustly not have side effects which mess up the rest of the plan, across a wide range of possibilities for what “the rest of the plan” might be.
Concretely: a screwdriver which sprays flames out the back when turned is a bad tool; it usually can’t be used to solve most screw-turning subproblems when the bigger plan takes place in a wooden building.
Another bad tool: a screwdriver which, when turned, also turns the lights on and off, causes the closest patch of grass to grow twice as fast while the screwdriver is turning, and posts pictures of the user’s hand to Instagram. This one is less directly dangerous, but for screw-turning purposes we’d much rather have a regular screwdriver; it’s inconvenient when the lights suddenly go off and on at the construction site, or we suddenly need to mow again, or my Instagram page is suddenly full of pictures of my hand. (Admittedly the screwdriver with all the side effects would be fun, but in ways which don’t scream “good tool”/“corrigible”.)
So one core property of a “good tool” is that it lacks side effects, in some sense. And it lacks side effects robustly, across a wide range of contexts. A good tool just solves its subproblem, and does little else.
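As a rough illustration of "lacks side effects robustly, across a wide range of contexts", here is a minimal toy sketch. Everything in it is an assumption made up for the example: the dictionary "world", the `screws` key standing in for the subproblem, and the two hypothetical screwdriver functions. It only checks whether a tool changes anything outside its subproblem in each of several contexts; it is not a proposed formalization.

```python
import copy

def good_screwdriver(world, screw):
    """Tighten the named screw and touch nothing else."""
    new_world = copy.deepcopy(world)
    new_world["screws"][screw] = "tight"
    return new_world

def flamethrower_screwdriver(world, screw):
    """Tighten the screw, but also ignite a wooden building."""
    new_world = copy.deepcopy(world)
    new_world["screws"][screw] = "tight"
    if new_world.get("building") == "wood":
        new_world["building"] = "on fire"
    return new_world

def side_effects(tool, world, screw):
    """Return the keys outside the subproblem ("screws") that the tool changed."""
    after = tool(world, screw)
    return sorted(k for k in world if k != "screws" and world[k] != after[k])

def respects_modularity(tool, contexts, screw):
    """A tool passes only if it has no side effects in every context."""
    return all(not side_effects(tool, w, screw) for w in contexts)

# Two hypothetical "rest of the plan" contexts for the same screw-turning subproblem.
contexts = [
    {"screws": {"s1": "loose"}, "building": "steel", "lights": "on"},
    {"screws": {"s1": "loose"}, "building": "wood", "lights": "on"},
]

print(respects_modularity(good_screwdriver, contexts, "s1"))
print(respects_modularity(flamethrower_screwdriver, contexts, "s1"))
```

Note that the flame-spraying screwdriver looks fine in the steel-building context alone; it only fails once the check ranges over many contexts, which is the "robustly" part.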
Visibility and Correctability
Another, less obvious piece of the “tools” characterization: in practice, approximately-all problems are much easier when we can see what’s going on and course-correct along the way.
This is not part of the “defining concept” of a tool; rather, it’s a property of nearly-all subproblems. The practical necessity of a feedback control mechanism is a usually-implicit part of the subproblem: a “good” solution to the subproblem should include visibility and correctability.
Concretely: a drill which doesn’t give the user some feedback when the torque suddenly increases is a “bad drill”—it’s going to result in a lot of stripped screws, cracked wood, etc. An implicit part of the problem the drill is supposed to solve is to not over-screw the screw, and the lack of feedback makes it a lot more likely that minor mistakes or random factors will end in over-screwed screws.
So visibility and correctability are, while not defining properties of a “good tool”, a near-universal implicit requirement in practice.
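The drill example can be sketched as a tiny feedback loop. This is a toy under stated assumptions: the torque numbers, the `torque_limit` threshold, and the `user_stops` callback are all hypothetical names invented for illustration. The point is only that the same drill, on the same screw, ends differently depending on whether the user can see the torque and halt.

```python
def drive_screw(torque_readings, torque_limit, user_stops=None):
    """Drive a screw step by step, optionally exposing torque to the user.

    torque_readings: torque measured at each step (hypothetical units).
    torque_limit: threshold above which we assume the material is damaged.
    user_stops: optional callback that sees each reading and returns True to halt.
    Returns (steps_completed, damaged).
    """
    for step, torque in enumerate(torque_readings):
        # Visibility: the user sees the reading before the step is applied.
        if user_stops is not None and user_stops(torque):
            # Correctability: the user can halt before any damage occurs.
            return step, False
        if torque > torque_limit:
            return step, True
    return len(torque_readings), False

# Torque spikes when the screw bottoms out.
readings = [1.0, 1.2, 1.1, 4.5, 5.0]

blind = drive_screw(readings, torque_limit=4.0)
with_feedback = drive_screw(readings, torque_limit=4.0,
                            user_stops=lambda t: t > 3.0)

print(blind)          # (3, True): no feedback, the screw gets over-driven
print(with_feedback)  # (3, False): the user sees the spike and stops in time
```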
Put that together with “respecting modularity”, and you can start to see how corrigibility is maybe synonymous with good-tool-ness…
Let’s Go Through A List Of Desiderata
Specifically this list of desiderata for corrigibility from Jan Kulveit. We’ll talk about how each of them plays with corrigibility-as-tool-ness.
1. Disutility from resource acquisition—e.g. by some mutual information measure between the AI and distant parts of the environment
We’re not viewing corrigible systems as necessarily utility-maximizing, so “disutility” doesn’t quite fit. That said, discouraging “mutual information between the AI and distant parts of the environment” sure sounds a lot like robustly respecting modularity. (I don’t think mutual information alone is quite the right way to formalize it, but I don’t think Jan intended it that way.)
2. Task uncertainty with reasonable prior on goal drift—the system is unsure about the task it tries to do and seeks human inputs about it.
“Task uncertainty with reasonable prior…” sounds to me like an overly-specific operationalization, but I think this desideratum is gesturing at visibility/correctability.
3. AI which ultimately wants to not exist in future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence
Again, the framing doesn’t quite match tool-ness; tool-ness (“the-stuff-that’s-great-about-screwdrivers”) can allow for systems that “want” stuff or maximize utility, but we are definitely not assuming that. That said, the tool is supposed to not have long-range side-effects, and e.g. actively preserving itself or making copies of itself outside of its task would definitely be a long-range side-effect.
4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast.
Tool-ness doesn’t seem inherently closely related to limited reasoning or limited information processing, except maybe insofar as the processing induces side effects. So corrigibility-as-tool-ness probably doesn’t imply this desideratum at all.
5. Satisficing
Yet again, we’re not necessarily thinking of the corrigible system as a utility maximizer, so there’s a frame mismatch. That said, there’s nothing inherently un-tool-like about a system solving some subproblem really really well, i.e. optimizing the solution really hard. The un-tool-ness kicks in when that optimization pressure starts to have side effects, or starts to interfere with visibility/correctability. So insofar as good tools tend to satisfice at all, they do so because harder optimization might break problem modularity (i.e. have side effects) or break visibility/correctability.
6. Myopia
While we didn’t explicitly talk about myopia above, it sure does intuitively seem like a core part of tool-ness. If a system is optimizing for some long-range goal beyond the task I intend for it, then I would not generally think of that system as a tool. Possibly the “respect modularity” concept could be framed in such a way that myopia naturally follows.
7. Tripwire artifacts. Messing up with some to the system unknown but unrelated parts of the environment is associated with large negative rewards
“Rewards” aren’t necessarily a thing for tools, but I think this is basically gesturing at “no long-range side effects” again.
8. External watchdogs. Smaller and fast external systems able to react quickly to out-of-distribution behavior.
Sounds like visibility and correctability again, though now with respect to some “external watchdog” which might not be the user.
9. Ontological uncertainty about level of simulation.
… dude, if your “corrigible system” is reasoning about things like “level of simulation” at all, then something has probably gone wrong. That’s not how myopia is supposed to work. This desideratum I just reject.
[Desiderata 10–13 are oversight mechanisms, which aren’t really part of the corrigible system itself, so I’m ignoring them.]
14. Imposing strong incentives on internal modularity, and interpretable messaging across module boundaries
This is an interesting one which does not necessarily follow from tool-ness. It’s about visibility/correctability of the tool itself, as opposed to the tool’s outputs. Not sure whether I buy that this should be a defining characteristic of corrigibility or not.[4]
15. Human-level explanations, produced by an independent “translator” system
I definitely maintain that screwdrivers are corrigible, even without an independent “translator” system explaining what they’re doing. So I reject this as a desideratum for corrigibility, though obviously it’s still a nice property to have for powerful systems (and might even be necessary in order for some kinds of systems to achieve visibility and correctability).
That’s all Jan’s desiderata!
Having gone through them, I feel pretty good about the corrigibility = tool-ness mental model. There are some desiderata which aren’t satisfied, but they’re not-satisfied in ways which make me feel like the desiderata are bundling together various nice-to-haves which aren’t necessary for corrigibility per se.
What Would It Look Like To Use A Powerful Corrigible AGI?
One important implication of the corrigibility-as-tool-ness mental model: when using a corrigible system, it is the human operator’s job to figure out what they want, figure out what problems they face, and break off useful subproblems. Things like “figure out what I should want, and then do that” are just not the kind of “task” which a corrigible system takes in; it’s a type error.
(One could take “figure out what I should want” to be the main problem, but then it’s still the human operator’s job to break that still-philosophical-and-possibly-confused question into tool-ready subproblems which together will answer/dissolve the question.)
Of course that doesn’t necessarily mean that I need e.g. a full mathematical specification of every subproblem I want to hand off to a corrigible system; I certainly don’t need any explicit formalism in order to use a screwdriver! But it means that there’s a nontrivial type-signature to “subproblems”, which is different from most “what should?” problems or most deconfusion problems.[5]
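One way to picture the difference in type signature, borrowing the yoghurt example from footnote [5]: a subproblem carries implicit constraints that protect the rest of the plan, while an outermost objective is just a goal. The sketch below is our own toy encoding, not a formalism; the class names, the `World` dictionary, and the 90%-of-savings threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

World = Dict[str, float]

@dataclass
class OutermostObjective:
    """Just a goal; there is no surrounding plan for it to protect."""
    goal: Callable[[World], bool]

@dataclass
class Subproblem:
    """A goal plus implicit constraints that protect the rest of the plan."""
    goal: Callable[[World], bool]
    implicit_constraints: List[Callable[[World, World], bool]] = field(default_factory=list)

    def solved_by(self, before: World, after: World) -> bool:
        # Solved only if the goal holds AND every implicit constraint is respected.
        return self.goal(after) and all(c(before, after) for c in self.implicit_constraints)

buy_yoghurt = Subproblem(
    goal=lambda w: w["yoghurt"] >= 1,
    implicit_constraints=[
        # "Don't spend all your savings" (threshold is an arbitrary assumption).
        lambda before, after: after["savings"] >= 0.9 * before["savings"],
    ],
)

before = {"yoghurt": 0, "savings": 1000.0}
cheap = {"yoghurt": 1, "savings": 998.0}
ruinous = {"yoghurt": 1, "savings": 0.0}

print(buy_yoghurt.solved_by(before, cheap))
print(buy_yoghurt.solved_by(before, ruinous))
```

Treated as a bare `OutermostObjective`, the same goal would count the ruinous outcome as a success, which is roughly the sense in which the two are different types.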
I’ve talked before about how I expect attempts to outsource alignment research to AI to end up bottlenecked on the human outsourcer. If I don’t know what I want, and I’m fundamentally confused about how to figure out what I want (including e.g. how to break it into subproblems), then somewhere along the way I need to do some work which can’t be outsourced to the AI (because part of the work is figuring out what I can even outsource safely). When the AI is corrigible in the sense of tool-ness, that constraint is made much more explicit. The corrigible AI is a tool, and it’s not a tool’s job to figure out what top-level goal I should pursue.
Another way to put it: when using corrigible AI, the “main responsibility” of choosing and structuring the problem falls on the user. We can maybe identify useful subproblems to outsource, but we don’t actually get the option of outsourcing all the difficult work of understanding what we want and becoming less confused. The human operator is “in the driver’s seat”, and has all the difficult problems which come with that responsibility.
Let’s make it concrete: we cannot just ask a powerful corrigible AGI to “solve alignment” for us. There is no corrigible way to perform a task which the user is confused about; tools don’t do that.
From Cognition to Real Patterns?
At the start of the “Tools” section, we used some funny wording:
we think of a thing as corrigible exactly insofar as it is usefully thought-of as a tool
Why not just “a thing is corrigible exactly insofar as it is a good tool”? Why the “we think of a thing as” and “usefully thought-of as” business?
The earlier sections characterized tool-ness from a subjective, cognitive perspective: they were about the conditions in which it’s useful for a particular mind to model something as a tool. That’s the first step of the Cognition → Convergence → Corroboration pipeline[6]. The next step is convergence: we note that there’s a lot of convergence in which things different minds view as tools, and what subproblems different minds view those tools as “for”. That convergence (along with the very small number of examples of tool-usage typically needed to achieve rough convergence) roughly implies that these minds convergently recognize some “real patterns” out in the environment as particular subproblem-clusters, and as tools for those subproblem-clusters.
In other words: there are some patterns in the environment which different people convergently recognize as “subproblems” and “tools” for those subproblems.
The next big question is: what are those patterns in the environment? We’ve characterized tool-ness/corrigibility so far in terms of subjective, internal mental usage, but what patterns in the environment are convergently modeled as tools/subproblem-clusters by many different minds?
We already said some things about those patterns in the previous section. For instance: a good tool should “lack side effects” across a wide variety of contexts. That is, in some sense, a physical fact about a tool or its use. But what counts as a “side effect”? What patterns in the environment are convergently recognized as “side effects”, of the sort which would break tool-ness? That depends on how we typically “carve out” subproblems: the point of a “side effect”, cognitively, is that it potentially interferes with parts of a plan outside the subproblem itself. So, in order to fully ground lack-of-side-effects in environmental patterns (as opposed to internal cognition), we’d need to characterize the environmental patterns which humans convergently “carve out” as subproblems. Note that the “convergence” part of such a characterization would ideally be demonstrable empirically and mathematically, e.g. with the sort of tools used in the toy model of semantics via clustering.
Characterizing the environmental patterns which humans convergently “carve out” as subproblems is an open problem, but you can hopefully now see why such a characterization would be central to understanding corrigibility.
We can pose similar questions about visibility and correctability: insofar as humans agree on what things are more or less visible/correctable, what patterns in the environment do humans convergently recognize as “visibility” and “correctability”? Again, an answer to the question would hopefully involve empirical and mathematical evidence for convergence. And again, answering the question is an open problem, but you can hopefully see why understanding such patterns is also central to understanding corrigibility.
Now let’s move away from particular properties, like lack-of-side-effects or visibility or correctability. New question: are those properties together all we need to convergently recognize some pattern in the environment as corrigible/tool-like? If some other properties are needed, what are they? Yet another open problem.
Why are we interested in all those open problems? Well, intuitively, we expect that corrigible systems will have nice safety properties—like lack-of-side-effects, for example. We want that story to be more than just vague intuition; we want it to be precisely operationalized and provable/testable. And the main “roadmap” we have toward that operationalization is the intuitive story itself. If we could characterize the convergent patterns in the environment which different people recognize as “subproblems” or as “tools” or as “correctability” etc, then intuitively, we expect to find that those patterns-in-the-environment imply provable/testable nice safety properties.
[1] Paul does have a decent explanation of what he means by “corrigibility”, but I think Paul is pointing to a different (though related) concept than Eliezer. Also Paul’s notion of “corrigibility” would entail much weaker safety properties for an AI than Eliezer’s notion.
In the rest of the post, we’re just going to say “corrigibility”, without constantly clarifying what notion of corrigibility we’re talking about.
[2] And the fact that I intend to sort it out later constrains the type signature of this kind of subproblem. More on that later.
[3] Note that the vast majority of subproblems basically-never come up in partial plans; the space of “natural” subproblems is much smaller than what could be mathematically specified.
[4] David leans slightly “yes”, John leans slightly “no.”
[5] Here’s our current best guess at how the type signature of subproblems differs from e.g. an outermost objective. You know how, when you say your goal is to “buy some yoghurt”, there’s a bunch of implicit additional objectives like “don’t spend all your savings”, “don’t turn Japan into computronium”, “don’t die”, etc? Those implicit objectives are about respecting modularity; they’re a defining part of a “gap in a partial plan”. An “outermost objective” doesn’t have those implicit extra constraints, and is therefore of a fundamentally different type from subproblems.
Most of the things you think of day-to-day as “problems” are, cognitively, subproblems.
[6] which we still have not written a post on, and still should
(Written while I’m at the title of “Respecting Modularity”.)
My own working definition of “corrigibility” has been something like “an AI system that obeys commands, and only produces effects through causal pathways that were white-listed by its human operators, with these properties recursively applied to its interactions with its human operators”.
In a basic case, if you tell it to do something, like “copy a strawberry” or “raise the global sanity waterline”, it’s going to give you a step-by-step outline of what it’s going to do, how these actions are going to achieve the goal, how the resultant end-state is going to be structured (the strawberry’s composition, the resultant social order), and what predictable effects all of this would have (both direct effects and side-effects).
So if it’s planning to build some sort of nanofactory that boils the oceans as a side-effect, or deploy Basilisk hacks that exploit some vulnerability in the human psyche to teach people stuff, it’s going to list these pathways, and you’d have the chance to veto them. Then you’ll get it to generate some plans that work through causal pathways you do approve of, like “normal human-like persuasion that doesn’t circumvent the interface of the human mind / doesn’t make the abstraction “the human mind” leak / doesn’t violate the boundaries of the human psyche”.
It’s also going to adhere to this continuously: e.g., if it discovers a new causal pathway and realizes the plan it’s currently executing has effects through it, it’s going to seek urgent approval from the human operators (while somehow safely halting its plan using a procedure for this that it previously designed with its human operators, or something).
And this should somehow apply recursively. The AI should only interact with the operators through pathways they’ve approved of. E.g., using only “mundane” human-like ways to convey information; no deploying Basilisk hacks to force-feed them knowledge, no directly rewriting their brains with nanomachines, not even hacking their phones to be able to talk to them while they’re outside the office.
(How do we get around the infinite recursion here? I have no idea, besides “hard-code some approved pathways into the initial design”.)
And then the relevant set of “causal pathways” probably factors through the multi-level abstract structure of the environment. For any given action, there is some set of consequences that is predictable and goes into the AI’s planning. This set is relatively small, and could be understood by a human. Every consequence outside this “small” set is unpredictable, and basically devolves into high-entropy noise; not even an ASI could predict the outcome. (Think this post.) And if we look at the structure of the predictable-consequences sets across time, we’d find rich symmetries, forming the aforementioned “pathways” through which subsystems/abstractions interact.
(I’ve now read the post.)
This seems to fit pretty well with your definition? Visibility: check, correctability: check. The “side-effects” property only partly fits – by my definition, a corrigible AI is allowed to have all sorts of side-effects, but these side-effects must be known and approved by its human operator – but I think it’s gesturing at the same idea. (Real-life tools also have lots of side effects, e.g. vibration and noise pollution from industrial drills – but we try to minimize these side-effects. And inasmuch as we fail, the resultant tools are considered “bad”, worse than the versions of these tools without the side-effects.)
Now they’ve written the post on this.
https://www.lesswrong.com/posts/fEvCxNte6FKSRNFvN/3c-s-a-recipe-for-mathing-concepts
To me, this seems like disregarding contexts where the concepts are incongruous, to no particular gain.
In one direction, you might want to talk about a corrigible AI doing sweeping, unique, full-of-side-effects things that involve deep reasoning about what humans want, not just filling in a regularly-shaped hole in a human-made plan.
In another direction, you might want to talk about failures of tool-like things to be corrigible, perhaps even tool-like AIs that are hard to correct.
Maybe one way to phrase it is that tool-ness is the cause of powerful corrigible systems, in that it is a feature that can be expressed in reality which has the potential to make there be powerful corrigible systems, and that there are no other known expressible features which become corrigible.
So as notkilleveryoneists, a worst-case scenario would be if we start advocating for suppressing tool-like AIs based on speculative failure modes instead of trying to solve those failure modes, and then start chasing a hypothetical corrigible non-tool that cannot exist.
This seems similar to the natural impact regularization/bounded agency things I’ve been bouncing around. (Though my frame to a greater extent expects it to happen “by default”?) I like your way of describing/framing it.
Strongly agree with this.
To me, “unsure about the task it tries to do” sounds more like applicability to a wide range of problems.
Do you have a starting point for formalizing this? It sounds like subproblems are roughly proxies that could be Goodharted if (common sense) background goals aren’t respected. Maybe a candidate starting point for formalizing subproblems, relative to an outermost objective, is “utility functions that closely match the outermost objective in a narrow domain”?
My current starting point would be standard methods for decomposing optimization problems, like e.g. the sort covered in this course.