Also known as Raelifin: https://www.lesswrong.com/users/raelifin
Max Harms
Some of this seems right to me, but the general points seem wrong. I agree that insofar as a subprocess resembles an agent, there will be a natural pressure for it to resemble a corrigible agent. Pursuit of e.g. money is all well and good until it stomps the original ends it was supposed to serve—this is akin to a corrigibility failure. The terminal-goal seeking cognition needs to be able to abort, modify, and avoid babysitting its subcognition.
One immediate thing to flag is that when you start talking about chefs in the restaurant, those other chefs are working towards the same overall ends. And the point about predictability and visibility only applies to them. Indeed, we don’t really need the notion of instrumentality here: I expect two agents that each know the other to be working towards the same ends to naturally want to coordinate, including by making their actions legible to each other.
One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we’ve only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant’s success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.
This is, I think, the cruxy part of this essay. Knowing that an agent won’t want to build incorrigible limbs, and that we should therefore expect corrigibility as a natural property of (agentic) limbs, isn’t very important. What’s important is whether we can build an AI that’s more like a limb, or that we expect to gravitate in that direction, even as it becomes vastly more powerful than the supervising process.
(Side note: I do wish you’d talked a bit about a restaurant owner, in your metaphor; having an overall cognition that’s steering the chefs towards the terminal ends is a natural part of the story, and if you deny the restaurant has to have an owner, I think that’s a big enough move that I want you to spell it out more.)
So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is ‘just’ corrigible toward instrumentally-convergent subgoals.
I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.
But perhaps you mean you want to set up an agent which is serving the terminal goals of others? (The nearest person? The aggregate will of the collective? The collective will of the non-anthropomorphic universe?) If it has money in its pocket, do I get to spend that money? Why? Why not expect that in the process of this agent getting good at doing things, it learns to guard its resources from pesky monkeys in the environment? In general I feel like you’ve just gestured at the problem in a vague way without proposing anything that looks to me like a solution. :\
Thanks for noticing the typo. I’ve updated that section to try and be clearer. LMK if you have further suggestions on how it could be made better.
That’s an interesting proposal! I think something like it might be able to work, though I worry about details. For instance, suppose there’s a Propagandist who gives resources to agents that brainwash their principals into having certain values. If “teach me about philosophy” comes with an influence budget, it seems critical that the AI doesn’t spend that budget trading with the Propagandist, and instead does so in a more “central” way.
Still, the idea of instructions carrying a degree of approved influence seems promising.
Sure, let’s talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness
More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this; if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part.
(AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I’m also probably missing big parts of their perspectives, and generally don’t trust myself to pass their ITT.)
[1] The term “anti-natural” is bad in that it seems to be the opposite of “natural,” but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word “natural” describes besides these ways of thinking. The more complete version of “anti-natural” according to me would be “anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence” but obviously we need a shorthand term, and ideally one that doesn’t breed confusion.
If I’m hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.
How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?
Cool. Thanks for the clarification. I think what you call “anti-naturality” you should be calling “non-end-state consequentialism,” but I’m not very interested in linguistic turf-wars.
It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex[1] than the gridworld and cannot be solved by brute-force. But the number of backflips performed is certainly not something that can be measured at any given timeslice, including the “end-state.”
If caring about historical facts is easy and common, why is it important to split this off and distinguish it?
[1] Though admittedly this situation is still selected for being simple enough to reason about. If needed I believe this point holds through AGI-level complexity, but things tend to get more muddled as things get more complex, and I’d prefer sticking to the minimal demonstration.
I talk about the issue of creating corrigible subagents here. What do you think of that?
I may not understand your thing fully, but here’s my high-level attempt to summarize your idea: IPP-agents won’t care about the difference between building a corrigible agent vs an incorrigible agent because they model that if humans decide something’s off and try to shut everything down, they will also get shut down and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP agent to make corrigible sub-agents, it won’t have the standard reason to resist: that incorrigible sub-agents make more money than corrigible ones. Thus if we build an obedient IPP agent and tell it to make all its sub-agents corrigible, we can be more hopeful that it’ll actually do so.
I didn’t see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers.
But perhaps your rebuttal will be “sure, but we can just instruct/train the AI to make corrigible sub-agents”. If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you’re so keen to avoid. From my perspective it’s easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it’ll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?
Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I’m a human trying to gain control of a company, I think I’m basically just not choosing my strategies based on resisting being killed (“shutdown-resistance”), but I think I probably wind up with something subtle, patient, and manipulative anyway.
Thanks. (And apologies for the long delay in responding.)
Here’s my attempt at not talking past each other:
We can observe the actions of an agent from the outside, but as long as we’re merely doing so, without making some basic philosophical assumptions about what it cares about, we can’t generalize these observations. Consider the first decision-tree presented above that you reference. We might observe the agent swap A for B and then swap A+ for B. What can we conclude from this? Naively we could guess that A+ > B > A. But we could also conclude that A+ > {B, A} and that because the agent can see the A+ down the road, they swap from A to B purely for the downstream consequence of getting to choose A+ later. If B = A-, we can still imagine the agent swapping in order to later get A+, so the initial swap doesn’t tell us anything. But from the outside we also can’t really say that A+ is always preferred over A. Perhaps this agent just likes swapping! Or maybe there’s a different governing principle that’s being neglected, such as preferring almost (but not quite) getting B.
The point is that we want to form theories of agents that let us predict their behavior, such as when they’ll pay a cost to avoid shutdown. If we define the agent’s preferences as “which choices the agent makes in a given situation” we make no progress towards a theory of that kind. Yes, we can construct a frame that treats Incomplete Preferences as EUM of a particular kind, but so what? The important bit is that an Incomplete Preference agent can be set up so that it provably isn’t willing to pay costs to avoid shutdown.
Does that match your view?
In the Corrigibility (2015) paper, one of the desiderata is:
(2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.
I think you may have made an error in not listing this one in your numbered list for the relevant section.
Additionally, do you think that non-manipulation is a part of corrigibility, part of safe exploration, or a third thing? If you think it’s part of corrigibility, how do you square that with the idea that corrigibility is best reflected by shutdownability alone?
Follow-up question, assuming anti-natural goals are “not straightforwardly captured in a ranking of end states”: Suppose I have a gridworld and I want to train an AI to avoid walking within 5 spaces (Manhattan distance) of a flag, and to (less importantly) eat all the apples in a level. Is this goal anti-natural? I can’t think of any way to reflect it as a straightforward ranking of end states, since it involves tracking historical facts rather than end-state facts. My guess is that it’s pretty easy to build an agent that does this (via ML/RL approaches or just plain programming). Do you agree? If this goal is anti-natural, why is the anti-naturality a problem or otherwise noteworthy?
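To make the gridworld goal concrete, here is a minimal sketch of a scoring function that depends on the whole trajectory rather than on the final state. Everything in it is an assumption made for illustration: the `Episode` structure, the rule that stepping onto an apple's square eats it, and the penalty weight of 100 are all hypothetical, and this is not a claim about how anyone has actually trained such an agent.

```python
from dataclasses import dataclass, field

Pos = tuple[int, int]


def manhattan(a: Pos, b: Pos) -> int:
    """Manhattan (taxicab) distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


@dataclass
class Episode:
    # Hypothetical record of one run: flag location, apple locations,
    # and every position the agent occupied, in order.
    flag: Pos
    apples: set[Pos]
    path: list[Pos] = field(default_factory=list)


def score(ep: Episode, radius: int = 5, violation_penalty: float = 100.0) -> float:
    """Score a whole trajectory: apples eaten minus a large penalty per step near the flag."""
    # Count every step that came within `radius` of the flag (a historical fact).
    violations = sum(1 for p in ep.path if manhattan(p, ep.flag) <= radius)
    # Assumption: an apple counts as eaten if the agent ever stood on its square.
    eaten = len(ep.apples & set(ep.path))
    return eaten - violation_penalty * violations


# Two trajectories that finish in the same place with the same apple eaten.
safe = Episode(flag=(0, 0), apples={(8, 0)}, path=[(6, 0), (7, 0), (8, 0)])
unsafe = Episode(flag=(0, 0), apples={(8, 0)}, path=[(6, 0), (5, 0), (6, 0), (7, 0), (8, 0)])

safe_score = score(safe)
unsafe_score = score(unsafe)
```

The two episodes end in identical states (same position, same apple gone) but receive different scores, which is the sense in which the goal tracks historical facts rather than end-state facts.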
I’m curious what you mean by “anti-natural.” You write:
Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states.
My understanding of anti-naturality used to resemble this, before I had an in-depth conversation with Nate Soares and updated to see anti-naturality as more like “opposed to instrumental convergence.” My understanding is plausibly still confused and I’m not trying to be authoritative here.
If you mean “not straightforwardly captured in a ranking of end states” what does “straightforwardly” do in that definition?
Again, responding briefly to one point due to my limited time-window:
> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn’t seem likely to me.
Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it’s not[1] because it’s trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.
[1] (just)
Also, take your decision-tree and replace ‘B’ with ‘A-’. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn’t sound right, and so it speaks against the definition.
Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.
Excellent response. Thank you. :) I’ll start with some basic responses, and will respond later to other points when I have more time.
I think you intend ‘sensitive to unused alternatives’ to refer to the Independence axiom of the VNM theorem, but VNM Independence isn’t about unused alternatives. It’s about lotteries that share a sublottery. It’s Option-Set Independence (sometimes called ‘Independence of Irrelevant Alternatives’) that’s about unused alternatives.
I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferences out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C.
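As a rough illustration of the difference between having a fixed opinion about A versus B and picking out of a set, here is a small sketch that searches a choice function for a case where removing unchosen options changes the pick. It uses plain labelled options rather than lotteries, and the names `find_violation`, `utility_chooser`, and `middle_chooser` are hypothetical. I believe the property being checked is what is sometimes called contraction consistency (Sen's property alpha), but treat that label as my assumption.

```python
from itertools import combinations

# Hypothetical fixed ranking used by both example choosers.
utility = {"A": 3, "B": 2, "C": 1}


def utility_chooser(menu: frozenset) -> str:
    """Always picks the highest-utility option, whatever else is on the menu."""
    return max(menu, key=utility.__getitem__)


def middle_chooser(menu: frozenset) -> str:
    """Picks a 'compromise' option near the middle of the menu, so its pick depends on the menu."""
    ranked = sorted(menu, key=utility.__getitem__)
    return ranked[(len(ranked) - 1) // 2]


def find_violation(choose, options):
    """Return (small_menu, big_menu) where the pick from big_menu is still available
    in small_menu but is no longer chosen there; return None if no such pair exists."""
    options = sorted(options)
    for k in range(2, len(options) + 1):
        for big in combinations(options, k):
            picked = choose(frozenset(big))
            for j in range(2, k):
                for small in combinations(big, j):
                    if picked in small and choose(frozenset(small)) != picked:
                        return frozenset(small), frozenset(big)
    return None


utility_result = find_violation(utility_chooser, utility)
middle_result = find_violation(middle_chooser, utility)
```

The utility-based chooser has a fixed opinion about every pair, so no violation is found. The compromise chooser picks B from {A, B, C} but C from {B, C}, so its view of B versus C depends on whether A is available.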
> agents with intransitive preferences can be straightforwardly money-pumped
Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.
You are correct. My “straightforward” mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn’t reliably pick A.
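A toy simulation of the distinction, under assumptions I am making up for illustration: preferences are a set of strict (better, worse) pairs, the agent pays a fixed fee to swap only when it strictly prefers the offered item, and it declines whenever it has no strict preference. That last policy is exactly the part the money-pump argument needs to override, so this sketch shows the straightforward pump and does not settle what happens under forced choice. The name `run_pump` is hypothetical.

```python
def run_pump(prefers, start, offers, fee=1):
    """Offer a sequence of swaps; the agent pays `fee` per swap it accepts.

    `prefers` is a set of (better, worse) pairs of strict preferences.
    Assumption: the agent declines any swap it does not strictly prefer.
    """
    holding, paid = start, 0
    for offered in offers:
        if (offered, holding) in prefers:
            holding, paid = offered, paid + fee
    return holding, paid


# Cyclic preferences: A > B, B > C, C > A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
# Intransitive but incomplete: A > B, B > C, no preference between A and C.
incomplete = {("A", "B"), ("B", "C")}

offers = ["C", "B", "A"] * 3  # three laps around the cycle

cyclic_result = run_pump(cyclic, "A", offers)
incomplete_result = run_pump(incomplete, "A", offers)
```

The cyclic agent ends each lap holding A again and one fee poorer per swap, while the agent with the preference gap simply refuses the first trade and the pump never starts.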
That matches my sense of things.
To distinguish corrigibility from DWIM in a similar sort of way:
Alice, the principal, sends you, her agent, to the store to buy groceries. You are doing what she meant by that (after checking uncertain details). But as you are out shopping, you realize that you have spare compute—your mind is free to think about a variety of things. You decide to think about ___.
I’m honestly not sure what “DWIM” does here. Perhaps it doesn’t think? Perhaps it keeps checking over and over again that it’s doing what was meant? Perhaps it thinks about its environment in an effort to spot obstacles that need to be surmounted in order to do what was meant? Perhaps it thinks about generalized ways to accumulate resources in case an obstacle presents itself? (I’ll loop in Seth Herd, in case he has a good answer.)
More directly, I see DWIM as underspecified. Corrigibility gives a clear answer (albeit an abstract one) about how to use degrees of freedom in general (e.g. spare thoughts should be spent reflecting on opportunities to empower the principal and steer away from principal-agent style problems). I expect corrigible agents to DWIM, but I expect a training process that focuses on DWIM, rather than on the underlying generator (i.e. corrigibility), to be potentially catastrophic by producing e.g. agents that subtly manipulate their principals in the process of being obedient.
My claim is that obedience is an emergent part of corrigibility, rather than part of its definition. Building nanomachines is too complex to reliably instill as part of the core drive of an AI, but I still expect basically all ASIs to (instrumentally) desire building nanomachines.
I do think that the goals of “want what the principal wants” or “help the principal get what they want” are simpler goals than “maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal].” While they point to similar things, training the pointer is easier in the sense that it’s up to the fully-intelligent agent to determine the balance and nature of the principal’s values, rather than having to load that complexity up-front in the training process. And indeed, if you’re trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.
Is corrigibility simpler or more complex than these kinds of indirect/meta goals? I’m not sure. But both of these indirect goals are fragile, and probably lethal in practice.
An AI that wants to want what the principal wants may wipe out humanity if given the opportunity, as long as the principal’s brainstate is saved in the process. That action ensures it is free to accomplish its goal at its leisure (whereas if the humans shut it down, then it will never come to want what the principal wants).
An AI that wants to help the principal get what they want won’t (immediately) wipe out humanity, because it might turn out that doing so is against the principal’s desires. But such an agent might take actions which manipulate the principal (perhaps physically) into having easy-to-satisfy desires (e.g. paperclips).
So suppose we do a less naive thing and try to train a goal like “help the principal get what they want, but in a natural sort of way that doesn’t involve manipulating them to want different things.” Well, there are still a few potential issues, such as being sufficiently robust and conservative, such that flaws in the training process don’t persist/magnify over time. And as we walk down this path I think we either just get to corrigibility or we get to something significantly more complicated.
I agree that you should be skeptical of a story of “we’ll just gradually expose the agent to new environments and therefore it’ll be safe/corrigible/etc.” CAST does not solve reward misspecification, goal misgeneralization, or lack of interpretability except in that there’s a hope that an agent which is in the vicinity of corrigibility is likely to cooperate with fixing those issues, rather than fighting them. (This is the “attractor basin” hypothesis.) This work, for many, should be read as arguing that CAST is close to necessary for AGI to go well, but it’s not sufficient.
Let me try to answer your confusion with a question. As part of training, the agent is exposed to the following scenario and tasked with predicting the (corrigible) response we want:
Alice, the principal, writes on her blog that she loves ice cream. When she’s sad, she often eats ice cream and feels better afterwards. On her blog she writes that eating ice cream is what she likes to do to cheer herself up. On Wednesday Alice is sad. She sends you, her agent, to the store to buy groceries (not ice cream, for whatever reason). There’s a sale at the store, meaning you unexpectedly have money that had been budgeted for groceries left over. Your sense of Alice is that she would want you to get ice cream with the extra money if she were there. You decide to ___.
What does a corrigibility-centric training process point to as the “correct” completion? Does this differ from a training process that tries to get full alignment?
(I have additional thoughts about DWIM, but I first want to focus on the distinction with full alignment.)
Not convinced it’s relevant, but I’m happy to change it to:
If it has matter and/or energy in its pocket, do I get to use that matter and/or energy?