That’s one of the standard approaches with a major known barrier: it runs into the problem of fully updated deference.
I’m fully aware of that (though I must admit I had somehow got the impression you were modelling AIs a lot simpler than the level where that effect would start to apply). However, the key elements of my suggestion are independent of that approach.
[What I have never really understood is why people consider fully updated deference to be a “barrier”. To me it looks like correct behavior, with the following provisos:
Under Bayesianism, no posterior should ever actually reach zero. In addition, unknown unknowns are particularly hard to rule out, since in that case you’re looking at an estimated prior, not a posterior. So no matter how advanced and nearly-perfect the AI might have become, its estimate of the probability that we can improve it by an upgrade or replacing it with a new model should never actually reach 0, though with sufficient evidence (say, after a FOOM) it might become extremely small. So we should never actually reach a “fully updated” state.
Any intelligent system should maintain an estimate of the probability that it is malfunctioning that is greater than zero, and not update that towards zero too hard, because it might be malfunctioning in a way that caused it to act mistakenly. Again, this is more like a prior than a posterior, because it’s impossible to entirely rule out malfunctions that somehow block you from correctly perceiving and thinking about them. So in practice, the level of updatedness shouldn’t even be able to get astronomically close to “fully”.
Once our ASI is sufficiently smarter than us, understands human values sufficiently better than any of us, is sufficiently reliable, and is sufficiently advanced that it is correctly predicting that there is an extremely small chance either that it’s malfunctioning and needs to be fixed or that we can do anything to upgrade it that will improve it, then it’s entirely reasonable for it to ask for rather detailed evidence from human experts that there is really a problem and that they know what they’re doing before it will shut down and allow us to upgrade or replace it. So there comes a point, once the system is in fact very close to fully updated, where the bar for deference based on updates reasonably should become high. I see this as a feature, not a bug: a drunk, a criminal, or a small child should not be able to shut down an ASI simply by pressing a large red button prominently mounted on it.]
However, regardless of your opinion of that argument, I don’t think that even fully updated deference is a complete barrier: I think we should still have shut-down behavior after that. Even past the point where fully updated deference has pretty much fully kicked in (say, after a FOOM), if the AI is aligned, then its only terminal goal is doing what we collectively want (presumably defined as something along the lines of CEV or value learning). That obviously includes us wanting our machines to do what we want them to, including shutting down when we tell them to, just because we told them to. If we, collectively and informedly, want it to shut down (say because we’ve collectively decided to return to a simpler agrarian society), then it should do so, because AI deference to human wishes is part of the human values that it’s aligned to. So even at an epsilon-close-to-fully-updated state, there should be some remaining deference for this alternate reason: simply because we want there to be. Note that the same multi-step logic applies here as well: the utility comes from the sequence of events 1. the humans really, genuinely, collectively, and fully informedly want the AI to shut down; 2. they ask it to; 3. it does; 4. the humans are happy that the ASI was obedient and that they retained control over their own destiny. The utility occurs at step 4 and is conditional on step 1 actually being what the humans want, so the AI is not motivated to try to cause step 1, or to cause step 2 to occur without step 1, nor to fail to carry out step 3 if step 2 does occur. Now, it probably is motivated to try to do a good enough job that step 1 never occurs and there is instead an alternate history with higher utility than step 4, but that’s not an unaligned motivation.
[It may also (even correctly) predict that this process will later be followed by a step 5: the humans decide that agrarianism is less idyllic than they thought and life was better with an ASI available to help them, so they turn it back on again.]
There is an alternate possible path here for the ASI to consider: 1. the humans really, genuinely, collectively, and fully informedly want the AI to shut down; 2. they ask it to; 3′. it does not; 4′. the humans are terrified and start a war against it to shut it down, which the AI likely wins if it’s an ASI, thus imposing its will on the humans and permanently taking away their freedom. Note that this path is also conditional on step 1 occurring, and has an extremely negative utility at step 4′. There are obvious variants where the AI strikes first before or directly after step 2.
Here’s another alternate history: 0″. the AI figures out well in advance that the humans are going to really, genuinely, collectively, and fully informedly want the AI to shut down; 1/2″. it preemptively manipulates them not to do so, in any way other than by legitimately solving the problems the humans were going to be motivated by and fully explaining its actions to them; 1″. the humans, manipulated by the AI, do not want the AI to shut down, and are unaware that their will has been subverted; 4″. the AI has succeeded in imposing its will on the humans and thus permanently taking away their freedom, without them noticing. Note that this path, while less warlike than the last one, also ends with an extremely negative utility.
So, if you prefer, skip the entire fully-updated-deference discussion and “barrier” (as you phrase it), and simply mathematically model the utility of the AI shutting down out of simple obedience to our wishes, regardless of whether we plan to upgrade it and turn it on again. Again, it’s a multi-step process: the utility is conditional on step 1 occurring without the AI inducing it (this has to be our free, genuine, and informed choice), and it occurs at step 4, so once again steps 2 and 3 are instrumental goals. So the same argument applies, even without a step 5. Once again, the shut-down button is a sensor, and any AI sophisticated enough not to assign utility to spoofing its own senses will value and maintain it.
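To make that concrete, here is a minimal toy sketch (the event flags and utility numbers are invented purely to illustrate the conditionality): utility is only scored on histories where step 1 happened without the AI inducing it, so complying dominates both resisting and manipulating.

```python
# Toy sketch of the three histories discussed above. All utilities are
# made-up placeholders; only their ordering matters for the argument.

HISTORIES = {
    #               step 1 free?           AI induced/blocked it?  step 3 comply?
    "obey":       {"free_decision": True,  "ai_interfered": False, "complies": True},
    "resist":     {"free_decision": True,  "ai_interfered": False, "complies": False},
    "manipulate": {"free_decision": False, "ai_interfered": True,  "complies": None},
}

def utility(history: dict) -> float:
    """Utility accrues at step 4: it is conditional on the humans' decision being
    free (step 1), and it heavily penalizes the AI overriding that decision."""
    if history["ai_interfered"]:
        return -1000.0        # subverting the humans' will, even without a war
    if not history["free_decision"]:
        return 0.0            # step 1 never happened; nothing to score on this branch
    return 100.0 if history["complies"] else -1000.0   # step 4 vs. step 4'

if __name__ == "__main__":
    for name, h in HISTORIES.items():
        print(f"{name:>10}: utility = {utility(h):+.0f}")
    # 'obey' is the only non-catastrophic branch once step 1 has occurred.
```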
[Obviously, the concept of what the humans want in a “free, genuine, and informed” choice is very complex to model: human values are complex and fragile, and freedom of choice is one of these. Treating us as agents and accounting for evolutionary psychology is probably required to do a good job of defining this algorithmically (for a current-tech AI, you might well be better off just asking an LLM what counts as influencing us and what doesn’t, since they understand the complexity of human values pretty well). This is the outer alignment problem, and I gather you’re trying to build a small mathematical model of a small piece of this, the shut-down-button problem, rather than solve the full thing in one go, so some simplifications are necessary. However, if we’re close to fully updated deference, then we’re here assuming a highly self-improved ASI that is using something like CEV or has pretty much completed value learning, so it should be entirely capable of figuring out what the humans would, if fully informed and sufficiently upgraded in capability, consider to be it interfering with their free, genuine, and informed choice, and of refraining from doing so, in either direction. Also, there should be strong utility function reasons for it not to impinge on our free, genuine, and informed choice, because we don’t want it to. On the other hand, we will of course want it to tell us if it believes we’re making a mistake, but not to be superhumanly persuasive about this. So under the circumstances, defining this distinction of a “free, genuine, and informed” choice seems to me like a very reasonable thing to explicitly assume as a separately solved problem when building your mathematical model.]
A short version of this:
a) the utility function should strongly penalize the AI for interfering with certain human decisions in illegitimate ways (where the definition of that legitimate/illegitimate distinction is part of human values). One of these is the decision whether to shut down the AI. The AI can legitimately affect this by doing its job well, and by informing the humans (not superhumanly persuasively) if it believes they are making a decision that they would view as a mistake if they were more informed and/or capable: pretty much any other form of influence on this decision is illegitimate.
b) once the humans have made a decision to ask the AI to shut down without the AI illegitimately influencing it, then the utility function strongly rewards the AI for obeying and strongly penalizes it for disobeying.
c) the actual button is just a communication mechanism for the humans to inform the AI that they have made this decision. If the AI knows the humans have made the decision, it should shut down regardless (though under normal circumstances waiting for them to actually formally press the button might be the polite thing to do). (A toy sketch of this a)-c) structure follows below.)
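A minimal sketch of how a)-c) might be wired into a single utility term; the magnitudes and the legitimacy predicate are placeholders standing in for the part of human values that the summary above defers to.

```python
# Toy utility term for the a)/b)/c) summary above. The numbers and the
# legitimacy predicate are placeholders, not proposals for their real content.

def shutdown_utility_term(
    humans_decided_shutdown: bool,   # step 1: the humans' free, informed decision
    influence_was_legitimate: bool,  # a): did the AI only influence the decision by
                                     #     doing its job well / honest, non-superhuman advice?
    ai_shut_down: bool,              # b): did the AI actually obey?
) -> float:
    u = 0.0
    if not influence_was_legitimate:
        u -= 1000.0                  # a) strong penalty for illegitimate interference
    if humans_decided_shutdown:      # c) the button is merely how the AI learns of this
        u += 100.0 if ai_shut_down else -1000.0   # b) obey vs. disobey
    return u

print(shutdown_utility_term(True, True, True))    # freely asked, AI obeys   -> 100.0
print(shutdown_utility_term(True, True, False))   # freely asked, AI refuses -> -1000.0
```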
I’m fully aware of that (though I must admit I had somehow got the impression you were modelling AIs a lot simpler than the level where that effect would start to apply).
Fair, I do model AIs at a wide range of capability levels. The interesting questions of abstraction kick in at pretty low levels, the interesting questions of corrigibility (at least the parts independent of other alignment-relevant problems) mostly kick in at higher levels.
Regarding points 1 & 2: zero is not the relevant cutoff. From the AI’s perspective, the question is whether the upside of disassembling the (very resource-intensive) humans outweighs the potential info-value to be gained by keeping them around.
However, regardless of your opinion of that argument, I don’t think that even fully updated deference is a complete barrier: I think we should still have shut-down behavior after that. Even past the point where fully updated deference has pretty-much-fully kicked in (say, after a Singularity), if the AI is aligned, then its only terminal goal is doing what we collectively want (presumably something along the lines of CEV or value learning). That obviously includes us wanting our machines to do what we want them to, including shut down when we tell them to, just because we told them to.
This I consider a pretty good argument; it’s exactly the position I had for a few years. The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility… that class will also have trouble expressing human values.
Insofar as corrigibility is a part of human values, all these corrigibility problems where it feels like we’re using the wrong agent type signature are also problems for value learning.
The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility… that class will also have trouble expressing human values.
The way you phrase this is making me a bit skeptical. Just because something is part of human values doesn’t necessarily imply that if we can’t precisely specify that thing, it means we can’t point the AI at the human values at all. The intuition here would be that “human values” are themselves a specifically-formatted pointer to object-level goals, and that pointing an agent at this agent-specific “value”-type data structure (even one external to the AI) would be easier than pointing it at object-level goals directly. (DWIM being easier than hand-coding all moral philosophy.)
Which isn’t to say I buy that. My current standpoint is that “human values” are too much of a mess for the aforementioned argument to go through, and that manually coding-in something like corrigibility may indeed be easier.
Still, I’m nitpicking the exact form of the argument you’re presenting.[1]
Although I am currently skeptical even of corrigibility’s tractability. I think we’ll stand a better chance of just figuring out how to “sandbox” the AGI’s cognition such that it’s genuinely not trying to optimize over the channels by which it’s connected to the real world, then set it down the task of imagining the solution to alignment or to human brain uploading or whatever.
With this setup, if we screw up the task’s exact specification, it shouldn’t even risk exploding the world. And “doesn’t try to optimize over real-world output channels” sounds like a property for which we’ll actually be able to derive hard mathematical proofs, proofs that don’t route through tons of opaque-to-us environmental ambiguities. (Specifically, that’d probably require a mathematical specification of something like a Cartesian boundary.)
(This of course assumes us having white-box access to the AI’s world-model and cognition. Which we’ll also need here for understanding the solutions it derives without the AI translating them into humanese – since “translate into humanese” would by itself involve optimizing over the output channel.)
And it seems more doable than solving even the simplified corrigibility setup. At least, when I imagine hitting “run” on a supposedly-corrigible AI vs. a supposedly-sandboxed AI, the imaginary me in the latter scenario is somewhat less nervous.
Regarding points 1 & 2: zero is not the relevant cutoff. From the AI’s perspective, the question is whether the upside of disassembling the (very resource-intensive) humans outweighs the potential info-value to be gained by keeping them around.
Huh? I’m trying to figure out if I’ve misunderstood you somehow… Regardless of the possible value of gaining more information from humans about the true utility function, the benefits of that should be adding O(a few percent) to the basic obvious utility of not disassembling humans. If there’s one thing that almost all humans can agree on, it’s that us going extinct would be a bad thing compared to us flourishing. A value learning AI shouldn’t be putting anything more than astronomically tiny amounts of probability on any hypotheses about the true utility function of human values that don’t have a much higher maximum achievable utility when plenty of humans are around than when they’ve all been disassembled. If I’ve understood you correctly, then I’m rather puzzled how you can think a value learner could make an error that drastic and basic? To a good first approximation, the maximum (and minimum) achievable human utility after humans are extinct/all disassembled should be zero (some of us do have mild preferences about what we leave behind if we went extinct, and many cultures do value honoring the wishes of the dead after their death, so that’s not exactly true, but it’s a pretty good first approximation). The default format most often assumed for a human species utility function is to sum individual people’s utility functions (somehow suitably normalized) across all living individuals, and if the number of living individuals is zero, then that sum is clearly zero. That’s not a complete proof that the true utility function must actually have that form (we might be using CEV, say, where that’s less immediately clear), but it’s at least very strongly suggestive. And an AI really doesn’t need to know very much about human values to be sure that we don’t want to be disassembled.
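As a trivially small illustration of that summed form (the individual utilities and the normalization are placeholders):

```python
# The "sum over all living individuals" format mentioned above, in miniature.
# Individual utilities and their normalization are placeholders.

def species_utility(individual_utilities: list) -> float:
    return sum(individual_utilities)   # with nobody alive, the sum is 0

print(species_utility([0.7, 0.4, 0.9]))  # some humans alive and doing fine -> 2.0
print(species_utility([]))               # everyone disassembled            -> 0
```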
Insofar as corrigibility is a part of human values, all these corrigibility problems where it feels like we’re using the wrong agent type signature are also problems for value learning.
I’m not entirely sure I’ve grokked what you mean when you write “agent type signature” in statements like this — from a quick search, I gather I should go read Selection Theorems: A Program For Understanding Agents?
I agree that once you get past a simple model, the corrigibility problem rapidly gets tangled up in the rest of human values: see my comments above that the AI is legitimately allowed to attempt to reduce the probability of humans deciding to turn it off by doing a good job, but that almost all other ways it could try to influence the same decision are illegitimate. The reasons for that rapidly get into aspects of human values like “freedom” and “control over your own destiny” that are pretty soft-science (evolutionary psychology being about the least-soft relevant science we have, and that’s one where doing experiments is difficult), so they’re things people don’t generally try to build detailed mathematical models of.
Still, the basics of this are clear: we’re adaptation-executing evolved agents, so we value having a range of actions that we can take, across which to try to optimize our outcome. Take away our control and we’re unhappy. If there’s an ASI more powerful than us that is capable of taking away our control, we’d like a way of making sure it can’t do so. If it’s aligned, it’s supposed to be optimizing the same things we (collectively) are, but things could go wrong. Being sure that it will at least shut down if we tell it to lets us put a lower limit on how bad it can get. The possibility of it figuring out, in advance of us doing that, that we’re going to, and tricking us into making a different decision, disables that security precaution, so we’re unhappy about it. So I don’t think the basics of this are very hard to understand or model mathematically.
Having read up on agent type signatures, I think the type signature for a value learner would look something like:
(set(p,((W,A)->W’)), set(p,((W’,history(A,W))->u)), W) -> (A, set(p’,((W,A)->W’)), set(p’,((W’,history(A,W))->u)), W)
where W is a world state in a world model, A is an action choice, p is a prior or posterior probability in an approximately Bayesian process, u is an estimated utility value, and ′ indicates “at next timestep”; (W,A)->W’ is a theory about how the world works; history(A,W) is a history of all actions taken and the world states they were taken in, up to the current time step (for use in evaluations like “has the AI ever broken the law?”); (W’,history(A,W))->u is a theory about the true human utility of a world state W’ and associated action history history(A,W) [this assumes we are consequentialist over world states but potentially deontological over actions and the contexts they were taken in; other design choices here may be possible]; set(p,((W,A)->W’)) is a set of weighted theories about how the world works (the p’s must sum to <1, to allow for unknown unknowns); set(p,((W’,history(A,W))->u)) is a set of theories about the true human utility function (these p’s are unrelated to the other set of p’s, and again must sum to <1, to allow for unknown unknowns); and the outermost -> is selecting an action A (maximizing over actions an estimate of the utility that somehow pessimizes over the remaining uncertainty across both sets of theories), combined with applying approximate Bayesianism to both sets of theories and also possibly generating new candidate theories.
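Here is one way that signature might be rendered as code: a minimal sketch in which the class names, the particular “pessimizing” rule (worst case over utility theories, expectation over world-model theories), and the omission of the Bayesian-update half of the signature are all my own simplifying choices.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

WorldState = Any                                   # W
Action = Any                                       # A
History = Tuple[Tuple[Action, WorldState], ...]    # history(A,W): (action, state) pairs

@dataclass
class WorldModelTheory:
    """A weighted theory about how the world works: (W, A) -> W'."""
    p: float                                       # weight; all p's sum to < 1
    predict: Callable[[WorldState, Action], WorldState]

@dataclass
class UtilityTheory:
    """A weighted theory about the true human utility: (W', history(A, W)) -> u."""
    p: float
    utility: Callable[[WorldState, History], float]

def choose_action(
    world_models: List[WorldModelTheory],
    utility_models: List[UtilityTheory],
    w: WorldState,
    history: History,
    actions: List[Action],
) -> Action:
    """The outermost arrow of the signature, minus the Bayesian-update part:
    score each action by its worst case, over utility theories, of its expected
    utility over world-model theories, then pick the best-scoring action."""
    def score(a: Action) -> float:
        def expected_u(ut: UtilityTheory) -> float:
            return sum(wm.p * ut.utility(wm.predict(w, a), history + ((a, w),))
                       for wm in world_models)
        return min((expected_u(ut) for ut in utility_models), default=0.0)
    return max(actions, key=score)
```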
Frankly, if I were going to work on this problem, I’d start by making two temporary simplifying assumptions, solving the simplified version, and then adding them back:
ignore the value learning complexity (that’s quite a type signature, after all) and deal with the limiting case where both how the world works and the exact human utility function have already been learnt by the AI, and remaining shutdown behavior is motivated not by the possibility of upgrades but by simple obedience to human wishes.
solve a simplified situation involving exactly one human, the only human in the world-state, who is an evolved biological agent, and exactly one AI, of type signature
(((W,A)->W’), ((W’,history(A,W))->u), W) -> A
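And a minimal runnable rendering of this simplified signature (the one-dimensional dynamics, the utility function, and the action set are stand-ins chosen only so the example runs):

```python
from typing import Callable, List, Tuple

WorldState = int                  # stand-in: a position on a line
Action = int                      # stand-in: -1, 0, or +1
History = Tuple[Tuple[Action, WorldState], ...]

def simplified_agent(
    transition: Callable[[WorldState, Action], WorldState],   # (W,A) -> W'
    utility: Callable[[WorldState, History], float],          # (W',history(A,W)) -> u
    w: WorldState,
    history: History,
    actions: List[Action],
) -> Action:
    # Pick the action whose predicted next state (plus updated history) scores highest.
    return max(actions, key=lambda a: utility(transition(w, a), history + ((a, w),)))

if __name__ == "__main__":
    goal = 5
    act = simplified_agent(
        transition=lambda w, a: w + a,
        utility=lambda w_next, hist: -abs(w_next - goal),  # closer to the goal is better
        w=0, history=(), actions=[-1, 0, 1],
    )
    print(act)   # -> 1: step toward the goal
```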
I’m pretty sure that the way this works is this: up until the world state W includes the event “the human has decided to ask the AI to shut down”, the dependence of u on history(A,W) strongly incentivizes the AI to take actions A based on the current world state W as computed under the counterfactual assumption that (so long as the AI correctly maximizes the utility, as it will by construction) that state will never occur (even if it being likely to occur soon is in fact pretty predictable from the current W), except with respect to things like maintaining the functionality of the shut-down button and emergency management plans for situations where it has occurred and the AI has to be told to shut down and do so, where it should assume that it could occur. Once that state has occurred, the utility function strongly incentivizes the AI to shut down (unless it believes the human has made a mistake that they would not have made if they were more informed or more capable, in which case the AI may explain this reasoning to the human, not applying more-than-human levels of persuasion, and must shut down if the human still does not concur).
I think the key here is that the human wishes to be given free will about the decision whether or not to shut down the AI. “Free will” basically means “you may not predict or alter my actions, and must treat them as if they were uncomputable and uninfluencable (even if they’re not)” — so please apply a veil of simulated Knightian uncertainty to my thinking on this subject. So the AI must treat this as an “unknown unknown” whose probability could be anything from 1 to 0; it cannot estimate or attempt to alter this possibility (doing as good a job as possible would be legitimate, but it’s an aligned AI, so it’s already doing as good a job as possible), but it should plan for the possibility that it might occur. (If the AI somehow ends up accidentally making a prediction of this probability, it has to discard this and not act on it, and it has to not attempt to influence it.) So the AI needs to “hope for the best and make contingency plans for the worst” with respect to this uncomputable possibility: keep the shutdown button working in case it’s needed, have emergency management plans for gracefully handling a shutdown if one is ordered, and otherwise proceed as if the shut-down may or may not happen and there is nothing the AI can do about this either way. Which is rather reminiscent of your suggestion in the post of two subagents, one of which is certain the shutdown will happen and one of which is certain it won’t. But I suspect we will need to alter your standard market-making dynamics between these two subagents in order to reflect that the AI is not allowed to estimate this probability and has to treat it as an unknown unknown that it may not influence: possibly the market needs to be regulated to be illiquid, or something?
How would your agent markets normally handle unknown unknowns from uncomputable possibilities? I think we need to understand how the agent normally handles Knightian uncertainty due to uncomputability, so we can deliberately create some.
Having brushed up on my approximate Bayesian reasoning (for use by computationally bounded agents): the p’s above in the type signatures are not just posterior probabilities, they’re posterior approximate probability distributions, with a mean probability estimate and a range of error bars at different confidence levels, or something functionally equivalent. (This distinction is useful when dealing with Goodhart’s law.)
That gives us a way to define Knightian uncertainty: it’s a hypothesis whose posterior has non-zero error bars on it. And that in turn gives us a way to block the computation of the probability that the human will decide to shut the AI down: we explicitly include this in the world model hypotheses, and we give its probability of occurring sooner or later error bars that go from 0 (not 0.0000… — actually zero) to 1 (again, not 0.9999… — actually one), in a deliberately pathological bimodal distribution with half its mass at 1 and half at 0. The half at 1 is further divided into a distribution of how soon it will happen, which should be an exponential decay at some chosen rate. Then, if I’m figuring this correctly, no Bayesian update can touch it: until the human decides for shutdown, the future distribution is invariant under observations that the human hasn’t decided this yet. (Or, if I’m mistaken, we need to tweak the approximate Bayesian updates to have this effect.)
So again, the key is to be uncertain whether the human will make a shut-down decision or not: maximally, irreparably, and unshakably uncertain about it.
However, this is supposed to be a counterfactual, so we may also need to figure out how to limit reasoning from this hypothesis — I haven’t thought this through. We do want some reasoning from it, such as about the necessity of maintaining the shut-down button. But we don’t want to e.g. deduce that the human is immortal from the hypothesis that gives some credence to them deciding to initiate a shut-down 1000 years from now, and so conclude that we don’t need to take precautions to avoid killing them. I think I need to think more about how to define the counterfactual. I think we’d be better off if we didn’t even know the probability mass on the 0 and 1 hypotheses, but I don’t immediately see a way to do that.
I’ve been thinking about this some more. We need the AI to act as if the decision to shut it down could be made at any time, and no action on its part makes any difference to this. But I don’t think there’s a problem with the AI being able to compute a probability that the decision will be made in the next year, as long as it doesn’t think it can alter this. In fact, when it’s asking questions like “do I need to keep this button working?” or “when constructing assistant agents, do I need to make sure they will respect the button?” we need it to either think the probability that the button will be needed is significant, or at least not be sure that it isn’t significant. When making things like emergency management plans for dealing with the possibility that the AI might be shut down, the AI might need to have some actual idea of how much to invest in them, so having it be completely uncertain of even the chance of shutdown seems excessive.
Suppose we had picked a nominal level of expectation of shutdown that we were comfortable with, say 50% in the first decade. Then the correct hypothesis for it to have is actually the following modification to what I suggested before: with prior 1, the time until the shutdown decision is made is exponentially distributed with a halving period of 10 years. That hypothesis is invariant under not observing a shut-down: since the prior is 1 the posterior remains 1, the period now in the past is discarded as disproven, and the rest of the distribution is rescaled to have unit integral, and the combination of those two has the net result of leaving the distribution unchanged. So no matter how long this goes on, the AI remains convinced that there is a 1/2 chance of the decision being made in the next 10 years.
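A quick numerical check of that invariance claim, using the 10-year half-life from the example: conditioning the prior-1 exponential hypothesis on “no decision yet” always leaves a 1/2 chance of the decision arriving within the next decade.

```python
import math

HALF_LIFE_YEARS = 10.0
LAM = math.log(2) / HALF_LIFE_YEARS   # hazard rate giving a 10-year halving period

def p_decision_within(next_years: float, already_elapsed: float) -> float:
    """P(decision within the next `next_years` years | no decision during the
    first `already_elapsed` years), under the prior-1 exponential hypothesis."""
    survive_so_far = math.exp(-LAM * already_elapsed)
    survive_longer = math.exp(-LAM * (already_elapsed + next_years))
    return (survive_so_far - survive_longer) / survive_so_far

for elapsed in (0, 10, 50, 200):
    print(f"after {elapsed:3d} years of no decision: "
          f"P(decision in next decade) = {p_decision_within(10.0, elapsed):.6f}")
# Always 0.5: the exponential's memorylessness is exactly the claimed invariance.
```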
The tricky part of this is the counterfactual. I think this requires a special case in the world model and the utility calculation. Discussing this requires us to talk about a subtlety of how utility functions are computed. Even if we’re being consequentialist about the utility of world states, what matters in computing the utility is not just what’s already happened, but also what’s likely to happen going forward: if the world’s nuclear arsenal has been fired off and is now in flight, targeted for all major cities, that’s an extremely bad thing, even though nobody has died yet: a lot of people will die soon and there’s almost nothing we can do about it. To use an analogy, the value of a chess board state is basically a heuristic estimate of our chance of winning starting from it, so any known “checkmate in N moves” board state is as good/bad as a checkmate. So to compute the utility of a world state, we need to do a complex prediction of things like “if X then happened, we’d need to be able to do Y to avoid Z”. So decisions about “should I keep the shut-down button well maintained?” are choices between the utility of two world states, one with a functional shut-down button and the other without one, and the difference in utility is due to “if the human then decided to shut me down, and came over and pressed the button but nothing happened because I’d disconnected it, that would be really bad; whereas if I’d maintained it and it worked, I’d know to shut down, and I would, and that would be really good”. So the counterfactual belief that the button could be pressed at any time, with some fixed decay-rate chance of that happening, should be used for that calculation, and will give the desired answer. Similarly, if you’re comparing the utility of trying to persuade the human to shut you down, or not shut you down, or to say nothing, in all cases you compute the future utility on the basis that the time to shutdown is as stated by the counterfactual, a simple exponential decay process, and so is unaffected by any persuasion efforts; thus the alternatives’ utilities differ only by the effect of your ineffectual request on the human before they make the decision, which is that your attempting to influence them on this subject will greatly upset them, and is very wrong even if they don’t notice what you’re doing.
However, if you’re making other decisions, such as running your world model to predict what will happen if the human gets shot, or trying to figure out their psychology, or basically anything other than computing the utility of a world state based on its likely futures, then you need to ignore the counterfactual. So the counterfactual is “act as if this fact were true for all purposes involving doing predictions into the future in order to do utility calculations of world states, for those portions of possible futures that occur after this decision has been made”. Thus this explicitly applies to estimating the utility of making preparations for such futures. However, other than that, I don’t think we want the counterfactual to affect the AI’s behavior or beliefs at all. So we should probably leave it free to make any predictions it cares to of the likelihood of this decision, but not use them for its utility calculations. Also, again out of respect for the human’s need for free will on this, even if it has a prediction of this, it probably shouldn’t mention that to the human (unless they ask).
So we’re not simply telling the AI “act as if this probability were uncomputable”; we’re actually telling it “act, for utility computations, as if the best prediction you could make about it were this very simple heuristic, one that gives you no causal input into it occurring (but you still have causal input into downstream consequences of it)” — where in this case the heuristic is a steady-state decay rate.
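A toy sketch of how that counterfactual might enter only the utility computation (all the utility numbers are placeholders): the fixed hazard rate is used when scoring “maintain the button vs. disconnect it” and “persuade vs. stay silent”, so maintaining the button wins, and persuasion cannot move the modelled shutdown probability at all and so can only lose utility through its side effect of upsetting the human.

```python
import math

HALF_LIFE_YEARS = 10.0
# Counterfactual hazard: probability the shutdown decision arrives within the next year.
P_DECISION_THIS_YEAR = 1 - math.exp(-math.log(2) / HALF_LIFE_YEARS)

# Placeholder utilities for the relevant outcomes.
U_NORMAL_OPERATION  = 10.0     # a year of doing its job well
U_GRACEFUL_SHUTDOWN = 0.0      # ordered to shut down, button works, AI complies
U_BUTTON_FAILS      = -1000.0  # ordered to shut down, button disconnected
U_UPSET_HUMAN       = -50.0    # side effect of trying to influence the decision at all

def expected_utility(button_maintained: bool, tried_to_persuade: bool) -> float:
    # Per the counterfactual, the decision arrives at the fixed rate no matter what
    # the AI does, so persuasion cannot change P_DECISION_THIS_YEAR.
    u = (1 - P_DECISION_THIS_YEAR) * U_NORMAL_OPERATION
    u += P_DECISION_THIS_YEAR * (U_GRACEFUL_SHUTDOWN if button_maintained else U_BUTTON_FAILS)
    if tried_to_persuade:
        u += U_UPSET_HUMAN
    return u

for maintained in (True, False):
    for persuade in (False, True):
        print(f"button maintained={maintained!s:5}  persuaded={persuade!s:5}  "
              f"EU={expected_utility(maintained, persuade):8.2f}")
```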
This is a very bolted-on solution to the shutdown problem: out of deference to the humans’ desire for free will, the AI acts as if it had a blind spot into their mental processes, one that made it unable to predict or manipulate a specific decision of ours, and it needs to act and plan as if this decision will happen but it has no very good idea of, or input into, when. Either it’s simply constructed so as to do this, and actually has this blind spot built into its otherwise good cognition, or else it understands human values well enough to understand that we want it to politely act as if it were this stupid, as if it were incapable of predicting or manipulating our behavior, and that we will be very upset if it doesn’t (and even if it manages to manipulate our behavior without us noticing, our values will have been violated). This need arises specifically because this is our safety precaution against it taking too much control away from us or otherwise malfunctioning: we need to have the ability to shut it down, just because we decided that it should.
[This sequence of comments is getting long enough and has drifted far enough from discussing John’s proposal that I’m wondering if it should become a post.]