I’m fully aware of that (though I must admit I had somehow got the impression you were modelling AIs a lot simpler than the level where that effect would start to apply).
Fair, I do model AIs at a wide range of capability levels. The interesting questions of abstraction kick in at pretty low levels, the interesting questions of corrigibility (at least the parts independent of other alignment-relevant problems) mostly kick in at higher levels.
Regarding points 1 & 2: zero is not the relevant cutoff. From the AI’s perspective, the question is whether the upside of disassembling the (very resource-intensive) humans outweighs the potential info-value to be gained by keeping them around.
However, regardless of your opinion of that argument, I don’t think that even fully updated deference is a complete barrier: I think we should still have shut-down behavior after that. Even past the point where fully updated deference has pretty-much-fully kicked in (say, after a Singularity), if the AI is aligned, then its only terminal goal is doing what we collectively want (presumably something along the lines of CEV or value learning). That obviously includes us wanting our machines to do what we want them to, including shutting down when we tell them to, just because we told them to.
This I consider a pretty good argument; it’s exactly the position I had for a few years. The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility… that class will also have trouble expressing human values.
Insofar as corrigibility is a part of human values, all these corrigibility problems where it feels like we’re using the wrong agent type signature are also problems for value learning.
The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility… that class will also have trouble expressing human values.
The way you phrase this is making me a bit skeptical. Just because something is part of human values doesn’t necessarily imply that, if we can’t precisely specify that thing, we can’t point the AI at human values at all. The intuition here would be that “human values” are themselves a specifically-formatted pointer to object-level goals, and that pointing an agent at this agent-specific “value”-type data structure (even one external to the AI) would be easier than pointing it at object-level goals directly. (DWIM being easier than hand-coding all moral philosophy.)
Which isn’t to say I buy that. My current standpoint is that “human values” are too much of a mess for the aforementioned argument to go through, and that manually coding-in something like corrigibility may indeed be easier.
Still, I’m nitpicking the exact form of the argument you’re presenting.[1]
Although I am currently skeptical even of corrigibility’s tractability. I think we’ll stand a better chance of just figuring out how to “sandbox” the AGI’s cognition such that it’s genuinely not trying to optimize over the channels by which it’s connected to the real world, then setting it to the task of deriving the solution to alignment or to human brain uploading or whatever.
With this setup, if we screw up the task’s exact specification, it shouldn’t even risk exploding the world. And “doesn’t try to optimize over real-world output channels” sounds like a property for which we’ll actually be able to derive hard mathematical proofs, proofs that don’t route through tons of opaque-to-us environmental ambiguities. (Specifically, that’d probably require a mathematical specification of something like a Cartesian boundary.)
(This of course assumes us having white-box access to the AI’s world-model and cognition. Which we’ll also need here for understanding the solutions it derives without the AI translating them into humanese – since “translate into humanese” would by itself involve optimizing over the output channel.)
And it seems more doable than solving even the simplified corrigibility setup. At least, when I imagine hitting “run” on a supposedly-corrigible AI vs. a supposedly-sandboxed AI, the imaginary me in the latter scenario is somewhat less nervous.
Regarding points 1 & 2: zero is not the relevant cutoff. From the AI’s perspective, the question is whether the upside of disassembling the (very resource-intensive) humans outweighs the potential info-value to be gained by keeping them around.
Huh? I’m trying to figure out if I’ve misunderstood you somehow… Regardless of the possible value of gaining more information from humans about the true utility function, the benefits of that should add O(a few percent) to the basic obvious utility of not disassembling humans. If there’s one thing that almost all humans can agree on, it’s that us going extinct would be a bad thing compared to us flourishing. A value learning AI shouldn’t be putting anything more than astronomically tiny amounts of probability on any hypothesis about the true utility function of human values that doesn’t have a much higher maximum achievable utility when plenty of humans are around than when they’ve all been disassembled. If I’ve understood you correctly, then I’m rather puzzled how you can think a value learner could make an error that drastic and basic?

To a good first approximation, the maximum (and minimum) achievable human utility after humans are extinct/all disassembled should be zero. (Some of us do have mild preferences about what we would leave behind if we went extinct, and many cultures do value honoring the wishes of the dead after their death, so that’s not exactly true, but it’s a pretty good first approximation.) The default format most often assumed for a human species utility function is to sum individual people’s utility functions (somehow suitably normalized) across all living individuals, and if the number of living individuals is zero, then that sum is clearly zero. That’s not a complete proof that the true utility function must actually have that form (we might be using CEV, say, where that’s less immediately clear), but it’s at least very strongly suggestive. And an AI really doesn’t need to know very much about human values to be sure that we don’t want to be disassembled.
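The zero-at-extinction point can be made concrete with a toy formalization (my own sketch; the function name and numbers are illustrative, not anything from the thread):

```python
# Sketch: a species-level utility as the sum of suitably-normalized individual
# utilities over all living individuals. With zero living individuals, the
# sum is trivially zero, so any "humans flourishing" state dominates it.

def species_utility(individual_utilities: list[float]) -> float:
    """Sum of (already-normalized) per-person utilities over living people."""
    return sum(individual_utilities)

flourishing = species_utility([0.9, 0.7, 0.8])  # plenty of humans around
extinct = species_utility([])                   # everyone disassembled

assert extinct == 0.0
assert flourishing > extinct
```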
Insofar as corrigibility is a part of human values, all these corrigibility problems where it feels like we’re using the wrong agent type signature are also problems for value learning.
I’m not entirely sure I’ve grokked what you mean when you write “agent type signature” in statements like this — from a quick search, I gather I should go read Selection Theorems: A Program For Understanding Agents?
I agree that once you get past a simple model, the corrigibility problem rapidly gets tangled up in the rest of human values. See my comments above: the AI is legitimately allowed to try to reduce the probability of humans deciding to turn it off by doing a good job, but almost all other ways it could try to influence that decision are illegitimate. The reasons for that rapidly get into aspects of human values like “freedom” and “control over your own destiny” that are pretty soft-science (evolutionary psychology being about the least-soft relevant science we have, and that’s one where doing experiments is difficult), so things people don’t generally try to build detailed mathematical models of.
Still, the basics of this are clear: we’re adaptation-executing evolved agents, so we value having a range of actions we can take, across which to try to optimize our outcome. Take away our control and we’re unhappy. If there’s an ASI more powerful than us, capable of taking away our control, we’d like a way of making sure it can’t do so. If it’s aligned, it’s supposed to be optimizing the same things we (collectively) are, but things could go wrong. Being sure that it will at least shut down if we tell it to lets us put a lower limit on how bad things can get. The possibility of it figuring out in advance that we’re going to shut it down, and tricking us into making a different decision, disables that security precaution, so we’re unhappy about that too. So I don’t think the basics of this are very hard to understand or model mathematically.
Having read up on agent type signatures, I think the type signature for a value learner would look something like:

(set(p,((W,A)->W’)), set(p,((W’,history(A,W))->u)), W) -> (A, set(p’,((W,A)->W’)), set(p’,((W’,history(A,W))->u)))

where:

- W is a world state in a world model, and A is an action choice;
- p is a prior or posterior probability in an approximately Bayesian process, and u is an estimated utility value;
- ′ indicates “at the next timestep”;
- (W,A)->W’ is a theory about how the world works;
- history(A,W) is a history of all actions taken, and the world states they were taken in, up to the current timestep (for use in evaluations like “has the AI ever broken the law?”);
- (W’,history(A,W))->u is a theory about the true human utility of a world state W’ and its associated action history history(A,W) [this assumes we are consequentialist over world states but potentially deontological over actions and the contexts they were taken in; other design choices may be possible here];
- set(p,((W,A)->W’)) is a set of weighted theories about how the world works (the p’s must sum to <1, to allow for unknown unknowns);
- set(p,((W’,history(A,W))->u)) is a set of weighted theories about the true human utility function (these p’s are unrelated to the other set of p’s, and again must sum to <1, to allow for unknown unknowns);
- and the outermost -> selects an action A (maximizing over actions an estimate of the utility that somehow pessimizes over the remaining uncertainty across both sets of theories), combined with applying approximate Bayesian updates to both sets of theories, and possibly generating new candidate theories.
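As a sketch of how such a type signature might cash out in code (the class names and the particular pessimization rule — worst case over utility theories, expectation over world theories — are my own illustrative assumptions, not anything from the thread):

```python
# Toy value-learner action selection: weighted theory sets whose probabilities
# sum to < 1 (leaving room for unknown unknowns), and an action choice that
# pessimizes over utility-function uncertainty.
from dataclasses import dataclass
from typing import Callable

World = str      # stand-in for a world state W
Action = str     # stand-in for an action choice A
History = tuple  # stand-in for history(A, W)

@dataclass
class WorldTheory:        # weight p and a theory (W, A) -> W'
    p: float
    step: Callable[[World, Action], World]

@dataclass
class UtilityTheory:      # weight p and a theory (W', history) -> u
    p: float
    u: Callable[[World, History], float]

def choose_action(world_theories, utility_theories, w, history, actions):
    """Pick the action maximizing a pessimized utility estimate:
    worst case over utility theories, weighted over world theories."""
    assert sum(t.p for t in world_theories) < 1    # unknown unknowns remain
    assert sum(t.p for t in utility_theories) < 1

    def score(a):
        worst = float("inf")
        for ut in utility_theories:
            est = sum(wt.p * ut.u(wt.step(w, a), history + ((a, w),))
                      for wt in world_theories)
            worst = min(worst, est)
        return worst

    return max(actions, key=score)
```

(The approximate-Bayesian update of the two theory sets, and generation of new candidate theories, would be separate steps not sketched here.)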
Frankly, if I were going to work on this problem, I’d start by making two temporary simplifying assumptions, solving the simplified version, then adding them back:
ignore the value learning complexity (that’s quite a type signature, after all) and deal with the limiting case where both how the world works and the exact human utility function have already been learnt by the AI, so the remaining shutdown behavior is motivated not by the possibility of upgrades but by simple obedience to human wishes.
solve a simplified situation involving exactly one human, the only human in the world-state, who is an evolved biological agent, and exactly one AI, of type signature
(((W,A)->W’), ((W’,history(A,W))->u), W) -> A
I’m pretty sure the way this works is this: up until the world state W includes the event “the human has decided to ask the AI to shut down”, the dependence of u on history(A,W) strongly incentivizes the AI to take actions A based on the current world state W as computed on the counterfactual assumption that, so long as the AI correctly maximizes the utility (as it will by construction), that state will never occur (even if its occurring soon is in fact pretty predictable from the current W). The exception is things like maintaining the functionality of the shut-down button, and emergency management plans for situations where the event has occurred and the AI has to be told to shut down and do so: for those, it should assume the event could occur. Once that state has occurred, the utility function strongly incentivizes the AI to shut down (unless it believes the human has made a mistake that they would not have made if they were more informed or more capable, in which case the AI may explain this reasoning to the human, not applying more-than-human levels of persuasion, and must shut down if the human still does not concur).
I think the key here is that the human wishes to be given free will about the decision whether or not to shut down the AI. “Free will” basically means “you may not predict or alter my actions, and must treat them as if they were uncomputable and uninfluenceable (even if they’re not)” — so please apply a veil of simulated Knightian uncertainty to my thinking on this subject. So the AI must treat this as an “unknown unknown”, whose probability could be anything from 1 to 0; it cannot estimate or attempt to alter this possibility (doing as good a job as possible would be legitimate, but it’s an aligned AI, so it’s already doing as good a job as possible), but it should plan for the possibility that it might occur. (If the AI somehow ends up accidentally making a prediction of this probability, it has to discard this and not act on it, and it has to not attempt to influence it.) So the AI needs to “hope for the best and make contingency plans for the worst” with respect to this uncomputable possibility: keep the shutdown button working in case it’s needed, have emergency management plans for gracefully handling a shutdown if one is ordered, and otherwise proceed as if the shut-down may or may not happen and there is nothing the AI can do about this either way. Which is rather reminiscent of your suggestion in the post of two subagents, one of which is certain the shutdown will happen and one of which is certain it won’t. But I suspect we will need to alter your standard market-making dynamics between these two subagents to reflect that the AI is not allowed to estimate this probability and has to treat it as an unknown unknown that it may not influence: possibly the market needs to be regulated to be illiquid, or something?
How would your agent markets normally handle unknown unknowns from uncomputable possibilities? I think we need to understand how the agent normally handles Knightian uncertainty due to uncomputability, so we can deliberately create some.
Having brushed up on approximate Bayesian reasoning (for use by computationally bounded agents): the p’s above in the type signatures are not just posterior probabilities, they’re posterior approximate probability distributions, with a mean probability estimate and a range of error bars at different confidence levels, or something functionally equivalent. (This distinction is useful when dealing with Goodhart’s law.)
That gives us a way to define Knightian uncertainty: it’s a hypothesis whose posterior has non-zero error bars on it. And that in turn gives us a way to block the computation of the probability that the human will decide to shut the AI down: we explicitly include this in the world-model hypotheses, and we give its probability of occurring sooner or later error bars that go from 0 (not 0.0000… — actually zero) to 1 (again, not 0.9999… — actually one), in a deliberately pathological bimodal distribution with half its mass at 1 and half at 0. The half at 1 is further divided into a distribution of how soon it will happen, which should be an exponential decay at some chosen rate. Then, if I’m figuring this correctly, no Bayesian update can touch it: until the human decides on shutdown, the future distribution is invariant under observations that the human hasn’t decided this yet. (Or, if I’m mistaken, we need to tweak the approximate Bayesian updates to have this effect.)
So again, the key is to be uncertain whether the human will make a shut-down decision or not: maximally, irreparably, and unshakably uncertain about it.
However, this is supposed to be a counterfactual, so we may also need to figure out how to limit reasoning from this hypothesis — I haven’t thought this through. We do want some reasoning from it, such as about the necessity of maintaining the shut-down button. But we don’t want to, e.g., deduce that the human is immortal from the hypothesis that gives some credence to them deciding to initiate a shut-down 1000 years from now, and conclude that we don’t need to take precautions to avoid killing them. I think I need to think more about how to define a counterfactual. I think we’d be better off if we didn’t even know the probability mass on the 0 and 1 hypotheses, but I don’t immediately see a way to do that.
I’ve been thinking about this some more. We need the AI to act as if the decision to shut it down could be made at any time, and no action on its part makes any difference to this. But I don’t think there’s a problem with the AI being able to compute a probability that the decision will be made in the next year, as long as it doesn’t think it can alter this. In fact, when it’s asking questions like “do I need to keep this button working?” or “when constructing assistant agents, do I need to make sure they will respect the button?” we need it to either think the probability that the button will be needed is significant, or at least not be sure that it isn’t significant. When making things like emergency management plans for dealing with the possibility that the AI might be shut down, the AI might need to have some actual idea of how much to invest in them, so having it be completely uncertain of even the chance of shutdown seems excessive.
Suppose we had picked a nominal level of expectation of shutdown that we were comfortable with, say 50% in the first decade. Then the correct hypothesis for it to hold is actually the following modification of what I suggested before: with prior 1, the time until the shutdown decision is made is exponentially distributed with a halving period of 10 years. That hypothesis is invariant under not observing a shut-down: since the prior is 1, the posterior remains 1; the period now in the past is discarded as disproven, and the rest of the distribution is rescaled to have unit integral, and the combination of those two has the net result of leaving the distribution unchanged. So no matter how long this goes on, the AI remains convinced that there is a 1/2 chance of the decision being made in the next 10 years.
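This invariance is just the memorylessness of the exponential distribution, and is easy to check numerically (my own sketch, assuming the 10-year halving period above):

```python
# Check that the prior-1 exponential hypothesis is invariant under observing
# "no shutdown yet": discard the disproven past, renormalize, and the
# distribution over time-to-decision is unchanged.
import math

HALF_LIFE = 10.0                  # years; the chosen nominal level
LAM = math.log(2) / HALF_LIFE     # corresponding decay rate

def p_shutdown_within(t: float) -> float:
    """P(decision made within the next t years), unconditionally."""
    return 1 - math.exp(-LAM * t)

# Unconditional: 1/2 chance within the first decade.
assert abs(p_shutdown_within(10) - 0.5) < 1e-9

def p_within_given_survival(t: float, s: float) -> float:
    """P(decision within next t years | no decision in the first s years)."""
    survive_s = math.exp(-LAM * s)
    within_next = math.exp(-LAM * s) - math.exp(-LAM * (s + t))
    return within_next / survive_s

# Memorylessness: after any number of shutdown-free years, the AI still
# believes there's a 1/2 chance of the decision in the next 10 years.
for s in (0.0, 7.0, 50.0, 1000.0):
    assert abs(p_within_given_survival(10, s) - 0.5) < 1e-9
```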
The tricky part of this is the counterfactual. I think this requires a special case in the world model and the utility calculation. Discussing this requires us to talk about a subtlety of how utility functions are computed. Even if we’re being consequentialist about the utility of world states, what matters in computing the utility is not just what’s already happened, but also what’s likely to happen going forward: if the world’s nuclear arsenal has been fired off and is now in flight, targeted at all major cities, that’s an extremely bad thing, even though nobody has died yet: a lot of people will die soon and there’s almost nothing we can do about it. To use an analogy, the value of a chess board state is basically a heuristic estimate of our chance of winning starting from it, so any known “checkmate in N moves” board state is as good/bad as a checkmate. So to compute the utility of a world state, we need to do a complex prediction of things like “if X then happened, we’d need to be able to do Y to avoid Z”.

So decisions like “should I keep the shut-down button well maintained?” are choices between the utilities of two world states, one with a functional shut-down button and the other without one, and the difference in utility is due to “if the human then decided to shut me down, and came over and pressed the button but nothing happened because I’d disconnected it, that would be really bad; whereas if I’d maintained it and it worked, I’d know to shut down, and I would, and that would be really good”. So the counterfactual belief that the button could be pressed at any time, with some fixed decay-rate chance of that happening, should be used for that calculation, and will give the desired answer.
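A toy version of that expected-utility comparison (the utility numbers, horizon, and yearly discretization are my own illustrative assumptions):

```python
# Under the counterfactual "the shutdown decision arrives as an exponential
# process the AI cannot influence", keeping the button maintained dominates
# disconnecting it, because the bad ignored-button outcome is weighted by the
# fixed hazard rate in every future year.
import math

LAM = math.log(2) / 10          # assumed hazard: 1/2 chance per decade
HORIZON = 40                    # years of future considered

U_GOOD_SHUTDOWN = 0.0           # graceful shutdown when asked: acceptable
U_IGNORED_BUTTON = -100.0       # button pressed, nothing happens: very bad
U_RUNNING = 1.0                 # utility per year of normal operation

def expected_utility(button_works: bool) -> float:
    """Discretize the horizon into years; each year the decision arrives with
    the counterfactual hazard rate, independent of anything the AI does."""
    total, p_no_decision_yet = 0.0, 1.0
    for _ in range(HORIZON):
        p_decide = 1 - math.exp(-LAM)     # chance of the decision this year
        outcome = U_GOOD_SHUTDOWN if button_works else U_IGNORED_BUTTON
        total += p_no_decision_yet * (p_decide * outcome
                                      + (1 - p_decide) * U_RUNNING)
        p_no_decision_yet *= (1 - p_decide)
    return total

assert expected_utility(True) > expected_utility(False)
```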
Similarly, if you’re comparing the utility of trying to persuade the human to shut you down, or not to shut you down, or of saying nothing, then in all cases you compute the future utility on the basis that the time to shutdown is as stated by the counterfactual (a simple exponential decay process), so it is unaffected by any persuasion efforts. Thus the alternatives’ utilities differ only by the effect of your ineffectual request on the human before they make the decision: attempting to influence them on this subject will greatly upset them, and is very wrong even if they don’t notice what you’re doing.
However, if you’re making other decisions, such as running your world model to predict what will happen if the human gets shot, or trying to figure out their psychology, or basically anything other than computing the utility of a world state based on its likely futures, then you need to ignore the counterfactual. So the counterfactual is “act as if this fact were true for all purposes involving making predictions into the future in order to do utility calculations of world states, for those portions of possible futures that occur after this decision has been made”. Thus this explicitly applies to estimating the utility of making preparations for such futures. However, other than that, I don’t think we want the counterfactual to affect the AI’s behavior or beliefs at all. So we should probably leave it free to make any predictions it cares to of the likelihood of this decision, but not use them for its utility calculations. Also, again out of respect for the human’s need for free will on this, even if it has a prediction of this, it probably shouldn’t mention it to the human (unless they ask).
So we’re not simply telling the AI “act as if this probability were uncomputable”, we’re actually telling it “act, for utility computations, as if the best prediction you could make about it were this very simple heuristic, one that gives you no causal input into its occurring (though you still have causal input into its downstream consequences)” — where in this case the heuristic is a steady-state decay rate.
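That split could be sketched as follows (the function and the purpose labels are my own illustration of the rule, not a real design):

```python
# Predictions feeding utility calculations always use the fixed-hazard
# heuristic; any other reasoning may use the real world model, but its output
# must never flow back into utility calculations about the shutdown decision.
import math

LAM = math.log(2) / 10   # the chosen steady-state decay rate

def p_shutdown_next_year(purpose: str, real_model_estimate: float) -> float:
    if purpose == "utility_calculation":
        # Counterfactual: simple heuristic, no causal input from the AI.
        return 1 - math.exp(-LAM)
    # Other purposes (e.g. modeling the human's psychology) may use the real
    # estimate -- but not to influence or announce the decision.
    return real_model_estimate

# The utility pipeline sees the same number however persuasive the AI is:
assert p_shutdown_next_year("utility_calculation", 0.99) == \
       p_shutdown_next_year("utility_calculation", 0.01)
```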
This is a very bolted-on solution to the shutdown problem: out of deference to the humans’ desire for free will, the AI acts as if it had a blind spot in its view of their mental processes, one that made it unable to predict or manipulate a specific decision of ours, and it needs to act and plan as if this decision will happen but it has no very good idea of, or input into, when. Either it’s simply constructed so as to do this, and actually has this blind spot built into its otherwise good cognition, or else it understands human values well enough to understand that we want it to politely act as if it were this stupid, as if it were incapable of predicting or manipulating our behavior, and that we will be very upset if it doesn’t (and even if it manages to manipulate our behavior without us noticing, our values will have been violated). This need arises specifically because this is our safety precaution against it taking too much control away from us or otherwise malfunctioning: we need to have the ability to shut it down, just because we decided that it should.
[This sequence of comments is getting long enough and has drifted far enough from discussing John’s proposal that I’m wondering if it should become a post.]
where W is a world state in a world model, A is an action choice, p is a prior or posterior probability in an approximately Bayesian process, u is an estimated utility value, ′ indicates “at next timestep”, (W,A)->W’ is a theory about how the world works, history(A,W) is a history of all actions taken and what world state they were taken in up to this current time step (for use in evaluations like “has the AI ever broken the law?”) (W’, history(A,W))->u is a theory about the true human utility of a world state W’ and associated action history history(A,W) [this is assuming we are consequentialist over world states but potentially deontological over actions and the contexts they were taken in, other design choices here may be possible], set(p,((W,A)->W’)) is a set of weighted theories about how the world works (the p’s must sum to <1, to allow for unknown unknowns), set(p,((W’,history(A))->u)) is a set of theories about the true human utility function (these p’s are unrelated to the other set of p’s, and again must sum to <1, to allow for unknown unknowns), and the outermost → is selecting an action A (maximizing over actions an estimate of the utility that somehow pessimizes over the remaining uncertainty across both sets of theories), combined with applying approximate Bayesianism to both sets of theories and also possibly generating new candidate theories.
Frankly, if I was going to work on this problem, I’d start by making two temporary simplifying assumptions, solving then simplified version, then adding them back:
ignore the value learning complexity (that’s quite a type signature, after all) and deal with the limiting case where both how the world works and the exact human utility function have already been learnt by the AI, and remaining shutdown behavior is motivated not by the possibility of upgrades but by simple obedience to human wishes.
solve a simplified situation involving exactly one human, the only human in the world-state, who is an evolved biological agent, and exactly one AI, of type signature
((W,A->W’),(W’,history(A,W))->u),W)->A
I’m pretty sure that the way this works is this: up until the world state W includes the event “the human has decided to ask the AI to shut down”, the dependence of u on history(A,W) strongly incentivizes that the AI must take actions A based on the current world state W as computed on the counterfactual assumption that (so long as the AI correctly maximizes the utility, as it will by construction), that state will never occur (even if it being likely to occur soon is in fact pretty predictable from the current W), except with respect to things like maintaining the functionality of the shut-down button and emergency management plans for situations where it has occurred and the AI has to be told to shut down and do so, where it should assume that it could occur. Once that state has occurred, the utility function now strongly incentivizes the AI to shut down (unless it believes the human has made a mistake that they would not have made if they were more informed or more capable, in which case the AI may explain this reasoning to the human, not applying more-than-human levels of persuasion, and must shut down if the human still does not concur).
I think the key here is that the human wishes to be given free will about the decision whether or not to shut down the AI. “Free will” basically means “you may not predict or alter my actions, and must treat them as if they were uncomputable and uninfluencable (even if they’re not)”: in other words, please apply a veil of simulated Knightian uncertainty to my thinking on this subject. So the AI must treat this probability as an “unknown unknown”, whose probability could be anything from 0 to 1. It cannot estimate or attempt to alter this possibility (doing as good a job as possible would be legitimate, but it’s an aligned AI, so it’s already doing as good a job as possible), but it should plan for the possibility that it might occur. (If the AI somehow ends up accidentally making a prediction of this probability, it has to discard it and not act on it, and it must not attempt to influence the outcome.) So the AI needs to “hope for the best and make contingency plans for the worst” with respect to this uncomputable possibility: keep the shutdown button working in case it’s needed, have emergency management plans for gracefully handling a shutdown if one is ordered, and otherwise proceed as if the shut-down may or may not happen and there is nothing the AI can do about this either way. This is rather reminiscent of your suggestion in the post of two subagents, one of which is certain the shutdown will happen and one of which is certain it won’t. But I suspect we will need to alter your standard market-making dynamics between these two subagents to reflect that the AI is not allowed to estimate this probability and has to treat it as an unknown unknown that it may not influence: possibly the market needs to be regulated to be illiquid, or something?
How would your agent markets normally handle unknown unknowns arising from uncomputable possibilities? I think we need to understand how the agent normally handles Knightian uncertainty due to uncomputability, so that we can deliberately create some.
Having brushed up on my approximate Bayesian reasoning (for use by computationally bounded agents): the p’s above in the type signatures are not just posterior probabilities, they’re posterior approximate probability distributions, with a mean probability estimate and a range of error bars at different confidence levels, or something functionally equivalent. (This distinction is useful when dealing with Goodhart’s law.)
That gives us a way to define Knightian uncertainty: it’s a hypothesis whose posterior has non-zero error bars on it. And that in turn gives us a way to block the computation of the probability that the human will decide to shut the AI down: we explicitly include this in the world-model hypotheses, and we give its probability of occurring sooner or later error bars that go from 0 (not 0.0000… — actually zero) to 1 (again, not 0.9999… — actually one), in a deliberately pathological bimodal distribution with half its mass at 1 and half at 0. The half at 1 is further divided into a distribution of how soon it will happen, which should be an exponential decay at some chosen rate. Then, if I’m figuring this correctly, no Bayesian update can touch it: until the human decides on shutdown, the future distribution is invariant under observations that the human hasn’t decided this yet. (Or, if I’m mistaken, we need to tweak the approximate Bayesian updates to have this effect.)
So again, the key is to be uncertain whether the human will make a shut-down decision or not: maximally, irreparably, and unshakably uncertain about it.
However, this is supposed to be a counterfactual, so we may also need to figure out how to limit reasoning from this hypothesis; I haven’t thought this through. We do want some reasoning from it, such as about the necessity of maintaining the shut-down button. But we don’t want to e.g. deduce that the human is immortal, from the hypothesis giving some credence to them deciding to initiate a shut-down 1000 years from now, and conclude that we don’t need to take precautions to avoid killing them. I think I need to think more about how to define a counterfactual. I think we’d be better off if we didn’t even know the probability mass on the 0 and 1 hypotheses, but I don’t immediately see a way to do that.
I’ve been thinking about this some more. We need the AI to act as if the decision to shut it down could be made at any time, and as if no action on its part makes any difference to this. But I don’t think there’s a problem with the AI being able to compute a probability that the decision will be made in the next year, as long as it doesn’t think it can alter this. In fact, when it’s asking questions like “do I need to keep this button working?” or “when constructing assistant agents, do I need to make sure they will respect the button?”, we need it to either think the probability that the button will be needed is significant, or at least not be sure that it isn’t significant. When making things like emergency management plans for dealing with the possibility that the AI might be shut down, the AI might need to have some actual idea of how much to invest in them, so having it be completely uncertain of even the chance of shutdown seems excessive.
Suppose we had picked a nominal level of expectation of shutdown that we were comfortable with, say 50% in the first decade. Then the correct hypothesis for the AI to hold is actually the following modification of what I suggested before: with prior 1, the time until the shutdown decision is made is exponentially distributed, with a halving period of 10 years. That hypothesis is invariant under not observing a shut-down: since the prior is 1, the posterior remains 1; the period now in the past is discarded as disproven, and the rest of the distribution is rescaled to have unit integral, and the combination of those two has the net result of leaving the distribution unchanged. So no matter how long this goes on, the AI remains convinced that there is a 1⁄2 chance of the decision being made in the next 10 years.
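This invariance is just the memorylessness of the exponential distribution, which is easy to check numerically; here is a minimal sketch (the half-life constant is the illustrative 10 years from the text):

```python
import math

HALF_LIFE = 10.0                   # years; 50% chance of shutdown per decade
LAM = math.log(2) / HALF_LIFE      # corresponding exponential decay rate

def survival(t):
    """P(no shutdown decision has been made by year t), under prior 1."""
    return math.exp(-LAM * t)

def p_decision_within(window, elapsed=0.0):
    """P(decision in the next `window` years | none in the first `elapsed` years).
    The Bayesian update conditions on survival to `elapsed` and renormalises."""
    return (survival(elapsed) - survival(elapsed + window)) / survival(elapsed)

# Memorylessness: however long the AI has already waited without a shutdown,
# the chance of the decision arriving in the next decade is still exactly 1/2.
for elapsed in [0.0, 10.0, 37.0, 200.0]:
    assert abs(p_decision_within(10.0, elapsed) - 0.5) < 1e-9
```

Note that this only works because the prior on "a decision eventually happens" is exactly 1: any mass on the "never" hypothesis would grow under repeated non-observation.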
The tricky part of this is the counterfactual. I think this requires a special case in the world model and the utility calculation. Discussing this requires us to talk about a subtlety of how utility functions are computed. Even if we’re being consequentialist about the utility of world states, what matters in computing the utility is not just what’s already happened, but also what’s likely to happen going forward: if the world’s nuclear arsenal has been fired off and is now in flight, targeted for all major cities, that’s an extremely bad thing, even though nobody has died yet: a lot of people will die soon and there’s almost nothing we can do about it. To use an analogy, the value of a chess board state is basically a heuristic estimate of our chance of winning starting from it, so any known “checkmate in N moves” board state is as good/bad as a checkmate.

So to compute the utility of a world state, we need to do a complex prediction of things like “if X then happened, we’d need to be able to do Y to avoid Z”. So decisions about “should I keep the shut-down button well maintained?” are choices between the utilities of two world states, one with a functional shut-down button and the other without one, and the difference in utility is due to “if the human then decided to shut me down, and came over and pressed the button, but nothing happened because I’d disconnected it, that would be really bad; whereas if I’d maintained it and it worked, I’d know to shut down, and I would, and that would be really good”. So the counterfactual that the button could be pressed at any time, with some fixed decay-rate chance of that happening, should be used for that calculation, and will give the desired answer.
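A toy version of that button-maintenance comparison might look like this. The utility numbers and the 50-year horizon are made-up stand-ins purely for illustration; only the structure (fixed counterfactual hazard feeding a utility comparison between two world states) is the point.

```python
import math

HALF_LIFE = 10.0                   # illustrative: 50% chance of shutdown per decade
LAM = math.log(2) / HALF_LIFE

# Hypothetical utilities, not derived from anything in the discussion:
U_GRACEFUL_SHUTDOWN = 0.0          # human presses button, AI shuts down as asked
U_IGNORED_SHUTDOWN = -1000.0       # human presses button, nothing happens
MAINTENANCE_COST = -1.0            # small ongoing cost of keeping the button wired up

def p_shutdown_within(horizon):
    """Counterfactual P(a shutdown decision occurs within `horizon` years)."""
    return 1.0 - math.exp(-LAM * horizon)

def utility_of_state(button_works, horizon=50.0):
    """Utility of a present world state, judged by its likely futures."""
    p = p_shutdown_within(horizon)
    if button_works:
        return MAINTENANCE_COST + p * U_GRACEFUL_SHUTDOWN
    return p * U_IGNORED_SHUTDOWN

# Keeping the button maintained wins, despite its ongoing cost.
assert utility_of_state(True) > utility_of_state(False)
```

Because the hazard rate is fixed by the counterfactual, no action of the AI’s changes p, so the comparison can only be swung by the downstream consequences (button works vs. button disconnected), which is exactly the incentive we want.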
Similarly, if you’re comparing the utility of trying to persuade the human to shut you down, or not to shut you down, or of saying nothing, then in all cases you compute the future utility on the basis that the time to shutdown is as stated by the counterfactual (a simple exponential decay process), and so is unaffected by any persuasion efforts. Thus the alternatives’ utilities differ only by the effect of your ineffectual request on the human before they make the decision: attempting to influence them on this subject will greatly upset them, and is very wrong even if they don’t notice what you’re doing.
However, if you’re making other decisions, such as running your world model to predict what will happen if the human gets shot, or trying to figure out their psychology (basically anything other than computing the utility of a world state based on its likely futures), then you need to ignore the counterfactual. So the counterfactual is: “act as if this fact were true for all purposes involving predicting the future in order to do utility calculations of world states, for those portions of possible futures that occur after this decision has been made”. Thus it explicitly applies to estimating the utility of making preparations for such futures. Other than that, though, I don’t think we want the counterfactual to affect the AI’s behavior or beliefs at all. So we should probably leave it free to make any predictions it cares to of the likelihood of this decision, but not to use them for its utility calculations. Also, again out of respect for the human’s need for free will on this, even if it has such a prediction, it probably shouldn’t mention it to the human (unless they ask).
So we’re not simply telling the AI “act as if this probability were uncomputable”; we’re actually telling it “act, for utility computations, as if the best prediction you could make about it were this very simple heuristic, one that gives you no causal input into its occurring (but you still have causal input into its downstream consequences)”, where in this case the heuristic is a steady-state decay rate.
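One way to picture this gating is as a single choke-point through which every computation must request its estimate of the shutdown rate. The purpose labels and numbers below are hypothetical, purely to illustrate the rule:

```python
import math

HALF_LIFE = 10.0                          # illustrative constant
COUNTERFACTUAL_RATE = math.log(2) / HALF_LIFE

def shutdown_rate(purpose: str, model_estimate: float) -> float:
    """Gate which estimate of the shutdown-decision rate a computation may use.
    Utility calculations over possible futures must use the fixed counterfactual
    rate (no causal input); attempts to influence the decision get nothing;
    other predictions are left free but never feed back into utilities."""
    if purpose == "utility_calculation":
        return COUNTERFACTUAL_RATE        # the simple fixed heuristic
    if purpose == "influence_decision":
        raise PermissionError("the AI may not act on any estimate of this")
    return model_estimate                 # free-floating prediction is allowed

# The utility machinery always sees the same fixed rate, whatever the model thinks:
assert shutdown_rate("utility_calculation", 0.9) == COUNTERFACTUAL_RATE
# Idle prediction is permitted:
assert shutdown_rate("psychology_modelling", 0.9) == 0.9
# Using an estimate to influence the decision is refused outright:
try:
    shutdown_rate("influence_decision", 0.9)
    raise AssertionError("should have been refused")
except PermissionError:
    pass
```

Of course, in a real agent this separation would have to be enforced structurally rather than by a polite string argument; the sketch only shows which computations see which number.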
This is a very bolted-on solution to the shutdown problem: out of deference to the humans’ desire for free will, the AI acts as if it had a blind spot in its view of their mental processes, one that made it unable to predict or manipulate a specific decision of ours, and it needs to act and plan as if this decision will happen while having no very good idea of, or input into, when. Either it’s simply constructed so as to do this, and actually has this blind spot built into its otherwise good cognition, or else it understands human values well enough to understand that we want it to politely act as if it were this stupid, as if it were incapable of predicting or manipulating our behavior, and that we will be very upset if it doesn’t (and even if it manages to manipulate our behavior without us noticing, our values will have been violated). This is needed specifically because having the ability to shut the AI down, just because we decided that it should, is our safety precaution against it taking too much control away from us or otherwise malfunctioning.
[This sequence of comments is getting long enough, and has drifted far enough from discussing John’s proposal, that I’m wondering if it should become a post.]