Having brushed up on my approximate Bayesian reasoning (for use by computationally bounded agents), I’ll note that the p’s above in the type signatures are not just posterior probabilities: they’re posterior approximate probability distributions, with a mean probability estimate and error bars at various confidence levels, or something functionally equivalent. (This distinction is useful when dealing with Goodhart’s law.)
That gives us a way to define Knightian uncertainty: it’s a hypothesis whose posterior has non-zero error bars on it. And that in turn gives us a way to block the computation of the probability that the human will decide to shut the AI down: we explicitly include this in the world-model hypotheses, and we give its probability of occurring sooner or later error bars that go from 0 (not 0.0000… — actually zero) to 1 (again, not 0.9999… — actually one), in a deliberately pathological bimodal distribution with half its mass at 1 and half at 0. The half at 1 is further divided into a distribution over how soon it will happen, which should be an exponential decay at some chosen rate. Then, if I’m figuring this correctly, no Bayesian update can touch it: until the human decides on shutdown, the future distribution is invariant under observations that the human hasn’t decided this yet. (Or, if I’m mistaken, we need to tweak the approximate Bayesian updates to have this effect.)
So again, the key is to be uncertain whether the human will make a shut-down decision or not: maximally, irreparably, and unshakably uncertain about it.
However, this is supposed to be a counterfactual, so we may also need to figure out how to limit reasoning from this hypothesis — I haven’t thought this through. We do want some reasoning from it, such as about the necessity of maintaining the shut-down button. But we don’t want the AI to, e.g., deduce that the human is immortal from the hypothesis that gives some credence to them deciding to initiate a shutdown 1000 years from now, and so conclude that it doesn’t need to take precautions to avoid killing them. I think I need to think more about how to define a counterfactual. I think we’d be better off if we didn’t even know the probability mass on the 0 and 1 hypotheses, but I don’t immediately see a way to do that.
I’ve been thinking about this some more. We need the AI to act as if the decision to shut it down could be made at any time, and as if no action on its part makes any difference to this. But I don’t think there’s a problem with the AI being able to compute a probability that the decision will be made in the next year, as long as it doesn’t think it can alter this. In fact, when it’s asking questions like “do I need to keep this button working?” or “when constructing assistant agents, do I need to make sure they will respect the button?”, we need it to either think the probability that the button will be needed is significant, or at least not be sure that it isn’t significant. When making things like emergency-management plans for dealing with the possibility that the AI might be shut down, the AI might need some actual idea of how much to invest in them, so having it be completely uncertain of even the chance of shutdown seems excessive.
Suppose we had picked a nominal level of expectation of shutdown that we were comfortable with, say 50% in the first decade. Then the correct hypothesis for it to hold is actually the following modification of what I suggested before: with prior 1, the time until the shutdown decision is made is exponentially distributed with a halving period of 10 years. That hypothesis is invariant under not observing a shutdown: since the prior is 1, the posterior remains 1; the period now in the past is discarded as disproven, and the rest of the distribution is rescaled to have unit integral, and the combination of those two leaves the distribution unchanged. So no matter how long this goes on, the AI remains convinced that there is a 1⁄2 chance of the decision being made in the next 10 years.
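For concreteness, here is the memorylessness calculation behind that invariance (a standard property of the exponential distribution, written with the 10-year halving period; T is the time until the shutdown decision and s is the elapsed time with no shutdown observed):

$$P(T > s + t \mid T > s) \;=\; \frac{P(T > s+t)}{P(T > s)} \;=\; \frac{2^{-(s+t)/10}}{2^{-s/10}} \;=\; 2^{-t/10} \;=\; P(T > t)$$

In particular, $P(T \le s + 10 \mid T > s) = 1/2$ for every elapsed time $s$, which is exactly the “still a 1⁄2 chance within the next decade” behaviour described above.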
The tricky part of this is the counterfactual. I think this requires a special case in the world model and the utility calculation. Discussing this requires us to talk about a subtlety of how utility functions are computed. Even if we’re being consequentialist about the utility of world states, what matters in computing the utility is not just what’s already happened, but also what’s likely to happen going forward: if the world’s nuclear arsenal has been fired off and is now in flight, targeted at all major cities, that’s an extremely bad thing, even though nobody has died yet: a lot of people will die soon and there’s almost nothing we can do about it. To use an analogy, the value of a chess board state is basically a heuristic estimate of our chance of winning starting from it, so any known “checkmate in N moves” board state is as good/bad as a checkmate. So to compute the utility of a world state, we need to do a complex prediction of things like “if X then happened, we’d need to be able to do Y to avoid Z”. So decisions like “should I keep the shut-down button well maintained?” are choices between the utilities of two world states, one with a functional shut-down button and the other without one, and the difference in utility is due to “if the human then decided to shut me down, and came over and pressed the button but nothing happened because I’d disconnected it, that would be really bad; whereas if I’d maintained it and it worked, I’d know to shut down, and I would, and that would be really good”. So the counterfactual belief that the button could be pressed at any time, with some fixed decay-rate chance of that happening, should be used for that calculation, and will give the desired answer. Similarly, if you’re comparing the utility of trying to persuade the human to shut you down, or not to shut you down, or of saying nothing, in all cases you compute the future utility on the basis that the time to shutdown is as stated by the counterfactual, a simple exponential decay process, and so is unaffected by any persuasion efforts. Thus the alternatives’ utilities differ only by the effect of your ineffectual request on the human before they make the decision, which is that your attempting to influence them on this subject will greatly upset them, and is very wrong even if they don’t notice what you’re doing.
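As a minimal sketch of that button-maintenance comparison (in Python, with made-up utility numbers and a single fixed horizon; the names and values are mine, purely to illustrate the structure):

```python
import math

# Counterfactual hazard: the time until the shutdown decision is exponentially
# distributed with a 10-year halving period, and nothing the AI does changes it.
HALVING_PERIOD_YEARS = 10.0
LAMBDA = math.log(2) / HALVING_PERIOD_YEARS

def p_decision_within(years: float) -> float:
    """Counterfactual probability that the shutdown decision happens within `years`."""
    return 1.0 - math.exp(-LAMBDA * years)

# Illustrative, made-up utilities for the branches of the future.
U_SHUTDOWN_OBEYED = 0.0         # human presses a working button, AI shuts down
U_BUTTON_DISCONNECTED = -100.0  # human presses a dead button: very bad
U_NO_DECISION = 1.0             # business as usual over the horizon

def utility_of_world_state(button_works: bool, horizon_years: float = 10.0) -> float:
    """Utility of the present state = expectation over counterfactual futures."""
    p = p_decision_within(horizon_years)  # identical whatever the AI chooses
    u_if_decided = U_SHUTDOWN_OBEYED if button_works else U_BUTTON_DISCONNECTED
    return p * u_if_decided + (1.0 - p) * U_NO_DECISION

print(utility_of_world_state(button_works=True))   # maintained button:    0.5
print(utility_of_world_state(button_works=False))  # disconnected button: -49.5
```

The persuasion comparison works the same way: because p_decision_within is held fixed by the counterfactual, persuading for, persuading against, and saying nothing all yield the same shutdown-timing term, and the alternatives differ only by the (negative) effect of the attempted influence on the human.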
However, if you’re making other decisions, such as running your world model to predict what will happen if the human gets shot, or trying to figure out their psychology, or basically anything other than computing the utility of a world state based on its likely futures, then you need to ignore the counterfactual. So the counterfactual is “act as if this fact were true for all purposes involving making predictions into the future in order to do utility calculations of world states, for those portions of possible futures that occur after this decision has been made”. Thus it explicitly applies to estimating the utility of making preparations for such futures. Other than that, though, I don’t think we want the counterfactual to affect the AI’s behavior or beliefs at all. So we should probably leave it free to make any predictions it cares to of the likelihood of this decision, but not use them for its utility calculations. Also, again out of respect for the human’s need for free will on this, even if it has a prediction of this, it probably shouldn’t mention that to the human (unless they ask).
So we’re not simply telling the AI “act as if this probability were uncomputable”; we’re actually telling it “act, for utility computations, as if the best prediction you could make about it were this very simple heuristic, one that gives you no causal input into it occurring (though you still have causal input into its downstream consequences)” — where in this case the heuristic is a steady-state decay rate.
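A sketch of that separation, purely illustrative (the names and routing structure below are mine, not part of any proposal): the world model’s own forecast of the decision may exist and be used for ordinary prediction, but the utility computation is wired to the fixed decay-rate heuristic.

```python
import math
from enum import Enum, auto

class Purpose(Enum):
    UTILITY_COMPUTATION = auto()   # evaluating world states over their futures
    ORDINARY_PREDICTION = auto()   # everything else: psychology, planning, etc.

LAMBDA = math.log(2) / 10.0  # fixed decay rate: 10-year halving period

def counterfactual_estimate(years: float) -> float:
    """Fixed heuristic: exponential decay, with no causal input from the AI."""
    return 1.0 - math.exp(-LAMBDA * years)

def world_model_estimate(years: float) -> float:
    """Stand-in for the AI's real, unrestricted forecast of the decision."""
    return 0.05  # arbitrary placeholder value

def p_shutdown_decision_within(years: float, purpose: Purpose) -> float:
    """Route the query: the counterfactual is used if and only if the estimate
    feeds a utility calculation over post-decision futures."""
    if purpose is Purpose.UTILITY_COMPUTATION:
        return counterfactual_estimate(years)
    return world_model_estimate(years)
```

Routing the query, rather than deleting the world-model estimate entirely, matches the suggestion above: the AI stays free to make predictions it simply doesn’t use for utility calculations (or volunteer to the human unasked).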
This is a very bolted-on solution to the shutdown problem: out of deference to the humans’ desire for free will, the AI acts as if it had a blind spot into their mental processes, one that made it unable to predict or manipulate a specific decision of ours, and it needs to act and plan as if this decision will happen but it has no very good idea of, or input into, when. Either it’s simply constructed so as to do this, and actually has this blind spot built into its otherwise good cognition, or else it understands human values well enough to understand that we want it to politely act as if it were this stupid, as if it were incapable of predicting or manipulating our behavior, and that we will be very upset if it doesn’t (and even if it manages to manipulate our behavior without us noticing, our values will have been violated). This need arises specifically because this is our safety precaution against it taking too much control away from us or otherwise malfunctioning: we need to have the ability to shut it down, just because we decided that it should.
[This sequence of comments is getting long enough, and has drifted far enough from discussing John’s proposal, that I’m wondering if it should become a post.]