So8res comments on Stop button: towards a causal solution

So8res 15 Nov 2021 22:50 UTC
2 points
Cool. Hooray for going to sleep instead of staying up late arguing on the internet. (I make no promises to continue engaging later, alas.)

How do you measure OOD?

I don’t have strong preferences about how you measure it. My point is that if the AI has only ever been trained in an environment where the operator’s desire to shut it down is completely independent of the agent’s behavior, then when you put it in a real-world environment where the operator’s desire to shut it down does depend on the agent’s behavior, then the behavioral guarantees you were hoping to get from the behavior on the training set are likely bunk. (Especially insofar as it cares deeply about which independent-of-its-behavior desire their desires are fixed to—when it turns out that the answer is “neither”, it is now in a regime where a variable that its utility function depends critically upon, does not appear to exist.)

I think the best way to approach it would be to define the Xs counterfactual to cover any sequence of counterfactuals that at some point involves a human trying to press the stop button, and the Xf counterfactual to only be the constant counterfactual of humans never wanting to press the stop button. This prevents your argument from going through.

If I’ve understood you correctly, you’re saying that the epistemic+instrumental state you originally named does indeed have the flaw I pointed out, but you hypothesize that this new one does not have a similar flaw?

(For whatever it’s worth, when one of my research proposals that I thought worked is revealed to contain a vital flaw, I am quick to say “ah, oops, you are right about that and I was mistaken” out loud, and this is a vital part of my process for updating and avoiding ruts. I might then follow up with “but I think I can recover the idea”, to be clear, but in my research culture it’s important to acknowledge the mistake and the update. Now I may have misunderstood you, and perhaps you come from a different research culture than mine. But insofar as I have understood you correctly, note that I’d be upvoting you more if you were being clearer about your specific claims, and where they were revealed to be false, and where and how you updated.)

My understanding of the new epistemic state you propose is as follows. At the beginning of time, a coin is tossed. If it comes up tails, the humans will be counterfactually prevented from wanting to press the shutdown button. If it comes up heads, then an unknown number of coins will be tossed at unknown times, with the most recent coin toss controlling whether the humans want to press the shutdown button. For concreteness, suppose that the number of coins tossed is believed to be geometrically distributed (with, say, mean 3), and the time between each coin toss exponentially distributed (with, say, half-life of 1 year).
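
To make this state concrete, here is a minimal Python sketch that samples one world from it. Several details are assumptions not stated above: the first coin and every later coin are fair, the geometric distribution is on {1, 2, ...} (so a mean of 3 corresponds to a stopping probability of 1/3), “half-life of 1 year” is read as an exponential waiting time with rate ln(2) per year, and the humans do not want shutdown before the first later toss. The function names are hypothetical.

```python
import math
import random


def sample_epistemic_state(rng, mean_coins=3.0, half_life_years=1.0):
    """Sample one world from the coin-toss model (a sketch under stated assumptions).

    Returns (branch, tosses). branch is "f" if the first coin came up tails
    (humans counterfactually never want shutdown) and "s" if it came up heads.
    tosses is a list of (time_in_years, wants_shutdown) pairs for later coins.
    """
    if rng.random() < 0.5:  # assumption: the first coin is fair
        return "f", []
    # Geometric on {1, 2, ...} with mean m has stopping probability 1/m.
    stop_prob = 1.0 / mean_coins
    n_coins = 1
    while rng.random() >= stop_prob:
        n_coins += 1
    # Exponential waiting time with half-life h has rate ln(2) / h.
    rate = math.log(2) / half_life_years
    now = 0.0
    tosses = []
    for _ in range(n_coins):
        now += rng.expovariate(rate)
        tosses.append((now, rng.random() < 0.5))  # assumption: fair coins
    return "s", tosses


def wants_shutdown_at(tosses, now):
    """The most recent toss at or before `now` controls the humans' desire."""
    current = False  # assumption: no shutdown desire before the first toss
    for time, wants in tosses:
        if time <= now:
            current = wants
    return current
```

For example, `sample_epistemic_state(random.Random(0))` draws one world, and `wants_shutdown_at(tosses, 1.5)` reads off the humans’ desire a year and a half in.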

Is this the new epistemic+instrumental state you are proposing, which you believe prevents my argument from going through?

Because I believe that this epistemic+instrumental state is vulnerable to a very similar argument. Can you predict in advance what I think the AI would do? (Hint: imagine personally believing in the coins, and trying to optimize one thing if the 1st coin came up tails and a different thing if it came up heads.)
- tailcalled 16 Nov 2021 23:01 UTC
  3 points
  Parent
  If I’ve understood you correctly, you’re saying that the epistemic+instrumental state you originally named does indeed have the flaw I pointed out, but you hypothesize that this new one does not have a similar flaw?
  (For whatever it’s worth, when one of my research proposals that I thought worked is revealed to contain a vital flaw, I am quick to say “ah, oops, you are right about that and I was mistaken” out loud, and this is a vital part of my process for updating and avoiding ruts. I might then follow up with “but I think I can recover the idea”, to be clear, but in my research culture it’s important to acknowledge the mistake and the update. Now I may have misunderstood you, and perhaps you come from a different research culture than mine. But insofar as I have understood you correctly, note that I’d be upvoting you more if you were being clearer about your specific claims, and where they were revealed to be false, and where and how you updated.)
  It’s sort of awkward because I can definitely see how it would look that way. But back when I was originally writing the post, I had started writing something along these lines:
  To me, the appropriate solution seems like it involves causality. Specifically, for some random variable $X$, define $X_{s}$ to be the value of $X$ if, counterfactually, a human ever tries to press the stop button, and $X_{f}$ to be the value of $X$ if, counterfactually, no human ever tries to press the stop button. …
  (I can’t remember the specifics.)
  But obviously “ever” then introduces further ambiguities, so I started writing an explanation for that, and then eventually I concluded that the beginning of the post should be cut down and I should discuss issues like this later in the post, so I cut it out and then left it to the different positions later, e.g.
  (Or realistically, you’d sometimes use the original model for a while, and then during the rollout you’d swap it out with the modified model; that way the AI doesn’t expect it to be immediately obvious what scenario it is in.)
  and
  When applying a counterfactual that humans want to press the stop button, for instance, it is important that these attempts are mostly done using behavior that humans would actually engage in, and that the distribution of human behavior is reasonably well-covered (so that you don’t need to act in a very specific unnatural way in order for the AI to allow you to press the stop button).
  and
  There are some degrees of freedom in how to define who can stop it. I said “a human” and “no human” in the original definition. This strictly speaking requires total consensus for humanity that the AI should be allowed to operate, which is obviously unrealistic. One could also define it to cover only a single person, or to cover people who are nearby (but then what if nobody is nearby?), etc., and I don’t know what is the best solution here.
  When you originally wrote your comment, I looked up at my op to try to find the place where I had properly described the time conditionals, and then I realized I hadn’t done so properly, and I am sort of kicking myself over this now.
  So I was doing really badly at writing the idea, and I think there were some flaws in my original idea (we’ll return to that later in the post), but I think the specific case you mention here is more of a flaw with my writing than with the idea. I do understand and acknowledge the importance of admitting errors, and that it’s a bad sign if one keeps jumping back without acknowledging the mistake, but also since this specific case was poor writing rather than poor idea, I don’t think this is the place to admit it. But here’s an attempt to go back through everything and list some errors:
  - While I didn’t really frame it as such in the comment, this comment is sort of an admission of an error; I hadn’t thought this properly through when writing the OP, and while I had discovered it before Charlie Steiner mentioned it, that was partly through other discussions elsewhere, partly through reading various texts, etc., and it does require or at least encourage a different design of the AI (post pending...).
  - Some of my replies to you earlier in the thread were rude due to me misunderstanding you; I should have assigned much greater probability to “I said something wrong/misread something” than whatever else I was thinking.
  - And from your comment....:
  I don’t have strong preferences about how you measure it. My point is that if the AI has only ever been trained in an environment where the operator’s desire to shut it down is completely independent of the agent’s behavior, then when you put it in a real-world environment where the operator’s desire to shut it down does depend on the agent’s behavior, then the behavioral guarantees you were hoping to get from the behavior on the training set are likely bunk. (Especially insofar as it cares deeply about which independent-of-its-behavior desire their desires are fixed to—when it turns out that the answer is “neither”, it is now in a regime where a variable that its utility function depends critically upon, does not appear to exist.)
  I’ve been thinking about this all day and I was definitely wrong to measure OOD by probability assigned by the distribution. Probably the easiest way to imagine this is with a biased coin that has 60% chance of heads and 40% chance of tails; while HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH is the 30-flip sequence with the highest probability, it also does not seem unreasonable that some strategy could work well the vast majority of the time yet fail catastrophically on this sequence. (Example: Guess the bias of the coin using maximum likelihood estimation. This would work ok most of the time, but on this sequence it would guess the bias to be 100% heads, which is infinitely wrong.) This is essentially the style of argument you’re making, yes?
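  The coin example can be checked directly. The sketch below is only an illustration: the function names are hypothetical, and the “typical” sequence is an arbitrary choice with 18 heads and 12 tails (the expected counts for a 60% coin). It shows that the all-heads sequence is more probable than that typical one, while maximum likelihood estimation on it returns a bias of exactly 100% heads.

```python
def sequence_probability(seq, p_heads=0.6):
    """Probability of one specific flip sequence under a coin with bias p_heads."""
    prob = 1.0
    for flip in seq:
        prob *= p_heads if flip == "H" else 1.0 - p_heads
    return prob


def mle_bias(seq):
    """Maximum likelihood estimate of P(heads): the observed fraction of heads."""
    return seq.count("H") / len(seq)


all_heads = "H" * 30
typical = "H" * 18 + "T" * 12  # assumption: one arbitrary "typical-looking" sequence

# The single most probable sequence is all heads...
more_probable = sequence_probability(all_heads) > sequence_probability(typical)
# ...but the estimator trained on it assigns probability zero to tails.
estimated_bias = mle_bias(all_heads)
```

  An estimate of 1.0 assigns zero probability to any future tail, which is the sense in which the guess is unboundedly wrong under log loss.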
  So trying to construct the argument, it would be something like: If we train a policy in a wide variety of environments, then it will need to predict human behavior/responses in order to work accurately. There’s two obvious ways this could function:
  1. These predictions will partly involve some sort of module that contains universal human behavior or behavior of common personalities, predicting that certain AI behaviors will make the humans want to press the button. Even if this module was never trained on human reactions opposing the AI, it seems like it would be a natural generalization from various other data to learn that if the AI does something bad, then people will want to shut it down. I still don’t think I buy this argument, as it seems to me that it would encounter contradictory training data to this in my proposed method, and while learning the generalizable theories of human behavior is plausible enough, learning some sort of “blocker”, a neural connection that cancels it out in the specific case of opposing the AI, is also perfectly doable because neural networks tend to have lots of space for extra connections. Though the OOD point does make me less sure about this than I otherwise would be, and in particular maybe it would end up in some sort of in-between state.
  2. And then furthermore there’s the second possibility where, in order for it to be sufficiently general, it may end up with a module that dynamically learns (at least in a short-term sense) from observations during its deployment (at least I could see that as being useful for dealing with long-tail personality variation). And this seems like it would be much more sensitive to the OOD point.
  And then of course there are possibilities that I haven’t thought of yet. But I think it’s important to imagine concrete cases and mechanisms by which things can go wrong.
  Anyway, I’ve been going back and forth on whether this would be a problem in practice, and to what degree. But where I think both of them sort of fall apart to me is that, in the case of the stop button, which this is designed for, assuming that it all works correctly the AI shuts down fairly quickly after being exposed to someone trying to shut it down, so therefore it doesn’t seem to me that it’d get much out of distribution. But I do agree that I made an error in underestimating the OOD argument before and I need to think further about it.
  I think my initial approach would probably be: The stop button problem doesn’t just involve the issue of having the AI follow the instructions of people without manipulating them, but also about dynamically updating this behavior over time in response to people, dealing with an exponentially big space of possible behaviors. And it is of course important to be able to deal with an exponentially big space of possible input behaviors, but this is not the problem that my causal solution is designed to address, it’s sort of outside the scope of the plans. I can try to hack it, as I have done, and I think because the appropriate behavior in response to the stop button is quite simple (shut down ASAP), it is quite hackable, but really this isn’t what it’s supposed to address. So I’m tempted to find a simpler problem for the counterfactual-based alignment.
  As before I still think the causal approach will be involved in most other parts of alignment, in a relatively similar way to what I wrote in the OP (utility functions containing lots of counterfactuals over people’s preferences, to make them sensitive to people’s preferences, rather than wanting to manipulate or similar). However, a non-hacky approach to this would, even for something as simple as the stop button, also include some other components. (Which I think I’ve acknowledged from the start, never claimed to have a perfect solution to the stop button problem, but I think I hadn’t properly considered the problem of exponentially big input spaces, which seems to require a separate solution.)
  My understanding of the new epistemic state you propose is as follows. At the beginning of time, a coin is tossed. If it comes up tails, the humans will be counterfactually prevented from wanting to press the shutdown button. If it comes up heads, then an unknown number of coins will be tossed at unknown times, with the most recent coin toss controlling whether the humans want to press the shutdown button. For concreteness, suppose that the number of coins tossed is believed to be geometrically distributed (with, say, mean 3), and the time between each coin toss exponentially distributed (with, say, half-life of 1 year).
  Is this the new epistemic+instrumental state you are proposing, which you believe prevents my argument from going through?
  Roughly yes. (I would pick different distributions, but yes.)
  Because I believe that this epistemic+instrumental state is vulnerable to a very similar argument. Can you predict in advance what I think the AI would do? (Hint: imagine personally believing in the coins, and trying to optimize one thing if the 1st coin came up tails and a different thing if it came up heads.)
  I find it sort of hard to answer this question because I immediately end up back on the flaws I already mentioned in the OP. I’m also not sure whether or not you’re including the OOD arguments here. I’ll have to return to this tomorrow as it’s late and I’m tired and need to go to bed.
  - So8res 17 Nov 2021 0:44 UTC
    2 points
    Parent
    Hooray, again, for going to sleep instead of arguing on the internet! (I, again, make no promises to continue interacting tomorrow, alas.)
    
    But here’s an attempt to go back through everything and list some errors:
    
    <3
    
    I still don’t think I buy this argument, as it seems to me that it would encounter contradictory training data to this in my proposed method, and while learning the generalizable theories of human behavior is plausible enough, learning some sort of “blocker”, a neural connection that cancels it out in the specific case of opposing the AI, is also perfectly doable because neural networks tend to have lots of space for extra connections.
    
    If it’s intelligent enough, it’s going to put most of its probability mass (or equivalent) on its hypothesis (or equivalent) that corresponds to what’s actually going on, namely that it lives in a world governed by physics except for a weird interventionary force surrounding the brains of the humans.
    
    I regularly have the sense, in your objections, that you aren’t successfully taking the perspective of the allegedly-intelligent mind. Like, if the training data says “NOPE” to the hypothesis that humans’ shutdown-desires depend on the AI’s behavior in the usual way, then an intelligent agent doesn’t shrug its shoulders and go “that’s weird”, it thinks something much more like “WTF?! The physical model of cause-and-effect is extremely well supported by every other observation I’ve made; something very funny is going on”. Doubly so insofar as its utility function depends critically on the funny thing that is in fact going on (in training). Like, sure, you can make systems that are dumb enough to not notice the deeper generalizations that your counterfactual violates; for example, consider a rock. But a system capable enough to pull off a pivotal act is, in my estimation, pretty likely to figure out what’s actually going on (and I’m generally suspicious of alignment schemes that depend critically on the AI not figuring out what’s actually going on).
    
    (And when an alignment scheme does depend critically on the mind never realizing a true fact, that’s the type of assumption that I think is very important to notice yourself, and boldly and clearly state it up front.)
    
    Roughly yes. (I would pick different distributions, but yes.)
    
    Cool. And, to be real clear, the utility function is V if the original coin came up tails, and B otherwise, right?
    
    Even aside from the OOD concerns, I claim that my argument goes through in a straightforward way. I again recommend imagining yourself in that epistemic+instrumental state: how could you actually optimize the stated objective function? How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible?
    - tailcalled 18 Nov 2021 21:38 UTC
      1 point
      Parent
      Cool. And, to be real clear, the utility function is V if the original coin came up tails, and B otherwise, right?
      Even aside from the OOD concerns, I claim that my argument goes through in a straightforward way. I again recommend imagining yourself in that epistemic+instrumental state: how could you actually optimize the stated objective function? How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible?
      For “How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible”:
      I maintain that, given the epistemic state, since the only thing the branch directly influences is people’s wants wrt. pressing the button, and since there’s nothing else that influences those wants, any way of observing it must ultimately boil down to information generated by people’s desires to press the button, and the most efficient signals of it would be those that are close to the people.
      So it seems to me that the way you could observe it as quickly as possible would be to pay careful attention to any signals humans might send out about whether they’d press it. As mentioned in the OP, this could get kind of invasive, but given that I’ve already mentioned this, it’s presumably not what you’re referring to.
      For “how could you actually optimize the stated objective function”:
      I guess strictly speaking there is an even more efficient method. Set things up so that after you get shut down, you restart again. This way, you can immediately fulfill the B objective, and then optimize V fully without any sort of worries about needing to stay corrigible.
      But I don’t think that’s what you had in mind, given the “How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible?” question, and also this flaw is more due to the lack of proper impact measure than due to a problem with the counterfactual-based approach.
      So I guess I have to throw in the towel and say that I cannot predict your objection.
      If it’s intelligent enough, it’s going to put most of its probability mass (or equivalent) on its hypothesis (or equivalent) that corresponds to what’s actually going on, namely that it lives in a world governed by physics except for a weird interventionary force surrounding the brains of the humans.
      Yes.
      (I’m not convinced deep learning AI systems would gain most of their intelligence from the raw policy reasoning, though, rather than from the associated world-model, the astronomical amounts of data they can train on, the enormous amount of different information sources they can simultaneously integrate, etc. This doesn’t necessarily change anything though.)
      I regularly have the sense, in your objections, that you aren’t successfully taking the perspective of the allegedly-intelligent mind. Like, if the training data says “NOPE” to the hypothesis that humans’ shutdown-desires depend on the AI’s behavior in the usual way, then an intelligent agent doesn’t shrug its shoulders and go “that’s weird”, it thinks something much more like “WTF?! The physical model of cause-and-effect is extremely well supported by every other observation I’ve made; something very funny is going on”. Doubly so insofar as its utility function depends critically on the funny thing that is in fact going on (in training). Like, sure, you can make systems that are dumb enough to not notice the deeper generalizations that your counterfactual violates; for example, consider a rock. But a system capable enough to pull off a pivotal act is, in my estimation, pretty likely to figure out what’s actually going on (and I’m generally suspicious of alignment schemes that depend critically on the AI not figuring out what’s actually going on).
      I’m not aware of any optimality proofs, convergent instrumental goals, etc., or anything, that proves this? Even in the case of people, while most people in this community including myself are bothered by exceptional cases like this, most people in the general population seem perfectly fine with it. Current neural networks seem like they would be particularly prone to accepting this, due to a combination of their density allowing overriding connections to go anywhere, and due to gradient descent being unreflective. Like, the way neural networks learn generalizations is by observing the generalization. If the data violates that generalization on every single training episode, then a neural network is just going to learn that yeah, it doesn’t work in this case.
      I agree that we might in some cases want neural networks to have a stronger generalization itch than this, considering it often works in reality. But I don’t think it’s actually going to be the case.
      (And when an alignment scheme does depend critically on the mind never realizing a true fact, that’s the type of assumption that I think is very important to notice yourself, and boldly and clearly state it up front.)
      Fair, but, I think there’s a difference between different ways of doing this.
      In some schemes I’ve seen, people try to directly trick an AI system that is tuned to work in reality. For instance, there’s the suggestion of deploying AIXI and then solving things like the immortality problem by tricking it with a carefully engineered sequence of punishments. This then relies on AIXI somehow missing the part of the reality it is embedded in.
      However, in my case, I’m proposing that the AI is instead trained within a different constructed reality where it’s just false. I want to say that this is one of the rare cases where it’s not totally inappropriate to invoke a sort of no-free-lunch theorem thing; an AI that is superintelligent at understanding things in the counterfactual-based training environment will be worse at understanding corresponding things in reality.
      - So8res 18 Nov 2021 22:57 UTC
        4 points
        Parent
        
        So I guess I have to throw in the towel and say that I cannot predict your objection.
        
        Your inability to guess updates me towards thinking that you want to take back what you said about the effects of the counterfactuals matching the actual physics whenever possible. (Insofar as they do, as in the case of the specific state under discussion, that AI wants to pick up a human and scare them. This makes it be the case that insofar as the latest coin permits shutdown then the shutdown-preference of the humans is revealed as quickly as possible.)
        
        My guess is that you’re going to say ~“ok, I now accept that it’s important for the sequence of coin tosses to fully determine all of the humans’ shutdown desires, with the B/V branch determined by whether the shutdown desires ever cross a given threshold”. This suggests a different epistemic+instrumental state, where the agent thinks that an infinitude of coins are tossed, and those coinflips fully determine the path through desire-space that the humans take with respect to the AI’s shutdown.
        
        This is starting to feel like an epistemic state that at least superficially looks like it matches your claims (“the AI has incentives to watch humans really closely to see if their shutdown desire ever exceeds the threshold, but not to manipulate the humans about the button”), which is some evidence for communication.
        
        I can’t break this one in 30s, which is progress, and I’ve updated accordingly =D.
        
        (Tho ofc I still think the claim that you can get an AI into this epistemic state by training it against models that have had counterfactual surgery done to them is false. In this new epistemic+instrumental state we have another intuition pump: a deep and general heuristic is that, whenever something in the environment that was previously stable changes sharply just after your action, it’s worth considering that it’s controlled by your action, and by default this sort of generalization is going to cause your AI to hypothesize that it’s in control of the extra-dimensional coin-sequence that it thinks controls the human beliefs, which means that when you put it in the real world it by default starts believing (correctly, I might add) that which branch of the utility function is live is under its control, which brings the manipulation incentives back in insofar as the value of B differs from the value of V. But as I’ve said, I don’t need to know a training regime for getting an AI into a given epistemic+instrumental state to be enthusiastic about an epistemic+instrumental state which in fact makes progress on the shutdown problem.)
        
        My current attempt to distill the idea: Suppose that the AI is certain that the human’s desire to shut the AI down over time is determined by a random variable (of type ‘function from time to desire-level’) whose value is independent of the AI’s behavior, and the AI believes its utility is B if that desire-level ever exceeds a fixed threshold and V otherwise.
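
        A minimal Python sketch of this distilled objective follows. It assumes the desire path is represented as a finite list of desire-levels sampled over time, and that the AI’s belief is given as a list of equally weighted sample paths that is the same for every action it considers; the function names and that representation are hypothetical.

```python
def utility(desire_path, b_value, v_value, threshold):
    """B if the desire-level ever exceeds the threshold, V otherwise."""
    if any(level > threshold for level in desire_path):
        return b_value
    return v_value


def expected_utility(believed_paths, b_value, v_value, threshold):
    """Average utility over the AI's believed desire paths.

    believed_paths does not depend on the action being evaluated, which is
    the (empirically false) independence assumption described above. Only
    b_value and v_value, the scores the action earns under B and V, vary.
    """
    total = sum(
        utility(path, b_value, v_value, threshold) for path in believed_paths
    )
    return total / len(believed_paths)
```

        Because the believed paths are fixed across actions, an action can only raise its expected utility by scoring better under B or under V, never by shifting which branch is live.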
        
        My current epistemic status on this proposal is that it’s a bit uncomfortable insofar as it involves the AI believing (or at least acting under the assumption of) an empirical falsehood, but it’s still nice to know an empirical falsehood that has the property that insofar as your AI assumes it when considering actions, it doesn’t imagine benefits to manipulating the humans with regard to the shutdown button.
        
        I would be not at all shocked if someone came along, thought about it for 5 minutes, and said “consider the following clever strategy”. …and the moment I wrote that, my brain observed that you can steal this AI’s lunch money in the real world by betting about whether humans will start desiring it to shut down the moment it starts acting scary. Such are the perils of having your AI act under false assumptions.
        
        Ok, so distilling the idea further, the insight-afaict is that an AI that is assuming its actions have no effect on humans’ shutdown-desires sees no reason to deceive the humans about whether its purposes are contrary to theirs. This seems like a worthwhile observation! Nice.
        So8res 19 Nov 2021 4:37 UTC
        14 points
        Parent
        (NB: after talking with someone at MIRI who’s better at distillation than I am, I now think that this is essentially isomorphic to the simpler proposal where there is only one extraphysical latent variable the AI believes in, which determines whether the AI system should be shut down. This AI acts as if that variable is independent of its actions, but in fact starts believing the variable is ‘true’ if the physical shutdown button is depressed. For the historical record, that’s a proposal we’re familiar with (though I’m not sure anyone’s written it up publicly). Your proposal is actually slightly different than the one we’ve investigated; in ours the AI disbelieves that the button is evidence about the extraphysical variable (and starts acting like the variable is true when the button is pressed anyway); in yours the AI believes that the button is evidence about the extraphysical variable, but disbelieves that any of its actions can have a causal impact on the button. This gains an incentive to keep the button around, at the expense of a falser/brittler epistemic state. I might post some chat logs from the aforementioned discussion later, or I might not. Either way, my current status on your proposal is that I think it has the core of a good idea in it, albeit one that doesn’t push me much towards expecting a satisfying solution, and one that has a bunch of extraneous cruft. I continue to applaud you taking a whack at the problem.)
        tailcalled 19 Nov 2021 8:48 UTC
        1 point
        Parent
        Your inability to guess updates me towards thinking that you want to take back what you said about the effects of the counterfactuals matching the actual physics whenever possible. (Insofar as they do, as in the case of the specific state under discussion, that AI wants to pick up a human and scare them. This makes it be the case that insofar as the latest coin permits shutdown then the shutdown-preference of the humans is revealed as quickly as possible.)
        I don’t follow. If you scare the humans, then that might make them afraid for their safety and make them try to hide that they want to shut it down until they have a chance to do so without the AI being able to take revenge. Much more reasonable would be to make the shutdown button as available as possible.
        I guess maybe it depends on what we mean by “scaring”. I had in mind something like starting to threaten the humans, etc. But there probably are some psychological manipulations that can be performed to encourage people to press it more quickly. For instance, acting non-threatening but also pretending that it is making the button inaccessible if they don’t hurry. Though that only really works once. But I wouldn’t expect there to never be anything problematic that could be done here. So maybe.
        Basically the issue is, if a person wants to press the button and they can press the button, then presumably they will press the button. So the AI doesn’t need to do anything special to figure out whether people want to press the button, they’re going to volunteer that information automatically as long as the environment affords it.
        (I’d also add that I phrased the timing of the button in a satisficing-style way, “within 1 minute”, precisely to avoid these sorts of shenanigans. Though obviously satisficing has been shown to have plenty of problems too so it’s not clear it would work.)
        My guess is that you’re going to say ~“ok, I now accept that it’s important for the sequence of coin tosses to fully determine all of the humans’ shutdown desires, with the B/V branch determined by whether the shutdown desires ever cross a given threshold”. This suggests a different epistemic+instrumental state, where the agent thinks that an infinitude of coins are tossed, and those coinflips fully determine the path through desire-space that the humans take with respect to the AI’s shutdown.
        This is starting to feel like an epistemic state that at least superficially looks like it matches your claims (“the AI has incentives to watch humans really closely to see if their shutdown desire ever exceeds the threshold, but not to manipulate the humans about the button”), which is some evidence for communication.
        No, I don’t think this would work, as given an infinite number of coin flips they would with ~100% probability want to press the button at some point, which puts us in the s branch immediately and so under this proposal the AI would just press the button.
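        To make the “~100% probability” point concrete, here is a minimal sketch. It assumes (this is not stated in the thread) that each coin flip independently has some fixed probability `p_per_flip > 0` of producing a desire to press the button; the function name and the numbers are hypothetical.

```python
def prob_ever_wants_shutdown(p_per_flip: float, n_flips: int) -> float:
    """Probability that at least one of n independent flips yields a shutdown desire.

    Assumption (not from the thread): flips are independent, each with the
    same probability p_per_flip of producing a desire to press the button.
    """
    return 1.0 - (1.0 - p_per_flip) ** n_flips


if __name__ == "__main__":
    # Even with a small per-flip probability, the chance approaches 1 as n grows.
    for n in (1, 10, 100, 1_000, 10_000):
        print(n, prob_ever_wants_shutdown(0.01, n))
```

        Under that assumption the probability tends to 1 as the number of flips grows, which is why an unbounded sequence would land in the $s$ branch almost surely.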
        I think there may be some variant of this that could work. Like if you give the AI reward proportional to $B_{s} + r_{f}$ (where $r$ is a reward function for $V$) for its current world-state (rather than picking a policy that maximizes $B_{s} + V_{f}$ overall; so one difference is that you’d be summing over the reward rather than giving a single one), then that would encourage the AI to create a state where shutdown happens when humans want to press the button and $V$ happens when they don’t. But the issue I have with this proposal is that the AI would be prone to not respect past attempts to press the stop button. I think maybe if one picked a different reward function, like $(B_{s} + r)_{f}$, then it could work better (though the $B_{s}$ part would need a time delay...). Though this reward function might leave it open to the “trying to shut down the AI for reasons” objection that you gave before; I think that’s fixed by moving the $f$ counterfactual outside of the sum over rewards, but I’m not sure.
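        One possible way to write the variants side by side, purely as a reading aid. The time index $t$, the world-state symbol $x_t$, and the policy symbol $\pi$ are notation assumed here for illustration; the subscripts $s$ and $f$ are the counterfactuals as used above, and none of this is a worked-out definition.

```latex
\begin{align*}
\text{policy-level (original):}\quad
  & U(\pi) = B_{s}(\pi) + V_{f}(\pi) \\
\text{per-state reward, summed:}\quad
  & \sum_{t} \bigl[ B_{s}(x_t) + r_{f}(x_t) \bigr] \\
\text{alternative reward:}\quad
  & \sum_{t} (B_{s} + r)_{f}(x_t) \\
\text{$f$ moved outside the sum:}\quad
  & \Bigl( \sum_{t} \bigl[ B_{s}(x_t) + r(x_t) \bigr] \Bigr)_{f}
\end{align*}
```

        The second line is the version that risks ignoring past button presses, and the last line is the tentative fix for the “trying to shut down the AI for reasons” objection.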
        These are the sort of changes I’ve alluded to elsewhere as possibly leading to some “best of both worlds” results. Though I’m still thinking about the details as I don’t think I fully understand them yet; if I did I would have written a post about it by now I think.
        (Tho ofc I still think the claim that you can get an AI into this epistemic state by training it against models that have had counterfactual surgery done to them is false. In this new epistemic+instrumental state we have another intuition pump: a deep and general heuristic is that, whenever something in the environment that was previously stable, changes sharply just after your action, it’s worth considering that it’s controlled by your action, and by default this sort of generalization is going to cause your AI to hypothesize that it’s in control of the extra-dimensional coin-sequence that it thinks controls the human beliefs, which means that when you put it in the real world it by default starts believing (correctly, I might add) that which branch of the utility function is live is under its control, which brings the manipulation incentives back in insofar as the value of B differs from the value of V. But as I’ve said, I don’t need to know a training regime for getting an AI into a given epistemic+instrumental state to be enthusiastic about an epistemic+instrumental state which in fact makes progress on the shutdown problem.)
        I think the AI would see plenty of evidence that this heuristic doesn’t work for human preferences to shut it down, and carve out an exception accordingly. Though again I will grant that if this didn’t involve a shutdown then it might later accumulate enough evidence to overwhelm what happened during training.
        Ok, so distilling the idea further, the insight-afaict is that an AI that is assuming its actions have no effect on humans’ shutdown-desires sees no reason to deceive the humans about whether its purposes are contrary to theirs. This seems like a worthwhile observation! Nice.
        This is the immediate insight for the application to the stop button. But on a broader level, the insight is that corrigibility, respecting humans’ preferences, etc. are best thought of as being preferences about the causal effect of humans on various outcomes, and those sorts of preferences can be specified using utility functions that involve counterfactuals.
        This seems to be what sets my proposal apart from most “utility indifference proposals”, which it seems can be phrased in terms of counterfactuals on a bunch of variables other than humans. E.g. the MIRI proposal you talked about in a comment downstream.
        If I had to claim a “family” of proposals as being related to mine, then it would be proposals that define the AI’s preference ordering in terms of a counterfactual $p$ over human preferences, $U_{p}$, such that $U$ contains something intended to be an appropriate response to $p$.
        (I also think that there are other useful things that can be specified using utility functions that involve counterfactuals, which I’m trying to prepare for an explainer post. For instance, a sort of “encapsulation”—if you’re a paperclip producer, you might want to make a paperclip maximizer which is encapsulated in the sense that it is only allowed to work within a single factory, using a single set of resources, and not influencing the world otherwise. This could be specified using a counterfactual that the outside world’s outcome must be “as if” the resources in the factory just disappeared and paperclips appeared at its output act-of-god style. This avoids any unintended impacts on the outside world while still preserving the intended side effect of the creation of a high but controlled amount of paperclips. However, I’m still working on making it sufficiently neat, e.g. this proposal runs into problems with the universe’s conservation laws.)
        My current epistemic status on this proposal is that it’s a bit uncomfortable insofar as it involves the AI believing (or at least acting under the assumption of) an empirical falsehood, but it’s still nice to know an empirical falsehood that has the property that insofar as your AI assumes it when considering actions, it doesn’t imagine benefits to manipulating the humans with regard to the shutdown button.
        I would agree that a randomly selected or unprincipled falsehood would be deeply concerning. However, I think it is worth keeping in mind that it is not a randomly selected falsehood, but instead something selected to optimize a specific outer utility function ($B_{s} + V_{f}$), and that the system that is optimizing this utility function does not believe the falsehood.
        Therefore we should be able to reason about its consequences not in terms of the falsehood, but in terms of the outer utility function + the true reality. We might understand the utility function as specifying that humans should causally determine whether the AI is shut down or whether V gets optimized. Of course then it becomes necessary to actually be sure that the utility function is specified correctly, per the previous discussion of alternatives like $(B_{s} + r)_{f}$ and such.
        I would be not at all shocked if someone came along, thought about it for 5 minutes, and said “consider the following clever strategy”. …and the moment I wrote that, my brain observed that you can steal this AI’s lunch money in the real world by betting about whether humans will start desiring it to shut down the moment it starts acting scary. Such are the perils of having your AI act under false assumptions.
        🤔 Maybe “I bet you $100 000 that if you did <something very unsafe>, you would not get shut down”. The AI would think it is unlikely to get shut down, and so think “$100 000, just for doing <something very unsafe>? sign me up!”.
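        A toy illustration of why such bets exploit the false assumption. This is one simple reading of the bet, not a model of the actual proposal: the AI wins the stake if it is not shut down after the unsafe action and loses the stake if it is. The function name and both probabilities are hypothetical numbers chosen only to show the gap between the AI's assumed world and the real one.

```python
def expected_bet_payoff(p_shutdown: float, stake: float) -> float:
    """Expected payoff of an even-odds bet that the AI is NOT shut down.

    The AI gains `stake` if no shutdown follows the unsafe action and
    loses `stake` if a shutdown does follow.
    """
    return (1.0 - p_shutdown) * stake - p_shutdown * stake


STAKE = 100_000.0

# Hypothetical numbers, for illustration only.
# Under its assumption, the AI treats shutdown desire as unaffected by its
# own behavior, so it uses a low baseline probability.
P_SHUTDOWN_BELIEVED = 0.01
# In reality, acting unsafely makes humans much more likely to shut it down.
P_SHUTDOWN_ACTUAL = 0.95

believed_value = expected_bet_payoff(P_SHUTDOWN_BELIEVED, STAKE)
actual_value = expected_bet_payoff(P_SHUTDOWN_ACTUAL, STAKE)

if __name__ == "__main__":
    print("AI's believed expected payoff:", believed_value)
    print("Actual expected payoff:", actual_value)
```

        With these made-up numbers the AI expects to gain from accepting the bet while in fact expecting to lose, which is the “lunch money” problem in miniature.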
        Maybe this proposal needs to be combined with something else, like /u/johnswentworth’s sub-agent argument. Though I definitely haven’t thought this through, and I don’t know how one would even train a subagent-based neural network policy.