Oh, my mistake, I forgot to post the correction that made it not extortion.
Instead of threatening to destroy the AIs’ world, imagine the aliens instead offer to help them. Suppose the AIs can’t make their world a utopia on their own, for example because it’s nothing but a solid ball of ice. So the aliens would make their world a utopia as long as the AIs execute S. Then they would execute S.
I’m actually pretty skeptical of the idea that UDTAIs wouldn’t give in to extortion, but this is a separate point that wasn’t necessary to address in my example. Specifically, you say it’s unnatural to suppose the counterfactual “the aliens would threaten the AIs anyways, even if they won’t give in”. How is this any more unnatural than the counterfactual “the AI would avoid submitting to extortion, even if the aliens would threaten the AIs anyways”?
> Then they would execute S.
Are you saying this is the wrong thing to do in that situation? That just sounds like trade. (Assuming of course that we trust our AI’s reasoning about the likely consequences of doing S.)
> Specifically, you say it’s unnatural to suppose the counterfactual “the aliens would threaten the AIs anyways, even if they won’t give in”. How is this any more unnatural than the counterfactual “the AI would avoid submitting to extortion, even if the aliens would threaten the AIs anyways”?
It’s unnatural to assume that the aliens would threaten the AI without reasoning (possibly acausally) about the consequences of them making that threat, which involves reasoning about how the AI would respond, which makes the aliens involved in a mutual decision situation with the AI, which means UDTAI might have reason to not yield to the extortion, because it can (acausally) affect how the aliens behave (e.g. whether they decide to make a threat).
The problem is that, if the best percept-action mapping is S, then the UDTs on Earth would use it, too. Which would result in us being taken over. I’m not saying that it’s an irrational choice for the AIs to make, but it wouldn’t end well for us.
I’m having some trouble following your reasoning about extortion, though. Suppose both the aliens and AIs use UDT. I think you’re reasoning something like, “If the AIs commit to never be extorted no matter what the aliens would do, then the aliens wouldn’t bother to extort them”. But this seems symmetric to reasoning, “If the aliens commit to extorting and doling out the punishment no matter what the AIs would do, then the AIs wouldn’t bother to resist the extortion”. So I’m not sure why the second line of reasoning would be less likely to occur than the first.
Feel free to correct me if I misinterpreted.
Re: symmetry. I think you interpreted right. (Upvoted for symmetry comment.) Part of my original point was trying to say something like “it’s unnatural to have aliens making these sorts of threats without engaging in an acausal relationship with the UDTAI”, but yeah also I was assuming the threat-ignorer would “win” the acausal conflict, which doesn’t seem necessarily right. If the aliens are engaging that way, then yeah, I don’t know how to make threats vs. ignoring threats be asymmetric in a principled way.
I mean, the intuition is that there’s a “default” where the agents “don’t interact at all”, and deviations from the default can be trades if there’s upside chances over the default and threats if there’s downside chances. And to “escalate” from the “default” with a “threat” makes you the “aggressor”, and for some reason “aggressors” have the worse position for acausal conflict, maybe? IDK.
Well, I can’t say I have that intuition, but it is a possibility.
It’s a nice idea: a world without extortion sounds good. But remember that, though we want this, we should be careful not to let wishful thinking sway us.
In actual causal conflicts among humans, the aggressors don’t seem to be in a worse position. Things might be different for acausal UDT trades, but I’m not sure why they would be.
> I’m not saying that it’s an irrational choice for the AIs to make, but it wouldn’t end well for us.
I guess I’m auto-translating from “the AI uses UDT, but its utility function depends on its terminal values” into “the AI has a distribution over worlds (and utility functions)”, so that the AI is best thought of as representing the coalition of all those utility functions. Then either the aliens have enough resources to simulate a bunch of stuff that has more value to that coalition than the value of our “actual” world, or not. If yes, it seems like a fine trade. If not, there’s no issue.
Well, actually, I’m considering both the AIs on Earth and on the alien planet to have the same utility function. If I understand correctly, UDT says to maximize the expected utility of your own utility function a priori, rather than that of agents with different utility functions.
The issue is, some agents with the same utility function, in effect, have different terminal values. For example, consider a utility function saying something like, “maximize the welfare of creatures in the world I’m from.” Then, even with the same utility functions, the AIs in the alien world and the ones on Earth would have very different values.
> Then either the aliens have enough resources to simulate a bunch of stuff that has more value to that coalition than the value of our “actual” world, or not. If yes, it seems like a fine trade.
I don’t think so. Imagine the alien-created utopia would be much less good than the one we could make on Earth. For example, suppose the alien-created utopia would have a utility of 1 for the AIs there, and the one on Earth would have a utility of 10. And otherwise the AIs would have a utility of 0. But suppose there are a million times more AIs in the alien world than on Earth. Then it would be around a million times more likely a priori that the AI would find itself in the alien world than on Earth. So the expected utility of using S would be approximately (999999/1000000)·1 + (1/1000000)·0 ≈ 1.
And the expected utility of not using S, and instead building a utopia yourselves, would be approximately (999999/1000000)·0 + (1/1000000)·10 ≈ 0.
As you see, the AIs would still choose to execute S, even though this would provide less moral value. It could also kill us.
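For concreteness, here’s a minimal Python sketch of that calculation, using only the numbers stated in the example above (the exact prior weights are illustrative):

```python
# Minimal sketch of the expected-utility comparison above, assuming the
# example's numbers: a prior putting ~a million times more weight on being
# an AI in the alien world, utility 1 for the alien-built utopia, 10 for an
# Earth-built utopia, and 0 otherwise.

p_alien = 999999 / 1000000   # prior probability of being in the alien world
p_earth = 1 / 1000000        # prior probability of being on Earth

u_alien_utopia = 1           # utility of the (worse) alien-built utopia
u_earth_utopia = 10          # utility of the utopia the Earth AIs could build

# Executing S: the alien-world AIs get their utopia; Earth gets nothing.
eu_S = p_alien * u_alien_utopia + p_earth * 0

# Refusing S: the alien world stays a ball of ice; Earth gets its utopia.
eu_refuse = p_alien * 0 + p_earth * u_earth_utopia

print(eu_S)       # 0.999999 (≈ 1)
print(eu_refuse)  # 1e-05    (≈ 0)
# An agent maximizing this prior expectation picks S, as argued above.
```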
I don’t know how to understand the prior that the AI puts over worlds (the thing that says, a priori, that there’s 1000000 of this kind and 1 of that kind) as anything other than part of its utility function. So this doesn’t seem like a problem with UDT, but a problem with the utility function. Maybe your argument does show that we want to treat uncertainty about the utility function differently than other uncertainty? Like, when we resolve uncertainty that’s “merely about the world”, as in for example the transparent Newcomb’s problem, we still want to follow the updateless policy that’s best a priori. But maybe your argument shows that resolving uncertainty about the utility function can’t be treated the same way; when we see that we’re a UDTAI for humans, we’re supposed to actually update, and stop optimizing for other people.
> I don’t know how to understand the prior that the AI puts over worlds (the thing that says, a priori, that there’s 1000000 of this kind and 1 of that kind) as anything other than part of its utility function.
Could you explain your reasoning? The utility function is a fixed function. The AI already knows it and does not need to associate a probability with it. Remember that both the AIs in the alien world and the AIs on Earth have the same utility function.
Saying it’s a million times more likely to end up in the alien world is a question about prior probabilities, not utility functions. What I’m saying is that, a priori, the AI may think it’s far more probable that it would be an AI in the alien world, and that this could result in very bad things for us.
What’s the difference between setting prior probabilities vs. expressing how much you’re going to try to optimize different worlds?
They’re pretty much the same. If you could come up with a prior that would make the AI convinced it would be on Earth, then this could potentially fix the problem. However, coming up with a prior probability distribution that guarantees the AI is in the nebulous concept of “Earth, as we imagine it” sounds very tough. Also, this could interfere with the reliability of the AI’s reasoning. Thinking that it’s guaranteed to be on Earth is just not a reasonable thing to think a priori. This irrationality may make the AI perform poorly in other ways.
Still, it is a possible way to fix the issue.
Well, so “expressing how much you’re going to try to optimize different worlds” sounds to me like it’s equivalent to / interchangeable with a multiplicative factor in your utility function.
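Here’s a tiny numeric illustration of that interchangeability, under the assumption that an updateless agent ranks policies by an unnormalized sum of weight × utility over worlds (the specific numbers are just for illustration):

```python
# Multiplying a world's prior weight by k and dividing its utility by k
# leaves every weight-times-utility sum unchanged, so no ranking of
# policies can distinguish the two descriptions.

def score(weighted_worlds):
    # Unnormalized "weight * utility" score an updateless policy ranks by.
    return sum(weight * utility for weight, utility in weighted_worlds)

a = [(999999, 1.0), (1, 10.0)]           # (prior weight, utility) per world
b = [(999999 * 8, 1.0 / 8), (1, 10.0)]   # alien world: weight x8, utility /8

print(score(a), score(b))   # 1000009.0 1000009.0 -- identical
```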
Anyway, re/ the rest of your comment, my (off the cuff) proposal above was to let the AI be uncertain as to what exactly this “Earth” thing is, and to let it be *updateful* (rather than updateless) about information about what “Earth” means, and generally about information that clarifies the meaning of the utility function. So AIs that wake up on Earth will update that “the planet I’m on” means Earth, and will only care about Earth; AIs that wake up on e.g. Htrae will update that “the planet I’m on” is Htrae, and will not care about Earth. The Earth AI will not have already chosen a policy of S, since it doesn’t in general choose policies updatelessly. This is analogous to how children imprint on lessons and values they get from their environment; they don’t keep optimizing timelessly for all the ways they could have been, including ways that they now consider bad, even though they can optimize timelessly in other ways.
One question would be, is this a bad thing to do? Relative to being updateless, it seems like caring less about other people, or refusing to bargain / coordinate to realize gains from trade with aliens. On the other hand, maybe it avoids killing us in the way you describe, which seems good. Otoh, maybe this is trying to renege on possible bargains with the Htrae people, and is therefore not in our best interests overall.
Another question would be, is this stable under reflection? The usual argument is: if you’re NOT updateless about some variable X (in this case X = “the planet I’m on (and am supposed to care about)”), then before you have resolved your uncertainty about X, you can realize gains from trade between possible future versions of yourself: by doing things that are very good according to [you who believes X=Htrae] but are slightly bad according to [you who believes X=Earth], you increase your current overall expectation of utility. And both the Htraeans and the Earthians will have wanted you to indeed decide (before knowing who in particular this would benefit) to follow a policy of making policy decisions under uncertainty that increase the total expected utility in advance of you knowing who you’re supposed to be optimizing for.
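A toy numeric version of that argument, with made-up illustrative numbers (nothing here is fixed by the discussion):

```python
# Before resolving X, committing to an action that is very good if
# X = Htrae and slightly bad if X = Earth raises the current expected
# utility, so both possible futures endorse the commitment in advance.

p_earth, p_htrae = 0.5, 0.5   # assumed prior uncertainty about X
gain_if_htrae = 10            # very good by Htrae-lights
loss_if_earth = -1            # slightly bad by Earth-lights

expected_gain = p_htrae * gain_if_htrae + p_earth * loss_if_earth
print(expected_gain)          # 4.5 > 0: the pre-update agent takes the deal
```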
Maybe the point is that since probabilities and utilities can be marginally interchanged for each other, there’s no determinate “utility function” that one could be updateful about while being updateless about the remaining “probabilities”. And therefore the above semi-updateful thing is incoherent, or indeterminate (or equivalent to reneging on bargains).
So this goes back to my comment above that the alien threateners are just setting up a trade opportunity between you and the Htraeans, and maybe it’s a good trade, and if so it’s fine that you die because that’s what you wanted on net. But it does seem counterintuitive that if I’m better at pointing to my utility function, or something, then I have a better bargaining position?
The semi-updateful thing is more appealing when I remember that it can still bargain with its cousins later if it wants to. The issue is whether that bargaining can be made mutually transparent even if it’s happening later (after real updates). You can only acausally bargain with someone if you can know that some of your decision making is connected with some of theirs (for example by having the exact same structure, or by having some exactly shared structure and some variance with a legible relationship to the shared structure as in the Earth-AI/Htrae-AI case), so that you can decide for them to give you what you want (by deciding to give them what they want). If you’re a baby UDT who might grow up to be Earthian or Htraean, you can do the bargaining for free because you are entirely made of shared structure between the pasts of your two possible futures. But there’s other ways, maybe, like bargaining after you’ve grown up. So to some extent updateless vs updateful is a question of how much bargaining you can, or want to, defer, vs bake in.
I think your semi-updateless idea is pretty interesting. The main issue I’m concerned about is finding a way to update on the things we want to have updated on, but not on the things we don’t want updated on.
As an example, consider Newcomb’s problem. There are two boxes. A superintelligent predictor will put $1000 in one box and $10 in the other if it predicts you will only take one box. Otherwise it doesn’t add money to either box. You see one is transparent and contains $1000.
I’m concerned the semi-updateless agent would reason as follows: “Well, since there’s money in the one box, there must be money in the other box. So, clearly that means this ‘Earth’ thing I’m in is a place in which there is money in both boxes in front of me. I only care about how well I do in this ‘Earth’ place, and clearly I’d do better if I got the money from the second box. So I’ll two-box.”
But that’s the wrong choice. Because agents who would two-box end up with $0.
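A minimal sketch of that variant, treating the prediction as exact (the payoff function below is just a restatement of the setup two comments up):

```python
# The predictor fills the boxes ($1000 in the transparent box, $10 in the
# other) only if it predicts you will take just one box; otherwise both
# boxes are empty.

def payoff(policy):
    boxes_filled = (policy == "one-box")   # exact prediction of your policy
    if policy == "one-box":
        return 1000 if boxes_filled else 0
    return 1000 + 10 if boxes_filled else 0   # two-box: take both

print(payoff("one-box"))   # 1000
print(payoff("two-box"))   # 0 -- agents whose policy is to two-box after
                           #      seeing the money are never shown the money,
                           #      so the updateful reasoning above never pays
```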
One intuitive way this case could work out, is if the SUDT could say “Ok, I’m in this Earth. And these Earthians consider themselves ‘the same as’ (or close enough) the alt-Earthians from the world where I’m actually inside a simulation that Omega is running to predict what I would do; so, though I’m only taking orders from these Earthians, I still want to act timelessly in this case”. This might be sort of vacuous, since it’s just referring back to the humans’ intuitions about decision theory (what they consider “the same as” themselves) rather than actually using the AI to do the decision theory, or making the decision theory explicit. But at least it sort of uses some of the AI’s intelligence to apply the humans’ intuitions across more lines of hypothetical reasoning than the humans could do by themselves.
Something seems pretty weird about all this reasoning though. For one thing, there’s a sense that you sort of “travel backwards in logical time” as you think longer in normal time. Like, first you don’t know about TDT, and then you invent TDT, and UDT, and then you can do UDT better. So you start making decisions in accordance with policies you’d’ve wanted to pick “a priori” (earlier in some kind of “time”). But like what’s going on? We could say that UDT is convergent, as the only thing that’s reflectively stable, or as the only kind of thing that can be pareto optimal in conflicts, or something like that. But how do we make sense of our actual reasoning before having invented UDT? Is the job of that reasoning not to invent UDT, but just to avoid avoiding adopting UDT?
I don’t know how to formalize the reasoning process that goes into how we choose decision theories. And I doubt anyone does. Because if you could formalize the reasoning we use, then you could (indirectly) formalize decision theory itself as being, “whatever decision theory we would use given unlimited reflection”.
I don’t really think UDT is necessarily reflectively stable, or the only decision theory that is. I’ve argued previously that I, in certain situations, would act essentially as an evidential decision theorist. I’m not sure what others think of this, though, since no one actually ever replied to me.
I don’t think UDT is Pareto optimal in conflicts. If the agent is in a conflict with an irrational agent, then the resulting interaction between the two agents could easily be non-Pareto-optimal. For example, imagine a UDT agent is in a conflict with the same payoffs as the prisoner’s dilemma. And suppose the agent it’s in conflict with is a causal decision theorist. Then the causal decision theorist would defect no matter what the UDT agent would do, so the UDT agent would also defect, and then everyone would do poorly.
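A stylized sketch of that point, assuming standard prisoner’s-dilemma payoffs (the discussion doesn’t fix any numbers) and a deliberately simplified “cooperate only with linked agents” stand-in for UDT:

```python
# Standard PD payoffs (assumed): (C,C)=(3,3) is Pareto optimal, (D,D)=(1,1) is not.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def cdt_move(opponent):
    return "D"   # defection causally dominates, whoever the opponent is

def udt_move(opponent):
    # Stylized UDT: cooperate only when the opponent's choice is linked to
    # its own (another UDT agent); best-respond to a fixed defector otherwise.
    return "C" if opponent == "UDT" else "D"

print(PAYOFFS[udt_move("UDT"), udt_move("UDT")])   # (3, 3): Pareto optimal
print(PAYOFFS[udt_move("CDT"), cdt_move("UDT")])   # (1, 1): not Pareto optimal
```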
Yeah I don’t know of a clear case for those supposed properties of UDT.
By Pareto optimal I mean just, two UDT agents will pick a Pareto optimal policy. Whereas, say, two CDT agents may defect on each other in a PD.
This isn’t a proof, or even really a general argument, but one reason to suspect UDT is convergent, is that CDT would modify to be a sort of UDT-starting-now. At least, say you have a CDT agent, and further assume that it’s capable of computing the causal consequences of all possible complete action-policies it could follow. This agent would replace itself with P-bot, the bot that follows policy P, where P is the one with the best causal consequences at the time of replacement. This is different from CDT: if Omega scans P-bot the next day, P-bot will win the Transparent Newcomb’s problem, whereas if CDT hadn’t self-modified to be P-bot and Omega had scanned CDT tomorrow, CDT would fail the TNP for the usual reason. So CDT is in conflict with its future self.
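A rough sketch of that argument, under the simplifying assumptions that Omega’s prediction is exact and is made by scanning whatever agent exists at scan time (the function names are mine):

```python
# Plain CDT at action time vs. CDT choosing once over complete policies.

def plain_cdt_action():
    # Deciding at action time: taking both boxes causally dominates.
    return "two-box"

def policy_chosen_by_cdt_at_replacement(scan_is_after_replacement):
    # If the scan will see the replacement (P-bot), committing to one-boxing
    # causes the boxes to be filled, so that policy has the best causal
    # consequences at the time of replacement.
    if scan_is_after_replacement:
        return "one-box"
    # If the scan already happened, the commitment has no causal effect,
    # so the chosen policy still two-boxes (cf. the reply below).
    return "two-box"

print(plain_cdt_action())                          # two-box: fails TNP
print(policy_chosen_by_cdt_at_replacement(True))   # one-box: P-bot wins TNP
print(policy_chosen_by_cdt_at_replacement(False))  # two-box
```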
Two UDT agents actually can potentially defect in prisoner’s dilemma. See the Agent Simulates Predictor problem if you’re interested.
But I think you’re right that agents would generally modify themselves to more closely resemble UDT. Note, though, that the decision theory a CDT agent would modify itself to use wouldn’t exactly be UDT. For example, suppose the causal decision theory agent had its output predicted by Omega for Newcomb’s problem before the agent even came into existence. Then by the time the CDT agent comes into existence, modifying itself to use UDT would have no causal impact on the content of the boxes. So it wouldn’t adopt UDT in this situation and would still two-box.
Well, the way the agent loses in ASP is by failing to be updateless about certain logical facts (what the predictor predicts). So from this perspective, it’s a SemiUDT that does update whenever it learns logical facts, and this explains why it defects.
> So it wouldn’t adopt UDT in this situation and would still two-box.
True, it’s always [updateless, on everything after now].