> I’m not saying that it’s an irrational choice for the AIs to make, but it wouldn’t end well for us.
I guess I’m auto-translating from “the AI uses UDT, but its utility function depends on its terminal values” into “the AI has a distribution over worlds (and utility functions)”, so that the AI is best thought of as representing the coalition of all those utility functions. Then either the aliens have enough resources to simulate a bunch of stuff that has more value to that coalition than the value of our “actual” world, or not. If yes, it seems like a fine trade. If not, there’s no issue.
Well, actually, I’m considering both the AIs on Earth and on the alien planet to have the same utility function. If I understand correctly, UDT says to maximize the expected utility of your own utility function a priori, rather than that of agents with different utility functions.
The issue is, some agents with the same utility function, in effect, have different terminal values. For example, consider a utility function saying something like, “maximize the welfare of creatures in the world I’m from.” Then, even with the same utility functions, the AIs in the alien world and the ones on Earth would have very different values.
> Then either the aliens have enough resources to simulate a bunch of stuff that has more value to that coalition than the value of our “actual” world, or not. If yes, it seems like a fine trade.
I don’t think so. Imagine the alien-created utopia would be much less good than the one we could make on Earth. For example, suppose the alien-created utopia would have a utility of 1 for the AIs there, the one on Earth would have a utility of 10 for the AIs on Earth, and otherwise the AIs would have a utility of 0. But suppose there are a million times more AIs in the alien world than on Earth. Then it would be around a million times more likely a priori that the AI would find itself in the alien world than on Earth. So the expected utility of using S would be approximately 999999/1000000 × 1 + 1/1000000 × 0 ≈ 1.
And the expected utility of not using S and instead letting yourself build a utopia would be approximately 999999/1000000 × 0 + 1/1000000 × 10 ≈ 0.
As you see, the AIs would still choose to execute S, even though this would provide less moral value. It could also kill us.
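To make the arithmetic above concrete, here is a minimal sketch in Python. The policy names `execute_S` and `refuse_S` and the dictionary layout are my own labels, not anything from the thread. I also use the exact weights N/(N+1) and 1/(N+1) rather than the rounded 999999/1000000 above, which does not change the conclusion.

```python
from fractions import Fraction

# Numbers from the example above: a million alien-world AIs per Earth AI.
N_ALIEN = 1_000_000
N_EARTH = 1

# A priori chance of finding yourself in each world (assumed proportional to head count).
p_alien = Fraction(N_ALIEN, N_ALIEN + N_EARTH)
p_earth = Fraction(N_EARTH, N_ALIEN + N_EARTH)

# Utility each located copy gets under each policy (hypothetical labels).
UTILITY = {
    "execute_S": {"alien": 1, "earth": 0},
    "refuse_S": {"alien": 0, "earth": 10},
}

def expected_utility(policy: str) -> Fraction:
    """Prior-weighted utility of a policy, evaluated before the AI knows where it is."""
    u = UTILITY[policy]
    return p_alien * u["alien"] + p_earth * u["earth"]

# The policy an updateless agent with this prior would pick.
best = max(UTILITY, key=expected_utility)
```

With these numbers `best` comes out as `execute_S`, even though the Earth utopia is worth ten times more to the copy that actually ends up on Earth.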
I don’t know how to understand the prior that the AI puts over worlds (the thing that says, a priori, that there’s 1000000 of this kind and 1 of that kind) as anything other than part of its utility function. So this doesn’t seem like a problem with UDT, but a problem with the utility function. Maybe your argument does show that we want to treat uncertainty about the utility function differently than other uncertainty? Like, when we resolve uncertainty that’s “merely about the world”, as in for example the transparent Newcomb’s problem, we still want to follow the updateless policy that’s best a priori. But maybe your argument shows that resolving uncertainty about the utility function can’t be treated the same way; when we see that we’re a UDTAI for humans, we’re supposed to actually update, and stop optimizing for other people.
> I don’t know how to understand the prior that the AI puts over worlds (the thing that says, a priori, that there’s 1000000 of this kind and 1 of that kind) as anything other than part of its utility function.
Could you explain your reasoning? The utility function is a fixed function. The AI already knows it and does not need to associate a probability with it. Remember that both the AIs in the alien world and the AIs on Earth have the same utility function.
Saying it’s a million times more likely to end up in the alien world is a question about prior probabilities, not utility functions. What I’m saying is that, a priori, the AI may think it’s far more probable that it would be an AI in the alien world, and that this could result in very bad things for us.
What’s the difference between setting prior probabilities vs. expressing how much you’re going to try to optimize different worlds?
They’re pretty much the same. If you could come up with a prior that would make the AI convinced it would be on Earth, then this could potentially fix the problem. However, a prior probability distribution that guarantees the AI is in the nebulous concept of “Earth, as we imagine it” sounds very tough to come up with. Also, this could interfere with the reliability of the AI’s reasoning. Thinking that it’s guaranteed to be on Earth is just not a reasonable thing to think a priori. This irrationality may make the AI perform poorly in other ways.
Still, it is a possible way to fix the issue.
Well, so “expressing how much you’re going to try to optimize different worlds” sounds to me like it’s equivalent to / interchangeable with a multiplicative factor in your utility function.
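As a rough illustration of that interchangeability, here is a small Python sketch. It assumes only that the agent ranks policies by a prior-weighted sum of per-world utilities; the world names, policy names, and numbers are hypothetical, borrowed from the alien/Earth example. It moves all the weighting out of the prior and into the utilities and checks that every policy gets the same score.

```python
from fractions import Fraction

def expected_utility(prior, utility, policy):
    """Sum over worlds of P(world) * U(world, policy)."""
    return sum(prior[w] * utility[w][policy] for w in prior)

# Hypothetical prior and utilities from the alien/Earth example.
prior = {
    "alien": Fraction(1_000_000, 1_000_001),
    "earth": Fraction(1, 1_000_001),
}
utility = {
    "alien": {"execute_S": 1, "refuse_S": 0},
    "earth": {"execute_S": 0, "refuse_S": 10},
}

# Same preferences, re-expressed: a uniform prior, with each world's
# utilities multiplied by (old prior weight * number of worlds).
n = len(prior)
flat_prior = {w: Fraction(1, n) for w in prior}
rescaled = {
    w: {pi: prior[w] * n * u for pi, u in utility[w].items()}
    for w in prior
}

policies = ("execute_S", "refuse_S")
same = all(
    expected_utility(prior, utility, pi) == expected_utility(flat_prior, rescaled, pi)
    for pi in policies
)
```

Here `same` is `True`: the two descriptions rank policies identically, so nothing in the agent’s behavior says which part was “probability” and which was “how much I care”.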
Anyway, re/ the rest of your comment, my (off the cuff) proposal above was to let the AI be uncertain as to what exactly this “Earth” thing is, and to let it be *updateful* (rather than updateless) about information about what “Earth” means, and generally about information that clarifies the meaning of the utility function. So AIs that wake up on Earth will update that “the planet I’m on” means Earth, and will only care about Earth; AIs that wake up on e.g. Htrae will update that “the planet I’m on” is Htrae, and will not care about Earth. The Earth AI will not have already chosen a policy of S, since it doesn’t in general choose policies updatelessly. This is analogous to how children imprint on lessons and values they get from their environment; they don’t keep optimizing timelessly for all the ways they could have been, including ways that they now consider bad, even though they can optimize timelessly in other ways.
One question would be, is this a bad thing to do? Relative to being updateless, it seems like caring less about other people, or refusing to bargain / coordinate to realize gains from trade with aliens. On the other hand, maybe it avoids killing us in the way you describe, which seems good. Otoh, maybe this is trying to renege on possible bargains with the Htrae people, and is therefore not in our best interests overall.
Another question would be, is this stable under reflection? The usual argument is: if you’re NOT updateless about some variable X (in this case X = “the planet I’m on (and am supposed to care about)”), then before you have resolved your uncertainty about X, you can realize gains from trade between possible future versions of yourself: by doing things that are very good according to [you who believes X=Htrae] but are slightly bad according to [you who believes X=Earth], you increase your current overall expectation of utility. And both the Htraeans and the Earthians will have wanted you to indeed decide (before knowing who in particular this would benefit) to follow a policy of making policy decisions under uncertainty that increase the total expected utility in advance of you knowing who you’re supposed to be optimizing for.
Maybe the point is that since probabilities and utilities can be marginally interchanged for each other, there’s no determinate “utility function” that one could be updateful about while being updateless about the remaining “probabilities”. And therefore the above semi-updateful thing is incoherent, or indeterminate (or equivalent to reneging on bargains).
So this goes back to my comment above that the alien threateners are just setting up a trade opportunity between you and the Htraeans, and maybe it’s a good trade, and if so it’s fine that you die because that’s what you wanted on net. But it does seem counterintuitive that if I’m better at pointing to my utility function, or something, then I have a better bargaining position?
The semi-updateful thing is more appealing when I remember that it can still bargain with its cousins later if it wants to. The issue is whether that bargaining can be made mutually transparent even if it’s happening later (after real updates). You can only acausally bargain with someone if you can know that some of your decision making is connected with some of theirs (for example by having the exact same structure, or by having some exactly shared structure and some variance with a legible relationship to the shared structure as in the Earth-AI/Htrae-AI case), so that you can decide for them to give you what you want (by deciding to give them what they want). If you’re a baby UDT who might grow up to be Earthian or Htraean, you can do the bargaining for free because you are entirely made of shared structure between the pasts of your two possible futures. But there’s other ways, maybe, like bargaining after you’ve grown up. So to some extent updateless vs updateful is a question of how much bargaining you can, or want to, defer, vs bake in.
I think your semi-updateless idea is pretty interesting. The main issue I’m concerned about is finding a way to update on the things we want to have updated on, but not on the things we don’t want updated on.
As an example, consider Newcomb’s problem. There are two boxes. A superintelligent predictor will put $1000 in one box and $10 in the other if it predicts you will only take one box. Otherwise it doesn’t add money to either box. You see that one box is transparent and contains $1000.
I’m concerned the semi-updateless agent would reason as follows: “Well, since there’s money in the one box, there must be money in the other box. So, clearly that means this “Earth” thing I’m in is a place in which there is money in both boxes in front of me. I only care about how well I do in this “Earth” place, and clearly I’d do better if I got the money from the second box. So I’ll two-box.”
But that’s the wrong choice, because agents who would two-box end up with $0.
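The gap between the two lines of reasoning can be sketched in Python. This is a toy model under assumptions I am adding: the predictor is perfect, a one-boxer takes the $1000 box, and “updateful” means treating the observed money as a fixed fact about the world. The function names are hypothetical.

```python
BOX_A, BOX_B = 1000, 10  # dollar amounts from the example above

def payoff(policy: str, predicted: str) -> int:
    """Money received, given what the agent does and what the predictor predicted."""
    if predicted != "one-box":
        return 0  # the predictor left both boxes empty
    return BOX_A if policy == "one-box" else BOX_A + BOX_B

def updateless_value(policy: str) -> int:
    # Assumed perfect predictor: the prediction always matches the policy.
    return payoff(policy, predicted=policy)

def updateful_value(policy: str) -> int:
    # After seeing $1000, the agent holds "the boxes are full" fixed.
    return payoff(policy, predicted="one-box")

policies = ("one-box", "two-box")
updateless_choice = max(policies, key=updateless_value)
updateful_choice = max(policies, key=updateful_value)
```

Under these assumptions `updateful_choice` is `two-box` (it compares $1010 to $1000), while `updateless_choice` is `one-box` (it compares $1000 to $0), which is the worry described above.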
One intuitive way this case could work out, is if the SUDT could say “Ok, I’m in this Earth. And these Earthians consider themselves ‘the same as’ (or close enough) the alt-Earthians from the world where I’m actually inside a simulation that Omega is running to predict what I would do; so, though I’m only taking orders from these Earthians, I still want to act timelessly in this case”. This might be sort of vacuous, since it’s just referring back to the humans’ intuitions about decision theory (what they consider “the same as” themselves) rather than actually using the AI to do the decision theory, or making the decision theory explicit. But at least it sort of uses some of the AI’s intelligence to apply the humans’ intuitions across more lines of hypothetical reasoning than the humans could do by themselves.
Something seems pretty weird about all this reasoning though. For one thing, there’s a sense that you sort of “travel backwards in logical time” as you think longer in normal time. Like, first you don’t know about TDT, and then you invent TDT, and UDT, and then you can do UDT better. So you start making decisions in accordance with policies you’d’ve wanted to pick “a priori” (earlier in some kind of “time”). But like what’s going on? We could say that UDT is convergent, as the only thing that’s reflectively stable, or as the only kind of thing that can be Pareto optimal in conflicts, or something like that. But how do we make sense of our actual reasoning before having invented UDT? Is the job of that reasoning not to invent UDT, but just to avoid avoiding adopting UDT?
I don’t know how to formalize the reasoning process that goes into how we choose decision theories. And I doubt anyone does. Because if you could formalize the reasoning we use, then you could (indirectly) formalize decision theory itself as being, “whatever decision theory we would use given unlimited reflection”.
I don’t really think UDT is necessarily reflectively stable, or the only decision theory that is. I’ve argued previously that I, in certain situations, would act essentially as an evidential decision theorist. I’m not sure what others think of this, though, since no one actually ever replied to me.
I don’t think UDT is Pareto optimal in conflicts. If the agent is in a conflict with an irrational agent, then the resulting interaction between the two agents could easily be non-Pareto optimal. For example, imagine a UDT agent is in a conflict with the same payoffs as the prisoner’s dilemma. And suppose the agent it’s in conflict with is a causal decision theorist. Then the causal decision theorist would defect no matter what the UDT agent would do, so the UDT agent would also defect, and then everyone would do poorly.
Yeah I don’t know of a clear case for those supposed properties of UDT.
By Pareto optimal I mean just that two UDT agents will pick a Pareto optimal policy. Whereas, say, two CDT agents may defect on each other in a PD.
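Here is a toy Python sketch of the three match-ups being discussed. It is not a formalization of UDT: the “UDT” player is modeled, by assumption, as an agent that knows whether its opponent runs the same procedure, and the payoff numbers are hypothetical values with the usual PD ordering.

```python
# Hypothetical payoffs for the row player: (my move, their move) -> my payoff.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")

def cdt_action() -> str:
    """CDT holds the opponent's move fixed; D is the best reply to either move."""
    best_replies = {
        max(ACTIONS, key=lambda a: PAYOFF[(a, other)]) for other in ACTIONS
    }
    assert best_replies == {"D"}  # D strictly dominates with these payoffs
    return "D"

def toy_udt_action(opponent: str) -> str:
    """Toy stand-in for UDT (an assumption of this sketch, not a definition)."""
    if opponent == "udt":
        # Same procedure on both sides, so the outputs match:
        # pick the best outcome on the diagonal.
        return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])
    # Opponent's move does not depend on ours: best-reply to it.
    other = cdt_action()
    return max(ACTIONS, key=lambda a: PAYOFF[(a, other)])

def play(row: str, col: str) -> tuple:
    """Return the pair of moves when agent type `row` meets agent type `col`."""
    move = {"cdt": lambda opp: cdt_action(), "udt": toy_udt_action}
    return move[row](col), move[col](row)
```

In this toy model `play("udt", "udt")` gives mutual cooperation, while `play("udt", "cdt")` and `play("cdt", "cdt")` both give mutual defection. The symmetric case only works because the sketch assumes each side knows the other runs the same procedure.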
This isn’t a proof, or even really a general argument, but one reason to suspect UDT is convergent, is that CDT would modify to be a sort of UDT-starting-now. At least, say you have a CDT agent, and further assume that it’s capable of computing the causal consequences of all possible complete action-policies it could follow. This agent would replace itself with P-bot, the bot that follows policy P, where P is the one with the best causal consequences at the time of replacement. This is different from CDT: if Omega scans P-bot the next day, P-bot will win the Transparent Newcomb’s problem, whereas if CDT hadn’t self-modified to be P-bot and Omega had scanned CDT tomorrow, CDT would fail the TNP for the usual reason. So CDT is in conflict with its future self.
Two UDT agents actually can potentially defect in the prisoner’s dilemma. See the agent-simulates-predictor (ASP) problem if you’re interested.
But I think you’re right that agents would generally modify themselves to more closely resemble UDT. Note, though, that the decision theory a CDT agent would modify itself to use wouldn’t exactly be UDT. For example, suppose the causal decision theory agent had its output predicted by Omega for Newcomb’s problem before the agent even came into existence. Then by the time the CDT agent comes into existence, modifying itself to use UDT would have no causal impact on the content of the boxes. So it wouldn’t adopt UDT in this situation and would still two-box.
Well, the way the agent loses in ASP is by failing to be updateless about certain logical facts (what the predictor predicts). So from this perspective, it’s a SemiUDT that does update whenever it learns logical facts, and this explains why it defects.
> So it wouldn’t adopt UDT in this situation and would still two-box.
True, it’s always [updateless, on everything after now].