Broadly agree with this post. Couple of small things:
Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they’d-label-goodness and that-which-they’d-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-good-looking cases, and see that they were labeled “good”, and roll with that.
I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.
The nearby thing I do agree with is that it’s difficult to “confirm that this exactly-correct concept occurs in its mental precommitment in the requisite way”. (It’s not totally clear to me that we need to get the concept exactly correct, depending on how natural niceness (in the sense of “giving other agents what they want”) is; but I’ll discuss that in more detail on your other post directly about niceness, if I have time.)
Insofar as I have hope in decision theory leading us to have nice things, it mostly comes via the possibility that a fully-fleshed-out version of UDT would recommend updating “all the way back” to a point where there’s uncertainty about which agent you are. (I haven’t thought about this much and this could be crazy.)
For those who haven’t read it, I like this related passage from Paul which gets at a similar idea:

Overall I think the decision between EDT and UDT is difficult. Of course, it’s obvious that you should commit to using something-like-UDT going forward if you can, and so I have no doubts about evaluating decisions from something like my epistemic state in 2012. But it’s not at all obvious whether I should go further than that, or how much. Should I go back to 2011 when I was just starting to think about these arguments? Should I go back to some suitable idealization of my first coherent epistemic state? Should I go back to a position where I’m mostly ignorant about the content of my values? A state where I’m ignorant about basic arithmetic facts?
Insofar as I have hope in decision theory leading us to have nice things, it mostly comes via the possibility that a fully-fleshed-out version of UDT would recommend updating “all the way back” to a point where there’s uncertainty about which agent you are. (I haven’t thought about this much and this could be crazy.)
This was surprising to me. For one, that seems like way too much updatelessness. Do you have in mind an agent self-modifying into something like that? If so, when and why? Plausibly this would be after the point of the agent knowing whether it is aligned or not; and I guess I don’t see why there is an incentive for the agent to go updateless with respect to its values at that point in time.[1] At the very least, I think this would not be an instance of all-upside updatelessness.
Secondly, what you are pointing towards seems to be a very specific type of acausal trade. As far as I know, there are three types of acausal trade that are not based on inducing self-locating uncertainty (definitely might be neglecting something here): (1) mutual simulation stuff; (2) ECL; and (3) the thing you describe where agents are being so updateless that they don’t know what they value. (You can of course imagine combinations of these.) So it seems like you are claiming that (3) is more likely and more important than (1) and (2). (Is that right?)
I think I disagree:
The third type seems to be a possibility only under an extreme form of UDT, whilst the other two are possible, albeit to varying degrees, under (updateless or updateful) EDT, TDT, some variants of (updateful or updateless) CDT (e.g. Spohn’s variant of CDT), and a variety of UDTs (including much less extreme ones than the aforementioned variant), and probably a bunch of other theories that I don’t know about or am forgetting.
And I think that the type of updatelessness you describe seems particularly unlikely, per my first paragraph.
It is not necessarily the case that your expectation of what you will value in the future, when standing behind the veil of ignorance, nicely maps onto the actual distribution of values in the multiverse (meaning the trade you end up doing will be limited; see the toy sketch at the end of this comment).
(I guess you can have the third type of trade, and not the first and second one, under standard CDT coupled with the strong updatelessness you describe; which is a point in favour of the claim I think you are making—although this seems fairly weak.)
One argument might be that your decision to go updateless in that way is correlated with the choice of the other agents to go updateless in the same way, and then you get the gains from trade by both being uncertain about what you value. But if you are already sufficiently correlated, and taking this into account for the purposes of decision-making, it is not clear to me why you are not just doing ECL directly.
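To make the veil-of-ignorance point in the third bullet a bit more concrete, here is a toy sketch with made-up ingredients: two candidate value systems the agent might turn out to have, a square-root utility as a stand-in for diminishing returns, and a grid search over the resource splits the agent could commit to from behind the veil.

```python
import numpy as np

def best_committed_split(prior_human: float, grid: int = 10_001) -> float:
    """From behind the veil, pick the resource share x given to human-ish
    values that maximizes prior-weighted expected utility; sqrt(.) is a
    stand-in for diminishing returns."""
    x = np.linspace(0.0, 1.0, grid)
    expected_u = prior_human * np.sqrt(x) + (1.0 - prior_human) * np.sqrt(1.0 - x)
    return float(x[np.argmax(expected_u)])

for prior in (0.5, 0.1, 0.01):
    share = best_committed_split(prior)
    print(f"prior on 'I turn out to value human-ish things' = {prior}: "
          f"committed human share ≈ {share:.4f}")

# With a 50/50 prior the committed policy splits resources evenly, and both
# possible agents prefer this ex ante to "whoever wins grabs everything"
# (sqrt(0.5) ≈ 0.71 > 0.5). But if the veil-prior barely overlaps with
# human-like values, the committed share collapses towards zero: the trade
# is real but limited, which is the worry in the third bullet.
```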
I’m not claiming this (again, it’s about relative not absolute likelihood).
I’m confused. I was comparing the likelihood of (3) to the likelihood of (1) and (2); i.e. saying something about relative likelihood, no?
Yepp, I agree. I’m not saying this is likely, just that this is the most plausible path I see by which UDT leads to nice things for us.
I meant for my main argument to be directed at the claim of relative likelihood; sorry if that was not clear. So I guess my question is: do you think the updatelessness-based trade you described is the most plausible type of acausal trade out of the three that I listed? As said, ECL and simulation-based trade arguably require much fewer assumptions about decision theory. To get ECL off the ground, for example, you arguably just need your decision theory to cooperate in the Twin PD, and many theories satisfy this criterion.
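To make the Twin PD criterion concrete, here is a minimal sketch with made-up payoffs; the correlation parameter is just a stand-in for how strongly you think the twin’s decision tracks yours.

```python
# Toy Twin Prisoner's Dilemma with made-up payoffs:
# both cooperate -> 3 each, both defect -> 1 each,
# lone defector -> 4, lone cooperator -> 0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}

def expected_payoff(my_action: str, correlation: float) -> float:
    """Expected payoff if the twin plays my action with probability
    `correlation` and the other action otherwise (the dependence is
    evidential/logical, not causal)."""
    other = "D" if my_action == "C" else "C"
    return (correlation * PAYOFF[(my_action, my_action)]
            + (1 - correlation) * PAYOFF[(my_action, other)])

for corr in (1.0, 0.9, 0.5):
    ev_c, ev_d = expected_payoff("C", corr), expected_payoff("D", corr)
    verdict = "cooperate" if ev_c > ev_d else "defect"
    print(f"correlation {corr:.1f}: EV(C)={ev_c:.2f}, EV(D)={ev_d:.2f} -> {verdict}")

# A calculation that holds the twin's action fixed (CDT-style) always prefers
# defecting, since 4 > 3 and 1 > 0; taking the correlation into account flips
# this to cooperation once the correlation is high enough, which is roughly
# all that the ECL argument needs from the decision theory.
```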
(And the topic of this post is how decision theory leads us to have nice things, not UDT specifically. Or at least I think it should be; I don’t think one ought to be so confident that UDT/FDT is clearly the “correct” theory [not saying this is what you believe], especially given how underdeveloped it is compared to the alternatives.)
I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.
By “were the humans pointing me towards...” Nate is not asking “did the humans intend to point me towards...” but rather “did the humans actually point me towards...” That is, we’re assuming some classifier or learning function that acts upon the data actually input, rather than a successful, actual, fully-aligned, works-in-real-life DWIM which arrives at the correct answer given wrong data.
I agree that we’ll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as “reflecting back on that data” in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).
The cogitation here is implicitly hypothesizing an AI that’s explicitly considering the data and trying to compress it, having been successfully anchored on that data’s compression as identifying an ideal utility function. You’re welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn’t arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.
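A toy sketch of the underlying point about the data, assuming a made-up generating process in which the labels can only track how things look to the labelers (so the deceptive-but-bad cases really are filed under “good”), and where “learning” is just compressing the labels into a rule:

```python
import random

random.seed(0)

def make_example():
    """Hypothetical toy case: each example has a surface appearance and a
    hidden fact about whether it is actually good. Deceptive examples look
    good but are bad; the human label only tracks appearance."""
    actually_good = random.random() < 0.5
    deceptive = (not actually_good) and random.random() < 0.2
    looks_good = actually_good or deceptive
    label = "good" if looks_good else "bad"   # what the labelers can see
    return looks_good, actually_good, label

data = [make_example() for _ in range(10_000)]

# The "learning" here is just empirical label frequencies per appearance:
def frac_labeled_good(appearance: bool) -> float:
    labels = [lab for looks, _, lab in data if looks == appearance]
    return sum(lab == "good" for lab in labels) / len(labels)

print("P(label=good | looks good) =", frac_labeled_good(True))    # 1.0
print("P(label=good | looks bad)  =", frac_labeled_good(False))   # 0.0

deceptive_labels = [lab for looks, good, lab in data if looks and not good]
print("deceptive-but-bad cases labeled 'good':",
      sum(lab == "good" for lab in deceptive_labels), "/", len(deceptive_labels))

# Any learner that compresses this dataset well recovers the rule
# "good = that-which-gets-labeled-good" (i.e. looks good), because the
# deceptive cases were in fact labeled "good" in the training data.
```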
Insofar as I have hope… fully-fleshed-out version of UDT would recommend… uncertainty about which agent you are. (I haven’t thought about this much and this could be crazy.)
For the record, I have a convergently similar intuition: FDT removes the Cartesian specialness of the ego at the decision nodes (by framing each decision as a mere logical consequence of an agent-neutral nonphysical fact about FDT itself), but retains the Cartesian specialness of the ego at the utility node(s). I’ve thought about this for O(10 hours), and I also believe it could be crazy, but it does align quite well with the conclusions of Compassionate Moral Realism.
That being said, from an orthogonality perspective, I don’t have any intuition (let alone reasoning) that says that this compassionate breed of LDT is necessary for any particular level of universe-remaking power, including the level needed for a decisive strategic advantage over the rest of Earth’s biosphere. If being a compassionate-LDT agent confers advantages over standard-FDT agents from a Darwinian selection perspective, it would have to be via group selection, but our default trajectory is to end up with a singleton, in which case standard-FDT might be reflectively stable. Perhaps eventually some causal or acausal interaction with non-earth-originating superintelligence would prompt a shift, but, as Nate says,
But that’s not us trading with the AI; that’s us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.
So, if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-shard and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than hope that an AGI would likely convergently become that way before taking over the world.
if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-shard and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than hope that an AGI would likely convergently become that way before taking over the world.
I weakly disagree here, mainly because Nate’s argument for very high levels of risk goes through strong generalization/a “sharp left turn” towards being much more coherent + goal-directed. So I find it hard to evaluate whether, if LDT does converge towards compassion, the sharp left turn would get far enough to reach it (although the fact that humans are fairly close to having universe-remaking power without having any form of compassionate LDT is of course a strong argument weighing the other way).
(Also FWIW I feel very skeptical of the “compassionate moral realism” book, based on your link.)
I’m confused by the claim that humans do not have compassionate LDT. It seems to me that a great many humans learn a significant approximation to compassionate LDT. However, it doesn’t seem to be built in by default, and it probably mostly comes from the training data.