What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference? The problem is that you-on-reflection is not immediately available: it takes planning and action to compute it, planning and action taken without the benefit of its guidance, and thus by default catastrophically misaligned. So an AI with some decision theory but without further connection to human values might win the capability race by reassembling literally everything into the computer needed to answer the question of whether doing that was good. (It’s still extremely unclear how to get even to that point, whatever decision theory is involved. In particular, this assumes that we can define you-on-reflection, and thus that we can define you, which is uploading. And what is “preference”, specifically, so that it can be the result of a computation, usable by an agent in the real world?)
The way an AI thinks about the world is also the way it might think about predictions of what you-on-reflection says, in order to get a sense of what to do in advance of having computed the results more precisely (and computing them precisely is probably pointless if a useful kind of prediction is possible). So the practical point of decision theory is deconfusion, figuring out how to accomplish things without resorting to an all-devouring black box.
What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference?
On reflection, there probably is not much difference. This is a good point. Still, an AI that just computes what you would want it to do, for example with approval-based AI or mimicry, also seems like a useful way of getting around specifying a decision theory. I haven’t seen much discussion of the issues with this approach, so I’m interested in what problems could occur, if any, that using the right decision theory could solve.
The problem is that you-on-reflection is not immediately available: it takes planning and action to compute it, planning and action taken without the benefit of its guidance, and thus by default catastrophically misaligned.
True. Note, though, that you-on-reflection is not immediately available to an AI with the correct decision theory, either. Whether your AI uses the right or the wrong decision theory, it still takes effort to figure out what you-on-reflection would want. I don’t see how this is a bigger problem for agents with primitive decision theories, though.
One way to try to deal with this is to have your AI learn a reasonably accurate model of you-on-reflection before it becomes dangerously intelligent, so that, once it does become superintelligent, it will (hopefully) behave reasonably. And again, this works with both a primitive and a sophisticated decision theory.
So the practical point of decision theory is deconfusion
Okay. I’m having a hard time thinking concretely about how getting less confused about decision theory would help us, but I intuitively imagine it could help somehow. Do you know, more concretely, what the benefits of this deconfusion would be?
I’m having a hard time thinking concretely about how getting less confused about decision theory would help us, but I intuitively imagine it could help somehow. Do you know, more concretely, what the benefits of this deconfusion would be?
Rob Bensinger just posted a good summary, with references, of pragmatic motivations for working on things like decision theory. More generally, it’s a methodology for AI design, the way science is a methodology for engineering, a source of desiderata for what’s important for various purposes. The activity of developing decision theories is itself like the thought experiments it uses, or like the apparatus of experimental physics: a way of isolating some consideration from other confusing aspects and magnifying its effects to see more clearly. This teaches lessons that may eventually be used in the separate activity of engineering better devices.
What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference?
On reflection, there probably is not much difference.
Well, there is a huge difference; it’s just not in how the decisions of you-on-reflection get processed by some decision theory vs. repeated without change. The setup of you-on-reflection can be thought of as an algorithm, and the decisions or declared preferences are the results of its computation. Computation of an abstract algorithm doesn’t automatically get to affect the real world, as it may fail to actually get carried out, so it has to be channeled by a process that does take place there. And for the purpose of channeling your decisions, a program that just runs your algorithm is no good: it won’t survive AI x-risks (from other AIs, assuming the risks are not resolved), and so won’t get to channel your decisions. On the other hand, a program that runs a sufficiently sane decision theory might be able to survive (including by destroying everything else potentially dangerous to its survival) and eventually get around to computing your decision and affecting the world with it.
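To make the contrast concrete, here is a minimal runnable sketch (a toy illustration only, not anything proposed in this exchange). Every name in it, such as you_on_reflection and cheap_predictor, is a hypothetical stand-in; in the real setting the reflective computation is enormously expensive, which is the whole problem.

```python
from typing import Callable

def you_on_reflection(question: str) -> str:
    """Toy stand-in for the reflective algorithm's verdict (astronomically slow in reality)."""
    return "act cautiously"

def run_your_algorithm_directly(question: str) -> str:
    # No indirection: the world only gets your decision if this exact
    # computation survives long enough to finish and to be carried out.
    return you_on_reflection(question)

def channel_through_preference(question: str,
                               predict: Callable[[str], str]) -> str:
    # Indirection through preference: the agent treats the full reflective
    # computation as the *definition* of what it is approximating, but can
    # act now on a cheap prediction of its verdict (e.g. in order to survive).
    return predict(question)

cheap_predictor: Callable[[str], str] = lambda q: "act cautiously (predicted)"
print(run_your_algorithm_directly("what should be done?"))
print(channel_through_preference("what should be done?", cheap_predictor))
```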
When discussing the idea of a program implementing what you-on-reflection would do, I think we had different ideas in mind. What I meant was that every action the AI takes would be its best approximation of what you-on-reflection would want. This doesn’t sound dangerous to me. I think that approval-based AI and iterated amplification with HCH would be two ways of making approximations to the output of you-on-reflection. And I don’t think they’re unworkably dangerous.
If the AI is instead allowed to take arbitrarily many unaligned actions before taking the actions you’d recommend, then you are right that that would be very dangerous. I think this was the idea you had in mind, but feel free to correct me.
If we did misunderstand each other, I apologize. If not, then is there something I’m missing? I would think that a program that faithfully outputs some approximation of “what I’d want on reflection” for every action it takes would not perform devastatingly badly. I-on-reflection wouldn’t want the world destroyed, so I don’t think it would take actions that would destroy it.
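As a concrete rendering of this setup, here is a small runnable sketch (a toy illustration, not a real proposal or API) in the spirit of approval-directed agents: each individual action is chosen to maximize a predicted approval score standing in for what you-on-reflection would want, with the learned model reduced here to a hard-coded stub.

```python
from typing import Dict, List

def predicted_approval(action: str) -> float:
    """Stub for a learned model of how much you-on-reflection would approve of an action."""
    scores: Dict[str, float] = {
        "ask for clarification": 0.8,
        "take a small reversible step": 0.7,
        "seize all available resources": 0.0,
    }
    return scores.get(action, 0.0)

def choose_action(available: List[str]) -> str:
    # Each action the agent takes is its current best approximation of what
    # you-on-reflection would want, rather than the output of a long-horizon
    # planner optimizing an explicit utility function.
    return max(available, key=predicted_approval)

print(choose_action(["ask for clarification",
                     "take a small reversible step",
                     "seize all available resources"]))
```

Whether something like this is safe presumably turns on how good the approval model is, which is exactly the part the sketch leaves as a stub.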