What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference?
On reflection, there probably is not much difference. This is a good point. Still, an AI that just computes what you would want it to do, for example with approval-based AI or mimicry, also seems like a useful way of getting around specifying a decision theory. I haven’t seen much discussion of the issues with this approach, so I’m interested in what problems could occur that using the right decision theory could solve, if any.
The problem is that you-on-reflection is not immediately available: it takes planning and action to compute it, and that planning and action are taken without the benefit of its guidance, and are thus by default catastrophically misaligned.
True. Note, though, that you-on-reflection is not immediately available to an AI with the correct decision theory, either. Whether your AI uses the right or wrong decision theory, it still takes effort to figure out what you-on-reflection would want. I don’t see how this is a bigger problem for agents with primitive decision theories, though.
One way to try to deal with this is to have your AI learn a reasonably accurate model of you-on-reflection before it becomes dangerously intelligent, so that, once it does become superintelligent, it will (hopefully) behave reasonably. And again, this works with both a primitive and a sophisticated decision theory.
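To caricature the “learn the model first, use it later” scheme above: everything in this sketch (the function names, the memorized-answers stand-in for a learned model, the conservative default) is an illustrative assumption, not any actual proposal.

```python
# Phase 1 (while the AI is still weak): record the overseer's answers on queries.
# Phase 2 (once it is capable): act from the frozen learned model instead of
# live queries. The "model" here is deliberately trivial: memorized answers.

def learn_overseer_model(labeled_examples):
    """labeled_examples: list of (situation, overseer_choice) pairs."""
    return dict(labeled_examples)

def act(model, situation, safe_default="defer"):
    # Off-distribution situations fall back to a conservative default.
    # Getting this fallback right is the hard part in practice; it is
    # waved away here.
    return model.get(situation, safe_default)

model = learn_overseer_model([("user asks for help", "help"),
                              ("resource conflict", "defer")])
print(act(model, "user asks for help"))  # -> help
print(act(model, "novel situation"))     # -> defer
```

The point of the sketch is only the two-phase structure: the model of you-on-reflection is fixed before the capability gain, so its quality at that moment bounds how well the later actions track what you would actually want.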
So the practical point of decision theory is deconfusion.
Okay. I’m having a hard time thinking concretely about how getting less confused about decision theory would help us, but I intuitively imagine it could help somehow. Do you know, more concretely, what the benefits of this deconfusion would be?
It’s a methodology for AI design, the way science is a methodology for engineering: a source of desiderata for what’s important for various purposes. The activity of developing decision theories is itself like the thought experiments it uses, or like the apparatus of experimental physics: a way of isolating some consideration from other confusing aspects and magnifying its effects to see them more clearly. This teaches lessons that may eventually be used in the separate activity of engineering better devices.
On reflection, there probably is not much difference.
Well, there is a huge difference, and it’s not in whether the decisions of you-on-reflection get processed by some decision theory or just repeated without change. The setup of you-on-reflection can be thought of as an algorithm, and the decisions or declared preferences are the results of its computation. Computation of an abstract algorithm doesn’t automatically get to affect the real world, since it may fail to actually get carried out, so it has to be channeled by a process that takes place there. And for the purpose of channeling your decisions, a program that just runs your algorithm is no good: it won’t survive AI x-risks (from other AIs, assuming the risks are not resolved), and so it won’t get to channel your decisions. On the other hand, a program that runs a sufficiently sane decision theory might be able to survive (including by destroying everything else potentially dangerous to its survival) and eventually get around to computing your decision and affecting the world with it.
When discussing the idea of a program implementing what you-on-reflection would do, I think we had different ideas in mind. What I meant was that every action the AI takes would be its best approximation of what you-on-reflection would want. This doesn’t sound dangerous to me. I think that approval-based AI and iterated amplification with HCH would be two ways of approximating the output of you-on-reflection, and I don’t think they’re unworkably dangerous.
If the AI is instead allowed to take arbitrarily many unaligned actions before taking the actions you’d recommend, then you are right that that would be very dangerous. I think this was the idea you had in mind, but feel free to correct me.
If we did misunderstand each other, I apologize. If not, is there something I’m missing? I would think that a program that faithfully outputs some approximation of “what I’d want on reflection” on every action it takes would not perform devastatingly badly. I-on-reflection wouldn’t want the world destroyed, so I don’t think it would take actions that would destroy it.
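The per-action setup described above, where every single action is filtered through an approximation of what you-on-reflection would endorse, can be sketched as a toy loop. All names here are illustrative assumptions; in particular, `approval_estimate` stands in for a learned approximator, not any real system.

```python
# Toy stand-in for a learned estimate of "what I-on-reflection would approve".
# Here it's a fixed scoring function over (state, action) pairs, purely for
# illustration; nothing about training or generalization is modeled.
def approval_estimate(state, action):
    # Hypothetical scoring: prefer states with high "safety", and prefer
    # checking in with the human over acting autonomously.
    return state.get("safety", 0) + (1 if action == "ask_human" else 0)

def approval_directed_step(state, candidate_actions):
    """Pick the single action the estimate rates highest. The agent never
    plans many steps ahead on its own; each step goes through the filter."""
    return max(candidate_actions, key=lambda a: approval_estimate(state, a))

state = {"safety": 1}
actions = ["ask_human", "act_autonomously", "shut_down"]
print(approval_directed_step(state, actions))  # -> ask_human
```

The contrast with the dangerous case discussed above is that there is no separate phase of unsupervised planning: the approximation is consulted at every step, so the agent’s behavior can only be as misaligned as the approximation itself is at that step.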
Rob Bensinger just posted a good summary with references on pragmatic motivations for working on things like decision theory.