I was wondering whether there has been any work on getting around specifying the “correct” decision theory by instead using a more limited decision theory and adjusting the agent’s terminal values to compensate.
I think we might be able to get an agent that does what we want without formalizing the right decision theory by instead modifying the value loading we use. This way, even an AI with a simple, limited decision theory like evidential decision theory could make good choices.
I think that normally, when considering value loading, people imagine finding a way to provide the AI with answers to the question, “What preference ordering over possible worlds would I have, after sufficient reflection, to be used with whatever decision theory I would endorse upon sufficient reflection?” My proposal is instead to build an agent that uses evidential decision theory and change value loading so that it answers the question, “What preference ordering would I, on sufficient reflection, want an agent that uses evidential decision theory to have?” The same move could be made with other decision theories, too.
In principle, you could make an evidential-decision-theoretic agent take the same actions an agent with a more sophisticated decision theory would.
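Stating the target condition a bit more explicitly (this notation is my own gloss, not something from the discussion above): write $U^{*}$ for the preference ordering you would endorse on reflection, $a_{\mathrm{ideal}}(U^{*}, s)$ for the action your ideal decision theory would take with it in situation $s$, and $U_{\mathrm{EDT}}$ for the utility function loaded into the evidential agent. The goal is for the evidential choice rule to reproduce the ideal actions:

$$\arg\max_{a}\ \mathbb{E}\big[\,U_{\mathrm{EDT}} \mid A = a,\ s\,\big] \;=\; a_{\mathrm{ideal}}(U^{*}, s) \quad \text{for every decision situation } s.$$

The two options below are different ways of choosing $U_{\mathrm{EDT}}$ so that this (at least approximately) holds.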
One option is to modify the utility function so that it penalizes acting contrary to your ideal decision theory. For example, suppose that, on reflection, you would conclude that functional decision theory is the “correct” decision theory. Then, when specifying the preference ordering for the agent, you could include a penalty for situations in which the agent does something contrary to what functional decision theory would recommend.
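Here is a minimal, self-contained Python sketch of that penalty idea. Everything in it (the toy world model, the payoffs, and the stand-in for the ideal theory’s recommendation) is invented purely for illustration; it is not a real decision problem or a real implementation of functional decision theory.

```python
# Toy world model: action -> list of (outcome utility, probability) pairs.
TOY_WORLD = {
    "defect":    [(10.0, 0.5), (2.0, 0.5)],  # raw expected utility: 6.0
    "cooperate": [(6.0, 0.9), (4.0, 0.1)],   # raw expected utility: 5.8
}

def ideal_recommendation():
    # Stand-in for "what the ideal decision theory would recommend here".
    return "cooperate"

def loaded_utility(outcome_utility, action, penalty=100.0):
    """Terminal values with a penalty for acting contrary to the ideal theory."""
    if action != ideal_recommendation():
        return outcome_utility - penalty
    return outcome_utility

def edt_choose(world=TOY_WORLD):
    """Pick the action with the highest conditional expected (loaded) utility."""
    def expected_utility(action):
        return sum(p * loaded_utility(u, action) for u, p in world[action])
    return max(world, key=expected_utility)

print(edt_choose())  # -> "cooperate", despite "defect" having higher raw expected utility
```

The deviation penalty lives entirely in the utility function, so the simple conditional-expected-utility chooser never needs to know anything about the more sophisticated theory.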
Another option is to include preferences about mathematical objects representing what would have happened in some other logically possible world if the agent had taken a certain action; the AI could then have preferences about what such a construct outputs. To be clear, although the construct is about what would happen in some other possible world, it is an actual mathematical object, and statements about it are still true or false in the real world.
For example, suppose an AI is considering giving in to xor-extortion. The AI could see that, conditional on its having a given output, AIs like it in other possible worlds would on average do worse, and a preference against this could be loaded.
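Here is a similarly toy Python sketch of this second option. The “mathematical construct” is represented as a fixed table of hypothetical possible worlds together with the payoffs that agents running this same policy would receive in them; the numbers and setup are made up rather than a faithful model of xor-extortion, and the point is only to show a preference being loaded over the construct’s output.

```python
# Hypothetical possible worlds: (weight, payoff if the policy pays,
#                                        payoff if the policy refuses).
POSSIBLE_WORLDS = [
    (0.01, -10.0, -100.0),  # the rare world in which the extorter's threat is real
    (0.99, -10.0,    0.0),  # the common worlds in which paying only invites more letters
]

def average_payoff_of_policy(pays: bool) -> float:
    """The mathematical construct: how agents running this same policy fare on average."""
    return sum(weight * (if_pay if pays else if_refuse)
               for weight, if_pay, if_refuse in POSSIBLE_WORLDS)

def loaded_utility(pays: bool) -> float:
    # The loaded preference is directly about the construct's output, a statement
    # that is true or false regardless of which possible world is actual.
    return average_payoff_of_policy(pays)

best = max([True, False], key=loaded_utility)
print("give in to the extortion?", best)  # -> False: refusing does better on average
```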
I don’t see anything implausible about being able to load preferences that answer the second question into an AI, nor a clear reason to think it would be harder than loading preferences that answer the first one. Some of the techniques for value loading I’ve seen involve getting the AI to learn terminal values from training data, and you could modify the learned terminal values by modifying the training data appropriately.
Another potential value-loading technique is to somehow pick out the people in the AI’s world model and then query them for their values. Techniques like this could potentially be used to load the appropriate terminal values, for example by querying people’s brains with a question like “What would you, on reflection, want an evidential-decision-theoretic agent to value?” rather than “What would you, on reflection, want an agent using whatever decision theory you would actually use to value?”
The advantage of using a simple decision theory and adjusting value loading is that the AI then makes the choices we want given nothing more than correct value loading and an implementation of a basic, easy decision theory like evidential decision theory.
What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference? The problem is that you-on-reflection is not immediately available: it takes planning and action to compute it, planning and action taken without the benefit of its guidance, and thus by default catastrophically misaligned. So an AI with some decision theory but without further connection to human values might win the capability race by reassembling literally everything into the computer needed to answer the question of whether doing that was good. (It’s still extremely unclear how to get even to that point, whatever decision theory is involved. In particular, this assumes that we can define you-on-reflection, and thus that we can define you, which is uploading. And what is “preference” specifically, so that it can be the result of a computation, usable by an agent in the real world?)
The way an AI thinks about the world is also the way it might think about predictions of what you-on-reflection says, in order to get a sense of what to do in advance of having computed the results more precisely (and computing them precisely is probably pointless if a useful kind of prediction is possible). So the practical point of decision theory is deconfusion, figuring out how to accomplish things without resorting to an all-devouring black box.
What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference?
On reflection, there probably is not much difference. This is a good point. Still, an AI that just computes what you would want it to do, for example with approval-based AI or mimicry, also seems like a useful way of getting around specifying a decision theory. I haven’t seen much discussion of the issues with this approach, so I’m interested in what problems could occur that using the right decision theory would solve, if any.
The problem is that you-on-reflection is not immediately available: it takes planning and action to compute it, planning and action taken without the benefit of its guidance, and thus by default catastrophically misaligned.
True. Note, though, that you-on-reflection is not immediately available to an AI with the correct decision theory, either. Whether your AI uses the right or wrong decision theory, it still takes effort to figure out what you-on-reflection would want. I don’t see how this is a bigger problem for agents with primitive decision theories, though.
One way to try to deal with this is to have your AI learn a reasonably accurate model of you-on-reflection before it becomes dangerously intelligent, so that, once it does become superintelligent, it will (hopefully) behave reasonably. And again, this works with both a primitive and a sophisticated decision theory.
So the practical point of decision theory is deconfusion
Okay. I’m having a hard time thinking concretely about how getting less confused about decision theory would help us, but I intuitively imagine it could help somehow. Do you know, more concretely, what the benefits of this deconfusion would be?
I’m having a hard time thinking concretely about how getting less confused about decision theory would help us, but I intuitively imagine it could help somehow. Do you know, more concretely, what the benefits of this deconfusion would be?
Rob Bensinger just posted a good summary with references on pragmatic motivations for working on things like decision theory.

It’s a methodology for AI design, the way science is a methodology for engineering, a source of desiderata for what’s important for various purposes. The activity of developing decision theories is itself like the thought experiments it uses, or like the apparatus of experimental physics: a way of isolating some consideration from other confusing aspects and magnifying its effects to see it more clearly. This teaches lessons that may eventually be used in the separate activity of engineering better devices.
What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference?
On reflection, there probably is not much difference.
Well, there is a huge difference; it’s just not in whether the decisions of you-on-reflection get processed by some decision theory or repeated without change. The setup of you-on-reflection can be thought of as an algorithm, and the decisions or declared preferences are the results of its computation. Computation of an abstract algorithm doesn’t automatically get to affect the real world, since it may fail to actually get carried out, so it has to be channeled by a process that does take place there. And for the purpose of channeling your decisions, a program that just runs your algorithm is no good: it won’t survive AI x-risks (from other AIs, assuming those risks are not resolved), and so won’t get to channel your decisions. On the other hand, a program that runs a sufficiently sane decision theory might be able to survive (including by destroying everything else potentially dangerous to its survival) and eventually get around to computing your decision and affecting the world with it.
When discussing the idea of a program implementing what you-on-reflection would do, I think we had different ideas in mind. What I meant was that every action the AI takes would be its best approximation of what you-on-reflection would want. This doesn’t sound dangerous to me. I think approval-based AI and iterated amplification with HCH would be two ways of approximating the output of you-on-reflection, and I don’t think they’re unworkably dangerous.
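For concreteness, here is a minimal Python sketch of the “approve every action” pattern I have in mind. The approval model is just a hand-written dictionary standing in for whatever learned approximation of you-on-reflection an approval-based or HCH-style system would actually use.

```python
from typing import Callable, Sequence

def approval_directed_step(
    candidate_actions: Sequence[str],
    predicted_approval: Callable[[str], float],
) -> str:
    """Take the single action the (approximate) overseer model rates highest."""
    return max(candidate_actions, key=predicted_approval)

# Toy stand-in for a learned model of you-on-reflection's approval.
toy_approval = {
    "ask for clarification": 0.8,
    "proceed with the current plan": 0.6,
    "seize more resources": 0.1,
}
print(approval_directed_step(list(toy_approval), toy_approval.get))
```

The relevant property is that the agent never takes an action that has not been scored by the approximate overseer model.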
If the AI is instead allowed to take arbitrarily many unaligned actions before taking the actions you’d recommend, then you are right that that would be very dangerous. I think this was the idea you had in mind, but feel free to correct me.
If we did misunderstand each other, I apologize. If not, then is there something I’m missing? I would think that a program that faithfully outputs some approximation of “what I’d want on reflection” for every action it takes would not perform devastatingly badly. I, on reflection, wouldn’t want the world destroyed, so I don’t think it would take actions that destroy it.