This paper motivates the study of decision theory as necessary for aligning smarter-than-human artificial systems with human interests. We discuss the shortcomings of two standard formulations of decision theory, and demonstrate that they cannot be used to describe an idealized decision procedure suitable for approximation by artificial systems. We then explore the notions of strategy selection and logical counterfactuals, two recent insights into decision theory that point the way toward promising paths for future research.
Following the Corrigibility paper, this is the second in a series of six papers motivating MIRI’s active research areas. Also included in the series will be a technical agenda, which motivates all six research areas and describes the reasons why we have selected these topics in particular, and an annotated bibliography, which compiles a fair bit of related work. I plan to post one paper every week or two for the next few months.
I’ve decided to start with the decision theory paper, as it’s one of the meatiest. This paper compiles and summarizes quite a bit of work on decision theory that was done right here on LessWrong. There is a lot more to be said on the subject of decision theory than can fit into a single paper, but I think this one does a fairly good job of describing why we’re interested in the field and summarizing some recent work in the area. The introduction is copied below. Enjoy!
As artificially intelligent machines grow more capable and autonomous, the behavior of their decision procedures becomes increasingly important. This is especially true in systems possessing great general intelligence: superintelligent systems could have a massive impact on the world (Bostrom 2014), and if a superintelligent system made poor decisions (by human standards) at a critical juncture, the results could be catastrophic (Yudkowsky 2008). When constructing systems capable of attaining superintelligence, it is important for them to use highly reliable decision procedures.
Verifying that a system works well in test conditions is not sufficient for high confidence. Consider the genetic algorithm of Bird and Layzell (2002), which, if run on a simulated representation of a circuit board, would have evolved an oscillating circuit. Running in reality, the algorithm instead repurposed the circuit tracks on its motherboard as a makeshift radio to amplify oscillating signals from nearby computers. Smarter-than-human systems acting in reality may encounter situations beyond both the experience and the imagination of the programmers. In order to verify that an intelligent system would make good decisions in the real world, it is important to have a theoretical understanding of why that algorithm, specifically, is expected to make good decisions even in unanticipated scenarios.
What does it mean to “make good decisions”? To formalize the question, it is necessary to precisely define a process that takes a problem description and identifies the best available decision (with respect to some set of preferences¹). Such a process could not be run, of course; but it would demonstrate a full understanding of the problem of decision-making. If someone cannot formally state what it means to find the best decision in theory, then they are probably not ready to construct heuristics that attempt to find the best decision in practice.
At first glance, formalizing an idealized process which identifies the best decision in theory may seem trivial: iterate over all available actions, calculate the utility that would be attained in expectation if that action were taken, and select the action which maximizes expected utility. But what are the available actions? And what are the counterfactual universes corresponding to what “would happen” if an action “were taken”? These questions are more difficult than they may seem.
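To make the "seemingly trivial" procedure concrete, here is a minimal sketch in Python. It is only an illustration, not code from the paper: the names `expected_utility`, `best_action`, and the toy problem are hypothetical. Note that the sketch simply assumes it is handed an action set and a function `outcome_dist` saying what "would happen" if each action "were taken"; those two inputs are exactly the parts the questions above are about.

```python
from typing import Callable, Dict, Hashable, Iterable

Action = Hashable
Outcome = Hashable


def expected_utility(
    action: Action,
    outcome_dist: Callable[[Action], Dict[Outcome, float]],
    utility: Callable[[Outcome], float],
) -> float:
    """Probability-weighted utility of the outcomes assumed to follow `action`."""
    return sum(p * utility(o) for o, p in outcome_dist(action).items())


def best_action(
    actions: Iterable[Action],
    outcome_dist: Callable[[Action], Dict[Outcome, float]],
    utility: Callable[[Outcome], float],
) -> Action:
    """Iterate over the given actions and return one maximizing expected utility."""
    return max(actions, key=lambda a: expected_utility(a, outcome_dist, utility))


# Toy problem (all values are made up for illustration).
toy_actions = ["safe", "risky"]


def toy_outcome_dist(action: Action) -> Dict[Outcome, float]:
    # ASSUMPTION: the counterfactual "what would happen if this action were
    # taken" is simply given to us as a table. Constructing this table in a
    # principled way is the open problem, not something this sketch solves.
    table = {
        "safe": {"small_prize": 1.0},
        "risky": {"big_prize": 0.5, "nothing": 0.5},
    }
    return table[action]


def toy_utility(outcome: Outcome) -> float:
    return {"small_prize": 4.0, "big_prize": 10.0, "nothing": 0.0}[outcome]


chosen = best_action(toy_actions, toy_outcome_dist, toy_utility)
print(chosen)
```

The loop itself is a single call to `max`. Everything interesting is hidden in how `toy_actions` and `toy_outcome_dist` were obtained, which the sketch takes for granted.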
The difficulty is easiest to illustrate in a deterministic setting. Consider a deterministic decision procedure embedded in a deterministic environment. There is exactly one action that the decision procedure is going to select. What, then, are the actions it “could have taken”? Identifying this set may not be easy, especially if the line between agent and environment is blurry. (Recall the genetic algorithm repurposing the motherboard as a radio.) However, action identification is not the focus of this paper.
This paper focuses on the problem of evaluating each action given the action set. The deterministic algorithm will only take one of the available actions; how is the counterfactual environment constructed, in which a deterministic part of the environment does something it doesn’t? Answering this question requires a satisfactory theory of counterfactual reasoning, and that theory does not yet exist.
Many problems are characterized by their idealized solutions, and the problem of decision-making is no exception. To fully describe the problem faced by intelligent agents making decisions, it is necessary to provide an idealized procedure which takes a description of an environment and one of the agents within, and identifies the best action available to that agent. Philosophers have studied candidate procedures for quite some time, under the name of decision theory. The investigation of what is now called decision theory stretches back to Pascal and Bernoulli; more recently decision theory has been studied by Wald (1939), Lehmann (1950), Jeffrey (1965), Lewis (1981), Joyce (1999), Pearl (2000) and many others.
Various formulations of decision theory correspond to different ways of formalizing counterfactual reasoning. Unfortunately, the standard answers from the literature do not allow for the description of an idealized decision procedure. Two common formulations and their shortcomings are discussed in Section 2. Section 3 argues that these shortcomings imply the need for a better theory of counterfactual reasoning to fully describe the problem that artificially intelligent systems face when selecting actions. Sections 4 and 5 discuss two recent insights that give some reason for optimism and point the way toward promising avenues for future research. Nevertheless, Section 6 briefly discusses the pessimistic scenario in which it is not possible to fully formalize the problem of decision-making before the need arises for robust decision-making heuristics. Section 7 concludes by tying this study of decision theory back to the more general problem of aligning smarter-than-human systems with human interests.
1: For simplicity, assume von Neumann-Morgenstern rational preferences, that is, preferences describable by some utility function. The problems of decision theory arise regardless of how preferences are encoded.
New paper from MIRI: “Toward idealized decision theory”
I’m pleased to announce a new paper from MIRI: Toward Idealized Decision Theory.