When I think about the people working on AGI outcomes within academia these days, I think of people like Robin Hanson, Nick Bostrom, Stuart Russell, and Eric Drexler, and it’s not immediately obvious to me that these people have converged more with each other than any of them have with researchers at MIRI.
I see the lack of convergence between people in academia as supporting my position, since I am claiming that MIRI is looking too narrowly. I think AI risk research is still in a brainstorming stage where we don’t yet have a good grasp of what all the possibilities are. If all of these people have rather different ideas for how to go about it, why is it just the approaches that Eliezer Yudkowsky likes that are getting all the funding?
I also have specific objections. Let’s take TDT and FDT as an example, since they were mentioned in the post. The primary motivation for them is that they handle Newcomb-like dilemmas better. I don’t think Newcomb-like dilemmas are relevant for the reasoning of potentially dangerous AIs, and I don’t think you will get a good holistic understanding of what makes a good reasoner out of these theories. One secondary motivation for TDT/UDT/FDT is a fallacious argument that they endorse cooperation in the true prisoner’s dilemma. Informal arguments seem to be the load-bearing element when applying these theories to any particular problem; the technical work seems to mainly formalize narrow instances of these theories to agree with the informal intuition. I don’t know about FDT, but a fundamental assumption behind TDT and UDT is the existence of a causal structure behind logical statements, which sounds implausible to me.
I don’t think Newcomb-like dilemmas are relevant for the reasoning of potentially dangerous AIs
When a programmer writes software, it’s because they have a prediction in mind about how the software is likely to behave in the future: we have goals we want software to achieve, and we write the code that we think will behave in the intended way. AGI systems are particularly likely to end up in Newcomblike scenarios if we build them to learn our values by reasoning about their programmers’ intentions and goals, or if the system constructs any intelligent subprocesses or subagents to execute tasks, or executes significant self-modifications at all. In the latter cases, the system itself is then in a position of designing reasoning algorithms based on predictions about how the algorithms will behave in the future.
The same principle holds if two agents are modeling each other in real time, as opposed to predicting a future agent; e.g., two copies of an AGI system, or subsystems of a single AGI system. The copies don’t have to be exact, and the systems don’t have to have direct access to each other’s source code, for the same issues to crop up.
One secondary motivation for TDT/UDT/FDT is a fallacious argument that they endorse cooperation in the true prisoner’s dilemma.
What’s the fallacy you’re claiming?
Informal arguments seem to be the load-bearing element when applying these theories to any particular problem; the technical work seems to mainly formalize narrow instances of these theories to agree with the informal intuition.
This seems wrong, if you’re saying that we can’t formally establish the behavior of different decision theories, or that applying theories to different cases requires ad-hoc emendations; see section 5 of “Functional Decision Theory” (and subsequent sections) for a comparison and step-by-step walkthrough of procedures for FDT, CDT, and EDT. One of the advantages we claim for FDT over CDT and EDT is that it doesn’t require ad-hoc tailoring for different dilemmas (e.g., ad-hoc precommitment methods or ratification procedures, or modifications to the agent’s prior).
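To give a concrete sense of what those step-by-step procedures look like, here is a minimal sketch of Newcomb’s problem. The predictor accuracy, payoffs, and the simplified expected-value formulas below are my own toy choices rather than anything taken from the paper:

```python
# A minimal sketch (toy numbers and simplified formulas of my own, not the
# paper's formalism) of the expected values that CDT-, EDT-, and FDT-style
# calculations assign to one-boxing vs. two-boxing in Newcomb's problem.

ACCURACY = 0.99                # assumed reliability of the predictor
SMALL, BIG = 1_000, 1_000_000  # assumed box contents

def payoff(action, prediction):
    # Box B holds BIG iff one-boxing was predicted; box A always holds SMALL.
    box_b = BIG if prediction == "one-box" else 0
    box_a = SMALL if action == "two-box" else 0
    return box_b + box_a

def cdt_value(action, p_predicted_onebox=0.5):
    # CDT: the prediction is causally fixed before the choice, so its
    # distribution is held constant across the actions being compared.
    return (p_predicted_onebox * payoff(action, "one-box")
            + (1 - p_predicted_onebox) * payoff(action, "two-box"))

def edt_value(action):
    # EDT: condition on the action; the predictor matches it with prob. ACCURACY.
    other = "two-box" if action == "one-box" else "one-box"
    return ACCURACY * payoff(action, action) + (1 - ACCURACY) * payoff(action, other)

def fdt_value(action):
    # FDT: the prediction subjunctively depends on the output of the agent's
    # decision function, so intervening on that output moves the prediction
    # too (up to predictor noise).  On this particular problem the numbers
    # coincide with EDT's, though the reasoning behind them differs.
    other = "two-box" if action == "one-box" else "one-box"
    return ACCURACY * payoff(action, action) + (1 - ACCURACY) * payoff(action, other)

for name, value in [("CDT", cdt_value), ("EDT", edt_value), ("FDT", fdt_value)]:
    best = max(["one-box", "two-box"], key=value)
    print(f"{name} recommends {best}")
# CDT recommends two-boxing (it dominates for any fixed prediction);
# EDT and FDT recommend one-boxing.
```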
I don’t know about FDT, but a fundamental assumption behind TDT and UDT
“UDT” is ambiguous and has been used to refer to a lot of different things, but Wei Dai’s original proposals of UDT are particular instances of FDT. FDT can be thought of as a generalization of Wei Dai’s first versions of UDT, one that makes fewer commitments than Wei Dai’s particular approach.
but a fundamental assumption behind TDT and UDT is the existence of a causal structure behind logical statements, which sounds implausible to me.
None of the theories mentioned make any assumption like that; see the FDT paper above.
First, to be clear, I am referring to things such as this description of the prisoner’s dilemma and EY’s claim that TDT endorses cooperation. The published material has been careful to say only that these decision theories endorse cooperation among identical copies running the same source code, but as far as I can tell some researchers at MIRI still believe the stronger claim, and it has been a major part of the public perception of these decision theories (example here; see section II).
The problem is that when two FDT agents with different utility functions and different prior knowledge face a prisoner’s dilemma with each other, their decisions are actually two different logical variables, X0 and X1. The argument for cooperating is that X0 and X1 are sufficiently similar to one another that in the counterfactual where X0 = C we also have X1 = C. However, you could just as easily take the opposite premise, that X0 and X1 are sufficiently dissimilar that counterfactually changing X0 has no effect on X1. Then you are left with the usual CDT analysis of the game. Given the vagueness of logical counterfactuals, it is impossible to distinguish these two situations.
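To make that concrete, here is a toy calculation of my own (not anything from the FDT literature). Agent 0’s recommended move depends entirely on an assumed “link” parameter that says how strongly the counterfactual setting of X0 carries over to X1:

```python
# Toy model: agent 0's value of cooperating in a one-shot prisoner's dilemma,
# under different assumptions about how the counterfactual on X0 affects X1.

# Payoffs to agent 0: (agent 0's move, agent 1's move) -> utility.
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

def value(my_move, link, p_other_cooperates=0.5):
    """Expected payoff to agent 0 of counterfactually setting X0 = my_move.

    link = 1.0 means the counterfactual carries over perfectly (X1 tracks X0,
    the premise behind cooperating); link = 0.0 means X1 is unaffected, which
    just reproduces the ordinary CDT analysis against some fixed belief about
    the other agent."""
    p_c = link * (1.0 if my_move == "C" else 0.0) + (1 - link) * p_other_cooperates
    return p_c * PAYOFF[(my_move, "C")] + (1 - p_c) * PAYOFF[(my_move, "D")]

for link in (1.0, 0.0):
    best = max("CD", key=lambda m: value(m, link))
    print(f"link = {link}: best move for agent 0 is {best}")
# link = 1.0 -> cooperate; link = 0.0 -> defect.  Nothing in the formalism
# itself tells you which value of link is right for two non-identical agents.
```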
Here’s a related question: What does FDT say about the centipede game? There’s no symmetry between the players, so I can’t just plug in the formalism. I don’t see how you can give an answer that’s in the spirit of cooperating in the prisoner’s dilemma without reaching the conclusion that FDT involves altruism among all FDT agents through some kind of veil-of-ignorance argument. And that conclusion runs counter to the affine-transformation invariance of utility functions.
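For concreteness, here is the standard backward-induction (i.e. ordinary CDT, subgame-perfect) analysis of a short centipede game; the payoff numbers are toy values I made up. The open question is what additional counterfactual dependence, if any, FDT is supposed to posit between the two players’ distinct decision procedures:

```python
# Backward induction on a short centipede game with made-up payoffs.
# payoffs_if_take[t] = (payoff to player 0, payoff to player 1) if the player
# moving at node t takes the pot; player t % 2 moves at node t.
payoffs_if_take = [(2, 0), (1, 3), (4, 2), (3, 5)]
payoff_if_all_pass = (6, 4)

def backward_induction(t=0):
    """Payoff pair reached from node t when each mover maximizes selfishly."""
    if t == len(payoffs_if_take):
        return payoff_if_all_pass
    mover = t % 2
    take_now = payoffs_if_take[t]
    continue_on = backward_induction(t + 1)
    # The mover compares taking now with whatever continuing ultimately yields.
    return take_now if take_now[mover] >= continue_on[mover] else continue_on

print(backward_induction())  # (2, 0): player 0 takes at the very first node
```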
“but a fundamental assumption behind TDT and UDT is the existence of a causal structure behind logical statements, which sounds implausible to me.”
None of the theories mentioned make any assumption like that; see the FDT paper above.
Page 14 of the FDT paper:
Instead of a do operator, FDT needs a true operator, which takes a logical sentence φ and updates P to represent the scenario where φ is true...
...Equation (4) works given a graph that accurately describes how changing the value of a logical variable affects other variables, but it is not yet clear how to construct such a thing—nor even whether it can be done in a satisfactory manner within Pearl’s framework.
This seems wrong, if you’re saying that we can’t formally establish the behavior of different decision theories, or that applying theories to different cases requires ad-hoc emendations; see section 5 of “Functional Decision Theory” (and subsequent sections) for a comparison and step-by-step walkthrough of procedures for FDT, CDT, and EDT. One of the advantages we claim for FDT over CDT and EDT is that it doesn’t require ad-hoc tailoring for different dilemmas (e.g., ad-hoc precommitment methods or ratification procedures, or modifications to the agent’s prior).
The main thing that distinguishes FDT from CDT is how the true operator mentioned above functions. As far as I’m aware, this operator is always inserted by hand. That’s easy to do for situations where entities make perfect simulations of one another, but there aren’t even rough guidelines for what to do when the computations involved can’t be delineated so cleanly. In addition, if this were a rich research field, I would expect more “math that bites back”, i.e., substantive results that reduce to clearly defined mathematical problems whose answers weren’t anticipated during the formalization.
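To illustrate what “inserted by hand” means in the easy case: when one entity perfectly simulates the other, you can simply write the dependency graph down yourself, with a single logical node for the decision function’s output feeding both the action and the prediction. The sketch below is only my toy rendering of that idea, not an implementation of the paper’s true operator:

```python
# Hand-specified subjunctive dependency graph for the perfect-simulation case:
# child node -> list of parent nodes it copies its value from.
graph = {
    "decision_fn_output": [],                      # the logical variable
    "agent_action":       ["decision_fn_output"],  # the agent runs the function
    "prediction":         ["decision_fn_output"],  # the predictor simulates it
}

def intervene(node, value, assignments=None):
    """Set a node and propagate the value to everything downstream of it --
    a crude, hand-built analogue of applying true(decision_fn_output = value)."""
    assignments = dict(assignments or {})
    assignments[node] = value
    for child, parents in graph.items():
        if node in parents:
            assignments = intervene(child, value, assignments)
    return assignments

print(intervene("decision_fn_output", "one-box"))
# {'decision_fn_output': 'one-box', 'agent_action': 'one-box', 'prediction': 'one-box'}
# When the predictor only runs some partial, lossy computation related to the
# agent's, there is no longer an obvious graph to write down by hand.
```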
This point about “load-bearing elements” is at its root an intuitive judgement that might be difficult for me to convey properly.
The research directions that people at MIRI prioritize are already pretty heavily informed by work that was first developed or written up by people like Paul Christiano (at OpenAI), Stuart Armstrong (a MIRI research associate, primary affiliation FHI), Wei Dai, and others. MIRI researchers work on the things that look most promising to them, and those things get added to our agenda if they aren’t already on the agenda.
Different researchers at MIRI have different ideas about what’s most promising; e.g., the AAMLS agenda incorporated a lot of problems from Paul Christiano’s agenda, and reflected the new intuitions and inside-view models brought to the table by the researchers who joined MIRI in early and mid-2015.
I’m guessing our primary disagreement is about how promising various object-level research directions at MIRI are. It might also be that you’re thinking that there’s less back-and-forth between MIRI and researchers at other institutions than actually occurs, or more viewpoint uniformity at MIRI than actually exists. Or you might be thinking that people at MIRI working on similar research problems together reflects top-down decisions by Eliezer, rather than reflecting (a) people with similar methodologies and intuitions wanting to work together, and (b) convergence happening faster between people who share the same physical space.
In this case, I think some of the relevant methodology/intuition differences are on questions like:
Are we currently confused on a fundamental level about what general-purpose good reasoning in physical environments is? Not just “how can we implement this in practice?”, but “what (at a sufficient level of precision) are we even talking about?”
Can we become much less confused, and develop good models of how AGI systems decompose problems into subproblems, allocate cognitive resources to different subproblems, etc.?
Is it a top priority for developers to go into large safety-critical software projects like this with as few fundamental confusions about what they’re doing as possible?
People who answer “yes” to those questions tend to cluster together and reach a lot of similar object-level conclusions, and people who answer “no” form other clusters. Resolving those very basic disagreements is therefore likely to be especially high-value.
I don’t think Newcomb-like dilemmas are relevant for the reasoning of potentially dangerous AIs
The primary reason to try to get a better understanding of realistic counterfactual reasoning (e.g., what an agent’s counterfactuals should look like in a decision problem) is that AGI is in large part about counterfactual reasoning. A generating methodology for a lot of MIRI researchers’ work is that we want to ensure the developers of early AGI systems aren’t “flying blind” with respect to how and why their systems work; we want developers to be able to anticipate the consequences of many design choices before they make them.
The idea isn’t that AGI techniques will look like decision theory, any more than they’ll look like probability theory. The idea is rather that it’s essential to have a basic understanding of what decision-making and probabilistic reasoning are before you build a general-purpose probabilistic reasoner and decision-maker. Newcomb’s problem is important in that context primarily because it’s one of the biggest anomalies in our current understanding of counterfactual reasoning. Zeroing in on anomalies in established theories and paradigms, and tugging on loose threads until we get a sense of why our theories break down at this particular point, is a pretty standard and productive approach in the sciences.
All that said, Newcomblike scenarios are ubiquitous in real life, and would probably be much more so for AGI systems. I’ll say more about this in a second comment.