Eliezer, if a smart creature modifies itself in order to gain strategic advantages from committing itself to future actions, it must think it could better achieve its goals by doing so. If so, why should we be concerned, if those goals do not conflict with our goals?
Well, there’s a number of answers I could give to this:
*) After you’ve spent some time working in the framework of a decision theory where dynamic inconsistencies naturally Don’t Happen—not because there’s an extra clause forbidding them, but because the simple foundations just don’t give rise to them—then an intertemporal preference reversal starts looking like just another preference reversal.
*) I developed my decision theory using mathematical technology, like Pearl’s causal graphs, that wasn’t around when causal decision theory was invented. (CDT takes counterfactual distributions as fixed givens, but I have to compute them from observation somehow; a toy sketch of that kind of computation appears after this list.) So it’s not surprising if I think I can do better.
*) We’re not talking about a patchwork of self-modifications. An AI can easily make one fully general modification to its future self, once and for all, so that it does what its past self would have wished on future problems, even ones the past self did not explicitly consider. Why would I bother to consider the general framework of classical causal decision theory when I don’t expect the AI to work inside that general framework for longer than 50 milliseconds?
*) I did work out what an initially causal-decision-theorist AI would modify itself to, if it booted up on July 11, 2018, and it looks something like this: “Behave like a nonclassical-decision-theorist if you are confronting a Newcomblike problem that was determined by ‘causally’ interacting with you after July 11, 2018, and otherwise behave like a classical causal decision theorist.” Roughly, self-modifying capability in a classical causal decision theorist doesn’t fix the problem that gives rise to the intertemporal preference reversals; it just makes one temporal self win out over all the others. (A toy version of the expected-value calculation behind this split is sketched after this list.)
*) Imagine time spread out before you like a 4D crystal. Now imagine pointing to one point in that crystal and saying, “The rational decision given information X, and utility function Y, is A”, then pointing to another point in the crystal and saying, “The rational decision given information X, and utility function Y, is B”. Of course you have to be careful that all conditions really are exactly identical: the agent has not learned anything over the course of time that would change X, and the agent is not selfish with temporal deixis, which would change Y. But if all these conditions are fulfilled, I don’t see why an intertemporal inconsistency should be any less disturbing than an interspatial inconsistency. You can’t have 2 + 2 = 4 in Dallas and 2 + 2 = 3 in Minneapolis.
*) What happens if I want to use a computation distributed over a large enough volume that there are lightspeed delays and no objective space of simultaneity? Do the pieces of the program start fighting each other?
*) Classical causal decision theory is just not optimized for the purpose I need a decision theory for, any more than a toaster is likely to work well as a lawnmower. Its inventors did not have my design requirements in mind.
*) I don’t have to put up with dynamic inconsistencies. Why should I?
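To give the causal-graphs point above some concrete flavor: here is a minimal sketch, in Python, of what “computing a counterfactual distribution from observation” amounts to on a toy three-node graph Z -> X, Z -> Y, X -> Y, using Pearl’s back-door adjustment. The numbers and variable names are invented for illustration; this shows the general technique, not the machinery of my actual decision theory.

    # Toy causal graph: Z -> X, Z -> Y, X -> Y, all variables binary.
    # Classical CDT would take P(Y | do(X=x)) as a primitive; here it is
    # computed from a purely observational joint distribution by back-door
    # adjustment (truncated factorization). Numbers are illustrative only.

    P_Z = {0: 0.5, 1: 0.5}                         # P(Z=z)
    P_X_given_Z = {0: {0: 0.9, 1: 0.1},            # P(X=x | Z=0)
                   1: {0: 0.2, 1: 0.8}}            # P(X=x | Z=1)
    P_Y1_given_XZ = {(0, 0): 0.1, (1, 0): 0.6,     # P(Y=1 | X=x, Z=z)
                     (0, 1): 0.3, (1, 1): 0.9}

    def p_y1_do_x(x):
        """P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(z)."""
        return sum(P_Y1_given_XZ[(x, z)] * P_Z[z] for z in (0, 1))

    def p_y1_given_x(x):
        """Ordinary conditional P(Y=1 | X=x), distorted by confounding through Z."""
        joint_z = {z: P_Z[z] * P_X_given_Z[z][x] for z in (0, 1)}
        total = sum(joint_z.values())
        return sum(P_Y1_given_XZ[(x, z)] * joint_z[z] for z in (0, 1)) / total

    for x in (0, 1):
        print(f"P(Y=1 | do(X={x})) = {p_y1_do_x(x):.3f}    "
              f"P(Y=1 | X={x}) = {p_y1_given_x(x):.3f}")

The do(X) column is what a decision theory actually needs when choosing X; the plain conditional column is what naive conditioning on X gives you, and the gap between the two is exactly what the graph machinery exists to handle.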
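And to put a number on the self-modification split: here is a minimal sketch of the expected-value comparison a causal decision theorist would run at self-modification time T, assuming the standard $1,000,000 / $1,000 Newcomb payoffs and an invented predictor accuracy and prior. It is an illustration of the split, not a transcript of the actual derivation.

    # Toy expected-value calculation for the self-modification split above.
    # At time T, a CDT agent chooses which policy to install for a Newcomb
    # problem with predictor accuracy ACC (standard $1,000,000 / $1,000
    # payoffs; ACC and the prior are made-up illustrative numbers).

    ACC = 0.99                      # predictor accuracy
    BIG, SMALL = 1_000_000, 1_000   # opaque-box and transparent-box payoffs

    def expected_payoff(policy, p_big_box_full):
        """Expected payoff of a policy given P(opaque box contains $1M)."""
        bonus = SMALL if policy == "two-box" else 0
        return p_big_box_full * BIG + bonus

    # Case 1: the prediction is made after T, so it tracks the installed policy.
    for policy in ("one-box", "two-box"):
        p_full = ACC if policy == "one-box" else 1 - ACC
        print(f"determined after T, install {policy}: "
              f"EV = {expected_payoff(policy, p_full):,.0f}")

    # Case 2: the prediction was causally fixed before T, so it is independent
    # of whatever policy gets installed now.
    P_FULL_PRIOR = 0.5              # the agent's belief about the already-sealed box
    for policy in ("one-box", "two-box"):
        print(f"determined before T, install {policy}: "
              f"EV = {expected_payoff(policy, P_FULL_PRIOR):,.0f}")

Installing the one-boxing policy wins for problems determined after T (990,000 vs. 11,000) and loses for problems determined before T (500,000 vs. 501,000), which is just the “behave nonclassically only on Newcomblike problems determined after the self-modification” rule, with all the pre-modification selves left out in the cold.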