You said:
Causal decision theorists don’t self-modify to timeless decision theorists. If you get the decision theory wrong, you can’t rely on it repairing itself.
but you also said:
...if you build an AI that two-boxes on Newcomb’s Problem, it will self-modify to one-box on Newcomb’s Problem, if the AI considers in advance that it might face such a situation.
I can envision several possibilities:
Perhaps you changed your mind and presently disagree with one of the above two statements.
Perhaps you didn’t mean a causal AI in the second quote. In that case I have no idea what you meant.
Perhaps Newcomb’s problem is the wrong example, and there’s some other example motivating TDT that a self-modifying causal agent would deal with incorrectly.
Perhaps you have a model of causal decision theory that makes self-modification impossible in principle. That would make your first statement above true, in a useless sort of way, so I hope you didn’t mean that.
Would you like to clarify?
Causal decision theorists self-modify to one-box on Newcomb’s Problem with Omegas that looked at their source code after the self-modification took place; i.e., if the causal decision theorist self-modifies at 7am, it will self-modify to one-box with Omegas that looked at the code after 7am and two-box otherwise. This is not only ugly but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.
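A minimal worked payoff comparison makes that asymmetry concrete (illustrative numbers; the standard Newcomb stakes of $1,000,000 in box B when one-boxing is predicted and $1,000 in box A are assumed, with a reliable predictor):

```latex
% Payoffs as seen by a CDT agent deciding, at 7am, whether to rewrite itself
% to one-box (illustrative numbers; standard Newcomb stakes assumed).
\begin{align*}
\text{Omega scans after 7am:}\quad & EU(\text{rewrite to one-box}) = 1{,}000{,}000,
  \quad EU(\text{keep two-boxing}) = 1{,}000\\
\text{Omega scanned before 7am:}\quad & EU(\text{one-box}) = b,
  \quad EU(\text{two-box}) = b + 1{,}000 \qquad (b = \text{contents already in box B})
\end{align*}
```

Since the rewrite only causally affects scans that happen after it, the CDT calculation endorses one-boxing in the first case and two-boxing in the second, which is exactly the split described above.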
Bad decision theories don’t necessarily self-repair correctly.
And in general, every time you throw up your hands in the air and say, “I don’t know how to solve this problem, nor do I understand the exact structure of the calculation my computer program will perform in the course of solving this problem, nor can I state a mathematically precise meta-question, but I’m going to rely on the AI solving it for me ’cause it’s supposed to be super-smart,” you may very possibly be about to screw up really damned hard. I mean, that’s what Eliezer-1999 thought you could say about “morality”.
Okay, thanks for confirming that Newcomb’s problem is a relevant motivating example here.
“I don’t know how to solve this problem, nor do I understand the exact structure of the calculation my computer program will perform in the course of solving this problem, nor can I state a mathematically precise meta-question, but I’m going to rely on the AI solving it for me ’cause it’s supposed to be super-smart,”
I’m not saying that. I’m saying that self-modification solves the problem, assuming the CDT agent moves first, and that it seems simple enough that we can check that a not-very-smart AI solves it correctly on toy examples. If I get around to attempting that, I’ll post to LessWrong.
Assuming the CDT agent moves first seems reasonable. I have no clue whether or when Omega is going to show up, so I feel no need to second-guess the AI about that schedule.
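The toy check described above might look something like the following sketch (an editorial illustration, under the assumption that Omega scans the agent's code only after the self-modification step):

```python
# Toy check of the claim above (an editorial sketch, not code from the thread):
# the CDT agent moves first, i.e. it rewrites its policy before Omega scans it,
# so the rewrite is an ordinary causal act with an ordinary causal payoff.

STAKES = {"box_a": 1_000, "box_b_if_one_box_predicted": 1_000_000}

def simulate(policy):
    """Timeline: (1) agent self-modifies to `policy`, (2) Omega scans the
    current code and fills the boxes, (3) the agent acts on `policy`."""
    prediction = policy  # the scan happens after the rewrite, so it sees `policy`
    box_b = STAKES["box_b_if_one_box_predicted"] if prediction == "one-box" else 0
    return box_b if policy == "one-box" else box_b + STAKES["box_a"]

def cdt_choose_rewrite(candidate_policies):
    # CDT step: compare the causal consequences of each candidate rewrite
    # and keep the one with the highest simulated payoff.
    return max(candidate_policies, key=simulate)

print(cdt_choose_rewrite(["one-box", "two-box"]))  # prints "one-box"
```

Running the same evaluation against an Omega that scanned before the rewrite would leave two-boxing in place, which is the asymmetry Eliezer points to above.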
(Quoting out of order)
This is not only ugly...
As you know, we can define a causal decision theory agent in one line of math. I don’t know a way to do that for TDT. Do you? If TDT could be concisely described, I’d agree that it’s the less ugly alternative.
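For concreteness, the usual one-line CDT rule looks something like the following (the do-notation form is an assumption about which formulation is meant; equivalent counterfactual formulations exist):

```latex
% A one-line CDT agent (do-notation form; one of several equivalent writings).
\[
a^{*} \;=\; \operatorname*{argmax}_{a \,\in\, \text{Actions}} \;\sum_{o \,\in\, \text{Outcomes}} P\!\left(o \mid \mathrm{do}(a)\right)\, U(o)
\]
```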
but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.
I’m failing to suspend disbelief here. Do you have motivating examples for TDT that seem likely to happen before Kurzweil’s schedule for the Singularity causes us to either win or lose the game?
As you know, we can define a causal decision theory agent in one line of math.
If you appreciate simplicity/elegance, I suggest looking into UDT. UDT says that when you’re making a choice, you’re deciding the output of a particular computation, and the consequences of any given choice are just the logical consequences of that computation having that output.
CDT in contrast doesn’t answer the question “what am I actually deciding when I make a decision?” nor does it answer “what are the consequences of any particular choice?” even in principle. CDT can only be described in one line of math because the answer to the latter question has to be provided to it via an external parameter.
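A minimal toy rendering of that framing on Newcomb's Problem (an editorial sketch under the assumption that Omega predicts by running the same decision computation; this is not Wei Dai's formal definition):

```python
# Toy rendering of the UDT framing above (an editorial sketch, not Wei Dai's
# formal definition). Here Omega scanned the agent's code *before* the choice;
# its prediction still matches the choice because the prediction is a logical
# consequence of the same decision computation having a given output.

def consequences(output):
    """What follows logically from 'my decision computation returns `output`'?
    Omega ran that same computation earlier, so its prediction equals `output`."""
    prediction = output
    box_b = 1_000_000 if prediction == "one-box" else 0
    box_a = 1_000
    return box_b if output == "one-box" else box_b + box_a

def udt_decide(candidate_outputs):
    # Pick the output whose logical consequences have the highest utility.
    return max(candidate_outputs, key=consequences)

print(udt_decide(["one-box", "two-box"]))  # prints "one-box"
```

Unlike the CDT sketch earlier, nothing here depends on when Omega scanned the code; the prediction tracks the choice as a logical consequence rather than a causal one.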
but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.
I’m failing to suspend disbelief here. Do you have motivating examples for TDT that seem likely to happen before Kurzweil’s schedule for the Singularity causes us to either win or lose the game?
I’m reasonably sure Eliezer meant implications for the would-be friendly AI meeting alien AIs. That could happen at any time in the remaining life span of the universe.
Thanks, I’ll have a look at UDT.
I certainly agree there.
I don’t know a way to do that for TDT. Do you?
Maybe this one: “Argmax[A in Actions] in Sum[O in Outcomes] (Utility(O)*P(this computation yields A []-> O|rest of universe))”
From this post.
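Spelled out in more conventional notation, the quoted line reads roughly as follows (an editorial transcription; “[]->” is the counterfactual arrow):

```latex
% The quoted TDT one-liner, transcribed into standard notation.
\[
\operatorname*{argmax}_{A \,\in\, \text{Actions}} \;\sum_{O \,\in\, \text{Outcomes}}
  U(O)\cdot P\bigl(\text{this computation yields } A \;\Box\!\!\rightarrow O \;\big|\; \text{rest of universe}\bigr)
\]
```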