...if you build an AI that two-boxes on Newcomb’s Problem, it will self-modify to one-box on Newcomb’s Problem, if the AI considers in advance that it might face such a situation. Agents with free access to their own source code have access to a cheap method of precommitment.
...
But what does an agent with a disposition generally-well-suited to Newcomblike problems look like? Can this be formally specified?
...
Rational agents should WIN.
It seems to me that if all that is true, and you want to build a Friendly AI, then the rational thing to do here is build it and let it solve all problems like these. That way, you win, at least in the time-management sense. Well, you might lose if you encountered Omega before the FAI was up and running, but that seems unlikely. Am I missing something here?
It will also have to precommit to mere humans who can’t read its source code and can’t predict the future, so solving the problem in the case where you meet Omega doesn’t solve the problem in general.
Causal decision theorists don’t self-modify to timeless decision theorists. If you get the decision theory wrong, you can’t rely on it repairing itself.
Causal decision theorists don’t self-modify to timeless decision theorists. If you get the decision theory wrong, you can’t rely on it repairing itself.
but you also said:
...if you build an AI that two-boxes on Newcomb’s Problem, it will self-modify to one-box on Newcomb’s Problem, if the AI considers in advance that it might face such a situation.
I can envision several possibilities:
Perhaps you changed your mind and presently disagree with one of the above two statements.
Perhaps you didn’t mean a causal AI in the second quote. In that case I have no idea what you meant.
Perhaps Newcomb’s problem is the wrong example, and there’s some other example motivating TDT that a self-modifying causal agent would deal with incorrectly.
Perhaps you have a model of causal decision theory that makes self-modification impossible in principle. That would make your first statement above true, in a useless sort of way, so I hope you didn’t mean that.
Causal decision theorists self-modify to one-box on Newcomb’s Problem with Omegas that looked at their source code after the self-modification took place; i.e., if the causal decision theorist self-modifies at 7am, it will self-modify to one-box with Omegas that looked at the code after 7am and two-box otherwise. This is not only ugly but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.
Bad decision theories don’t necessarily self-repair correctly.
And in general, every time you throw up your hands in the air and say, “I don’t know how to solve this problem, nor do I understand the exact structure of the calculation my computer program will perform in the course of solving this problem, nor can I state a mathematically precise meta-question, but I’m going to rely on the AI solving it for me ’cause it’s supposed to be super-smart,” you may very possibly be about to screw up really damned hard. I mean, that’s what Eliezer-1999 thought you could say about “morality”.
Okay, thanks for confirming that Newcomb’s problem is a relevant motivating example here.
“I don’t know how to solve this problem, nor do I understand the exact structure of the calculation my computer program will perform in the course of solving this problem, nor can I state a mathematically precise meta-question, but I’m going to rely on the AI solving it for me ’cause it’s supposed to be super-smart,”
I’m not saying that. I’m saying that self-modification solves the problem, assuming the CDT agent moves first, and that it seems simple enough that we can check that a not-very-smart AI solves it correctly on toy examples. If I get around to attempting that, I’ll post to LessWrong.
Assuming the CDT agent moves first seems reasonable. I have no clue whether or when Omega is going to show up, so I feel no need to second-guess the AI about that schedule.
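A minimal sketch of the toy-example check mentioned above, using Eliezer’s 7am timestamp and the standard Newcomb payoffs (everything else here is illustrative, not from the thread):

```python
# Toy model of the behaviour described above: an agent that
# self-modifies to one-box at 7am, facing Omegas that scanned its
# code either before or after 7am.

def newcomb_payoff(one_boxes, predicted_one_box):
    """Standard Newcomb payoffs: the opaque box holds $1M iff Omega
    predicted one-boxing; the transparent box always holds $1k."""
    opaque = 1_000_000 if predicted_one_box else 0
    return opaque if one_boxes else opaque + 1_000

def action_after_self_modification(t_scan, t_mod=7):
    """One-box only against Omegas that read the code after t_mod."""
    return t_scan >= t_mod

results = {}
for t_scan in (6, 8):
    one_box = action_after_self_modification(t_scan)
    # Omega's prediction is stipulated to be accurate:
    results[t_scan] = newcomb_payoff(one_box, predicted_one_box=one_box)
print(results)  # scanning after the 7am self-modification pays off
```

The point of such a check would just be to confirm that the self-modified agent wins against late-scanning Omegas and still loses against early-scanning ones.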
(Quoting out of order)
This is not only ugly...
As you know, we can define a causal decision theory agent in one line of math. I don’t know a way to do that for TDT. Do you? If TDT could be concisely described, I’d agree that it’s the less ugly alternative.
but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.
I’m failing to suspend disbelief here. Do you have motivating examples for TDT that seem likely to happen before Kurzweil’s schedule for the Singularity causes us to either win or lose the game?
As you know, we can define a causal decision theory agent in one line of math.
If you appreciate simplicity/elegance, I suggest looking into UDT. UDT says that when you’re making a choice, you’re deciding the output of a particular computation, and the consequences of any given choice are just the logical consequences of that computation having that output.
CDT in contrast doesn’t answer the question “what am I actually deciding when I make a decision?” nor does it answer “what are the consequences of any particular choice?” even in principle. CDT can only be described in one line of math because the answer to the latter question has to be provided to it via an external parameter.
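Concretely, the “one line of math” alluded to is presumably the causal expected-utility rule (notation mine, not from the thread):

```latex
a^{*} \;=\; \operatorname*{arg\,max}_{a \in A} \;\sum_{o \in O} P\!\bigl(o \mid \mathrm{do}(a)\bigr)\, U(o)
```

The brevity is bought by treating the causal probability function $P(\cdot \mid \mathrm{do}(a))$ as a given, which is exactly the external parameter complained about above.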
but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.
I’m failing to suspend disbelief here. Do you have motivating examples for TDT that seem likely to happen before Kurzweil’s schedule for the Singularity causes us to either win or lose the game?
I’m reasonably sure Eliezer meant implications for the would-be friendly AI meeting alien AIs. That could happen at any time in the remaining life span of the universe.
Causal decision theorists don’t self-modify to timeless decision theorists.
Why not? A causal decision theorist can have an accurate abstract understanding of both TDT and CDT and can calculate the expected utility of applying either. If TDT produces a better expected outcome in general then it seems like self modifying to become a TDT agent is the correct decision to make. Is there some restriction or injunction assumed to be in place with respect to decision algorithm implementation?
Thinking about it for a few minutes: It would seem that the CDT agent will reliably update away from CDT, but that the new algorithm will be neither CDT nor TDT (and not UDT either). It will be able to cooperate with agents when there has been some sort of causal entanglement between the modified source code and the other agent, but not able to cooperate with complete strangers. The resultant decision algorithm is enough of an attractor that it deserves a name of its own. Does it have one?
Doesn’t have a name as far as I know. But I’m not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch? Maybe, now that I think about it, but even so, does it matter?
A causal decision theorist can have an accurate abstract understanding of both TDT and CDT and can calculate the expected utility of applying either.
But it will calculate that expected value using CDT!expectation, meaning that it won’t see how self-modifying to be a timeless decision theorist could possibly affect what’s already in the box, etcetera.
Doesn’t have a name as far as I know. But I’m not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch?
Yes, because there are lemmas you can prove about (some) decision theory problems which imply that CDT and UDT give the same output. For example, CDT works if there exists a total ordering over inputs given to the strategy, common to all execution histories, such that the world program invokes the strategy only with increasing, non-repeating inputs on that ordering. There are (relatively) easy algorithms for these cases. Using CDT in general is then a matter of applying a theorem when one of its preconditions doesn’t hold, which is one of the most common math mistakes ever.
Is that really so bad, if it takes the state of the world at the point before it self-modifies as an unchangeable given, and self-modifies to a decision theory that only considers states from that point on as changeable by its decision theory? For one thing, doesn’t that avoid Roko’s basilisk?
Is that really so bad, if it takes the state of the world at the point before it self-modifies as an unchangeable given, and self-modifies to a decision theory that only considers states from that point on as changeable by its decision theory?
If you do that, you’d be vulnerable to extortion from any other AIs that happen to be created earlier in time and can prove their source code.
I’m inclined to think that in most scenarios the first AGI wins anyway. And leaving solving decision theory to the AGI could mean you get to build it earlier.
I’m inclined to think that in most scenarios the first AGI wins anyway.
I was thinking of meeting alien AIs, post-Singularity.
And leaving solving decision theory to the AGI could mean you get to build it earlier.
Huh? I thought we were supposed to be the good guys here? ;-)
But seriously, “sacrifice safety for speed” is the “defect” option in the game of “let’s build AGI”. I’m not sure how to get the C/C outcome (or rather C/C/C/...), but it seems too early to start talking about defecting already.
Besides, CDT is not well defined enough that you can implement it even if you wanted to. I think if you were forced to implement a “good enough” decision theory and hope for the best, you’d pick UDT at this point. (UDT is also missing a big chunk from its specifications, namely the “math intuition module” but I think that problem has to be solved anyway. It’s hard to see how an AGI can get very far without being able to deal with logical/mathematical uncertainty.)
I was thinking of meeting alien AIs, post-Singularity.
What pre-singularity actions are you worried about them taking?
Huh? I thought we were supposed to be the good guys here? ;-)
What I was thinking was that a CDT-seeded AI might actually be safer precisely because it won’t try to change pre-Singularity events, and if it’s first the new decision theory will be in place in time for any post-Singularity events.
Besides, CDT is not well defined enough that you can implement it even if you wanted to.
That’s surprising to me—what should I read in order to understand this point better? EDIT: strike that, you answer that above.
What pre-singularity actions are you worried about them taking?
They could modify themselves so that if they ever encounter a CDT-descended AI they’ll start a war (even if it means mutual destruction) unless the CDT-descended AI gives them 99% of its resources.
They could modify themselves so that if they ever encounter a CDT-descended AI they’ll start a war (even if it means mutual destruction) unless the CDT-descended AI gives them 99% of its resources.
They could also modify themselves to make the analogous threat if they encounter a UDT-descended AI, or a descendant of an AI designed by Tim Freeman, or a descendant of an AI designed by Wei Dai, or a descendant of an AI designed using ideas mentioned on LessWrong. I would hope that any of those AIs would hand over 99% of their resources if the extortionist could prove its source code and prove that war would be worse. I assume you’re saying that CDT is special in this regard. How is it special?
(Thanks for the pointer to the James Joyce book, I’ll have a look at it.)
I assume you’re saying that CDT is special in this regard. How is it special?
If the alien AI computes the expected utility of “provably modify myself to start a war against CDT-AI unless it gives me 99% of its resources”, it’s certain to get a high value, whereas if it computes the expected utility of “provably modify myself to start a war against UDT-AI unless it gives me 99% of its resources” it might possibly get a low value (not sure because UDT isn’t fully specified), because the UDT-AI, when choosing what to do when faced with this kind of threat, would take into account the logical correlation between its decision and the alien AI’s prediction of its decision.
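A back-of-the-envelope version of that comparison, with made-up numbers (the 99% demand is from the thread; the resource and war values are illustrative):

```python
# Illustrative numbers only: the victim holds 100 resource units, the
# extortionist demands 99, and war costs both sides far more than that.
RESOURCES, DEMAND, WAR = 100, 99, -1_000

def cdt_victim_complies():
    # Facing a provably committed threat, a CDT victim compares
    # keeping 1 unit against fighting a -1000 war, and complies.
    return (RESOURCES - DEMAND) > WAR

def udt_victim_complies():
    # A UDT-style victim also weighs the logical correlation: a policy
    # of never paying makes the (predicted) threat unprofitable to
    # issue in the first place, so it can refuse even at the point of war.
    return False

def extortionist_eu(victim_complies):
    return DEMAND if victim_complies else WAR

print(extortionist_eu(cdt_victim_complies()))  # 99: threatening CDT pays
print(extortionist_eu(udt_victim_complies()))  # -1000: threatening UDT does not
```

This is only a cartoon of the argument, of course; the real question is whether the extortionist’s prediction of the victim’s policy is accurate, which is what the logical-correlation point is about.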
...if it computes the expected utility of “provably modify myself to start a war against UDT-AI unless it gives me 99% of its resources” it might possibly get a low value (not sure because UDT isn’t fully specified), because the UDT-AI, when choosing what to do when faced with this kind of threat, would take into account the logical correlation between its decision and the alien AI’s prediction of its decision.
Well, that’s plausible. I’ll have to work through some UDT examples to understand fully.
What model do you have of how entity X can prove to entity Y that X is running specific source code?
The proof that I can imagine is entity Y gives some secure hardware Z to X, and then X allows Z to observe the process of X self-modifying to run the specified source code, and then X gives the secure hardware back to Y. Both X and Y can observe the creation of Z, so Y can know that it’s secure and X can know that it’s a passive observer rather than a bomb or something.
This model breaks the scenario, since a CDT playing the role of Y could self-modify any time before it hands over Z and play the game competently.
Now, if there’s some way for X to create proofs of X’s source code that will be convincing to Y without giving advance notice to Y, I can imagine a problem for Y here. Does anyone know how to do that?
(I acknowledge that if nobody knows how to do that, that means we don’t know how to do that, not that it can’t be done.)
Hmm, this explains my aversion to knowing the details of what other people are thinking. It can put me at a disadvantage in negotiations unless I am able to lie convincingly and say I do not know.
I think I’ll stop here for now, because you already seem intrigued enough to want to learn about UDT in detail. I’m guessing that once you do, you won’t be so motivated to think up reasons why CDT isn’t really so bad. :) Let me know if that turns out not to be the case though.
What model do you have of how entity X can prove to entity Y that X is running specific source code?
On second thought, I should answer this question because it’s of independent interest. If Y is sufficiently powerful, it may be able to deduce the laws of physics and the initial conditions of the universe, and then obtain X’s source code by simulating the universe up to when X is created. Note that Y may do this not because it wants to know X’s source code in some anthropomorphic sense, but simply due to how its decision-making algorithm works.
If Y is sufficiently powerful, it may be able to deduce the laws of physics and the initial conditions of the universe, and then obtain X’s source code by simulating the universe up to when X is created.
Unless some specific assumptions have been made about the universe, that will not work. Simulating the entire universe does not tell Y which part of the universe it inhabits. It will give Y a set of possible parts of the universe which match Y’s observations. While the simulation strategy will allow the best possible prediction about what X’s source code is given what Y already knows, it does not give Y evidence that it didn’t already have.
You’re right, the model assumes that we live in a universe such that superintelligent AIs would “naturally” have enough evidence to infer the source code of other AIs. (That seems quite plausible, although by no means certain, to me.) Also, since this is a thread about the relative merits of CDT, I should point out that there are some games in which CDT seems to win relative to TDT or UDT, which is a puzzle that is still open.
Also, since this is a thread about the relative merits of CDT, I should point out that there are some games in which CDT seems to win relative to TDT or UDT, which is a puzzle that is still open.
It’s an interesting problem, but my impression when reading was somewhat similar to that of Eliezer in the replies. At the core it is the question of “How do you deal with constructs made by other agents?” I don’t think TDT has any particular weakness there.
If Y is sufficiently powerful, it may be able to deduce the laws of physics and the initial conditions of the universe, and then obtain X’s source code by simulating the universe up to when X is created.
Quantum mechanics seems to be pretty clear that true random number generators are available, and probably happen naturally. I don’t understand why you consider that scenario probable enough to be worth talking about.
It’s hard to see how an AGI can get very far without being able to deal with logical/mathematical uncertainty.
Do you have an intuition as to how it would do this without contradicting itself? I tried to ask a similar question but got it wrong in the first draft and afaict did not receive an answer to the relevant part.
I just want to know if my own intuition fails in the obvious way.
Besides, CDT is not well defined enough that you can implement it even if you wanted to. I think if you were forced to implement a “good enough” decision theory and hope for the best, you’d pick UDT at this point.
Really? That’s surprising. My assumption had been that CDT would be much simpler to implement—but just give undesirable outcomes in whole classes of circumstance.
CDT uses a “causal probability function” to evaluate the expected utilities of various choices, where this causal probability function is different from the epistemic probability function you use to update beliefs. (In EDT they are one and the same.) There is no agreement amongst CDT theorists how to formulate this function, and I’m not aware of any specific proposal that can be straightforwardly implemented. For more details see James Joyce’s The foundations of causal decision theory.
There is no agreement amongst CDT theorists how to formulate this function, and I’m not aware of any specific proposal that can be straightforwardly implemented.
I understand AIXI reasonably well and had assumed it was a specific implementation of CDT, perhaps with some tweaks so the reward values are generated internally instead of being observed in the environment. Perhaps AIXI isn’t close to an implementation of CDT, perhaps it’s perceived as not specific or straightforward enough, or perhaps it’s not counted as an implementation. Why isn’t AIXI a counterexample?
You may be right that AIXI can be thought of as an instance of CDT. Hutter himself cites “sequential decision theory” from a 1957 paper which certainly predates CDT, but CDT is general enough that SDT could probably fit into its formalism. (Like EDT can be considered an instance of CDT with the causal probability function set to be the same as the epistemic probability function.) I guess I hadn’t considered AIXI as a serious candidate due to its other major problems.
The first one is the claim that AIXI wouldn’t have a proper understanding of its body because its thoughts are defined mathematically. This is just wrong, IMO; my refutation, for a machine that’s similar enough to AIXI for this issue to work the same, is here. Nobody has engaged me in serious conversation about that, so I don’t know how well it will stand up. (If I’m right on this, then I’ve seen Eliezer, Tim Tyler, and you make the same error. What other false consensuses do we have?)
The second one is fixed if we do the tweak I mentioned in the grandparent of this comment.
If you take the fix described above for the second one, what’s left of the third one is the claim that instantaneous human (or AI) experience is too nuanced to fit in a single cell of a Turing machine. According to the original paper, page 8, the symbols on the reward tape are drawn from an alphabet R of arbitrary but fixed size. All you need is a very large alphabet and this one goes away.
I agree with the facts asserted in Tyler’s fourth problem, but I do not agree that it is a problem. He’s saying that Kolmogorov complexity is ill-defined because the programming language used is undefined. I agree that rational agents might disagree on priors because they’re using different programming languages to represent their explanations. In general, a problem may have multiple solutions. Practical solutions to the problems we’re faced with will require making indefensible arbitrary choices of one potential solution over another. Picking the programming language for priors is going to be one of those choices.
The first one is the claim that AIXI wouldn’t have a proper understanding of its body because its thoughts are defined mathematically. This is just wrong, IMO; my refutation, for a machine that’s similar enough to AIXI for this issue to work the same, is here.
I don’t see how your refutation applies to AIXI. Let me just try to explain in detail why I think AIXI will not properly protect its body. Consider an AIXI that arises in a simple universe, i.e., one computed by a short program P. AIXI has a probability distribution not over universes, but instead over environments, where an environment is a TM whose output tape is AIXI’s input tape and whose input tape is AIXI’s output tape. What’s the simplest environment that fits AIXI’s past inputs/outputs? Presumably it’s E = P plus some additional pieces of code that inject E’s inputs into where AIXI’s physical output ports are located in the universe (that is, override the universe’s natural evolution using E’s inputs), and extract E’s outputs from where AIXI’s physical input ports are located.
What happens when AIXI considers an action that destroys its physical body in the universe computed by P? As long as the input/output ports are not also destroyed, AIXI would expect that the environment E (with its “supernatural” injection/extraction code) will continue to receive its outputs and provide it with inputs.
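For reference, AIXI’s action rule (roughly in Hutter’s notation, where m is the horizon and ℓ(q) the length of environment program q) makes this requirement visible in the formula itself:

```latex
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \; \max_{a_m} \sum_{o_m r_m}
  \bigl[\, r_k + \cdots + r_m \,\bigr]
  \sum_{\,q \,:\, q(a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The condition under the rightmost sum requires q to output the percept sequence when fed the action sequence; a program that simply computes the universe, with no code to route AIXI’s actions in and percepts out, fails that condition and drops out of the mixture.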
Consider an AIXI that arises in a simple universe, i.e., one computed by a short program P.
An implementation of AIXI would be fairly complex. If P is too simple, then AIXI could not really have a body in the universe, so it would be correct in guessing that some irregularity in the laws of physics was causing its behaviors to be spliced into the behavior of the world.
However, if AIXI has observed enough of the inner workings of other similar machines, or enough of the laws of physics in general, or enough of its own inner workings, the simplest model will be that AIXI’s outputs really do emerge from the laws of physics in the real universe, since we are assuming that that is indeed the case and that Kolmogorov induction eventually works. At that point, imagining that AIXI’s behaviors are a consequence of a bunch of exceptions to the laws of physics is just extra complexity and won’t be part of the simplest hypothesis. It will be part of some less likely hypotheses, and the AI would have to take that risk into account when deciding whether to self-improve.
Tim, I think you’re probably not getting my point about the distinction between our concept of a computable universe, and AIXI’s formal concept of a computable environment. AIXI requires that the environment be a TM whose inputs match AIXI’s past outputs and whose outputs match AIXI’s past inputs. A candidate environment must have the additional code to inject/extract those inputs/outputs and place them on the input/output tapes, or AIXI will exclude it from its expected utility calculations.
The candidate environment must have the additional code to inject/extract those inputs/outputs and place them on the input/output tapes, or AIXI will exclude it from its expected utility calculations.
I agree that the candidate environment will need to have code to handle the inputs. However, if the candidate environment can compute the outputs on its own, without needing to be given the AI’s outputs, the candidate environment does not need code to inject the AI’s outputs into it.
Even if the AI can only partially predict its own behavior based on the behavior of the hardware it observes in the world, it can use that information to more efficiently encode its outputs in the candidate environment, so it can have some understanding of its position in the world even without being able to perfectly predict its own behavior from first principles.
If the AI manages to destroy itself, it will expect its outputs to be disconnected from the world and have no consequences, since anything else would violate its expectations about the laws of physics.
This back-and-forth appears to be useless. I should probably do some Python experiments and we then can change this from a debate to a programming problem, which would be much more pleasant.
However, if the candidate environment can compute the outputs on its own, without needing to be given the AI’s outputs, the candidate environment does not need code to inject the AI’s outputs into it.
If a candidate environment has no special code to inject AIXI’s outputs, then when AIXI computes expected utilities, it will find that all actions have equal utility in that environment, so that environment will play no role in its decisions.
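The point can be made concrete with a toy expected-utility sum over two candidate environments (all numbers and environments here are made up for illustration): an environment whose output ignores the agent’s action adds the same constant to every action’s score, so it never changes the argmax.

```python
ACTIONS = ["a1", "a2"]

def env_injects_outputs(action):
    # This environment reads the agent's action, so utility varies with it.
    return {"a1": 5.0, "a2": 3.0}[action]

def env_ignores_outputs(action):
    # This environment computes everything on its own; the agent's
    # action never reaches it, so every action scores the same.
    return 7.0

PRIOR = [(0.6, env_injects_outputs), (0.4, env_ignores_outputs)]

def expected_utility(action):
    return sum(p * env(action) for p, env in PRIOR)

best = max(ACTIONS, key=expected_utility)
print(best)  # the action-insensitive environment never affects the choice
```

Dropping the second environment entirely would leave the decision unchanged, which is the sense in which it “plays no role”.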
I should probably do some Python experiments and we then can change this from a debate to a programming problem, which would be much more pleasant.
Ok, but try not to destroy the world while you’re at it. :) Also, please take a closer look at UDT first. Again, I think there’s a strong possibility that you’ll end up thinking “why did I waste my time defending CDT/AIXI?”
FYI, generating reward values internally—instead of them being observed in the environment—makes no difference whatsoever to the wirehead problem.
AIXI digging into its brains with its own mining claws is quite plausible. It won’t reason as you suggest—since it has no idea that it is instantiated in the real world. So, its exploratory mining claws may plunge in. Hopefully it will get suitably negatively reinforced for that—though much will depend on which part of its brain it causes damage to. It could find that ripping out its own inhibition circuits is very rewarding.
A larger set of symbols for rewards makes no difference—since the reward signal is a scalar. If you compare with an animal, that has millions of pain sensors that operate in parallel. The animal is onto something there—something to do with a-priori knowledge about the common causes of pain. Having lots of pain sensors has positive aspects—e.g. it saves you experimenting to figure out what hurts.
As for the reference machine issue, I do say: “This problem is also not very serious.”
Not very serious unless you are making claims about your agent being “the most intelligent unbiased agent possible”. Then this kind of thing starts to make a difference...
A larger set of symbols for rewards makes no difference—since the reward signal is a scalar. If you compare with an animal, that has millions of pain sensors that operate in parallel. The animal is onto something there—something to do with a-priori knowledge about the common causes of pain. Having lots of pain sensors has positive aspects—e.g. it saves you experimenting to figure out what hurts.
You can encode 16 64 bit integers in a 1024 bit integer. The scalar/parallel distinction is bogus.
(Edit: I originally wrote “5 32 bit integers” when I meant “2**5 32 bit integers”. Changed to “16 64 bit integers” because “32 32 bit integers” looked too much like a typo.)
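The encoding Tim describes is straightforward to sketch (field widths are illustrative):

```python
# Pack 16 independent 64-bit reward channels into one 1024-bit symbol.
FIELD_BITS, N_FIELDS = 64, 16

def pack(values):
    assert len(values) == N_FIELDS
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << FIELD_BITS)  # each value fits its field
        word |= v << (i * FIELD_BITS)
    return word

def unpack(word):
    mask = (1 << FIELD_BITS) - 1
    return [(word >> (i * FIELD_BITS)) & mask for i in range(N_FIELDS)]

channels = list(range(16))
assert unpack(pack(channels)) == channels     # lossless round trip
assert pack(channels).bit_length() <= 1024    # fits in one 1024-bit symbol
```

As an encoding this works fine; whether the decision algorithm actually treats the channels independently is a separate question, which is where the objection below comes in.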
Not very serious unless you are making claims about your agent being “the most intelligent unbiased agent possible”. Then this kind of thing starts to make a difference...
Strawman argument. The only claim made is that it’s the most intelligent up to a constant factor, and a bunch of other conditions are thrown in. When Hutter’s involved, you can bet that some of the constant factors are large compared to the size of the universe.
You can encode 5 32 bit integers in a 1024 bit integer. The scalar/parallel distinction is bogus.
Er, not if you are adding the rewards together and maximising the results, you can’t! That is exactly what happens to the rewards used by AIXI.
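Tyler’s objection can be demonstrated directly: once the packed values are summed and maximised as plain scalars, as AIXI does with its rewards, the channel structure is destroyed (64-bit fields are illustrative).

```python
FIELD_BITS = 64

lo_heavy = 9                       # reward 9 in channel 0, zero elsewhere
hi_tiny = 1 << (15 * FIELD_BITS)   # reward 1 in channel 15, zero elsewhere

# Read as one big integer, the top channel outweighs the bottom one by
# 2**960, so a scalar maximiser prefers hi_tiny regardless of content:
assert hi_tiny > lo_heavy

# And summation lets carries leak between channels: two rewards of
# 2**63 in channel 0 add up into channel 1.
total = (1 << 63) + (1 << 63)
assert total >> FIELD_BITS == 1
assert total & ((1 << FIELD_BITS) - 1) == 0
```
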
Not very serious unless you are making claims about your agent being “the most intelligent unbiased agent possible”. Then this kind of thing starts to make a difference...
Strawman argument. The only claim made is that it’s the most intelligent up to a constant factor, and a bunch of other conditions are thrown in.
Actually Hutter says this sort of thing all over the place (I was quoting him above) - and it seems pretty irritating and misleading to me. I’m not saying the claims he makes in the fine print are wrong, but rather that the marketing headlines are misleading.
You can encode 5 32 bit integers in a 1024 bit integer. The scalar/parallel distinction is bogus.
Er, not if you are adding the rewards together and maximising the results, you can’t! That is exactly what happens to the rewards used by AIXI.
You’re right there, I’m confusing AIXI with another design I’ve been working with in a similar idiom. For AIXI to work, you have to combine together all the environmental stuff and compute a utility, make the code for doing the combining part of the environment (not the AI), and then use that resulting utility as the input to AIXI.
For more details see James Joyce’s The foundations of causal decision theory.
Thankyou for the reference, and the explanation.
I am prompted to ask myself a question analogous to the one Eliezer recently asked:
Doesn’t have a name as far as I know. But I’m not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch? Maybe, now that I think about it, but even so, does it matter?
Is it worth my while exploring the details of CDT formalization beyond just the page you linked to? There seems to be some advantage to understanding the details and conventions of how such concepts are described. At the same time, revising CDT thinking in too much detail may eliminate some entirely justifiable confusion as to why anyone would think it is a good idea! “Causal Expected Utility”? “Causal Tendencies”? What the? I only care about what will get me the best outcome!
Is it worth my while exploring the details of CDT formalization beyond just the page you linked to?
Probably not. I only learned it by accident myself. I had come up with a proto-UDT that was motivated purely by anthropic reasoning paradoxes (as opposed to Newcomb-type problems like CDT and TDT), and wanted to learn how existing decision theories were formalized so I could do something similar. James Joyce’s book was the most prominent such book available at the time.
ETA: Sorry, I think the above is probably not entirely clear or helpful. It’s a bit hard for me to put myself in your position and try to figure out what may or may not be worthwhile for you. The fact is that Joyce’s book is the decision theory book I read, and quite possibly it influenced me more than I realize, or is more useful for understanding the motivation for or the formulation of UDT than I think. It couldn’t hurt to grab a copy of it and read a few chapters to see how useful it is to you.
Thanks for the edit/update. For reference it may be worthwhile to make such additions as a new comment, either as a reply to yourself or the parent. It was only by chance that I spotted the new part!
But I’m not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch? Maybe, now that I think about it, but even so, does it matter?
Yes, for reasons of game theory and of practical singularity strategy.
Game theory, because things in Everett branches that are ‘closest’ to us might be the ones it’s most important to be able to interact with, since they’re easier to simulate and their preferences are more likely to have interesting overlap with ours. Knowing very roughly what to expect from our neighbors is useful.
And singularity strategy, because if you can show that architectures like AIXI-tl have some non-negligible chance of converging to whatever an FAI would have converged to, as far as actual policies go, then that is a very important thing to know; especially if a non-uFAI existential risk starts to look imminent (but the game theory in that case is crazy). It is not probable but there’s a hell of a lot of structural uncertainty and Omohundro’s AI drives are still pretty informal. I am still not absolutely sure I know how a self-modifying superintelligence would interpret or reflect on its utility function or terms therein (or how it would reflect on its implicit policy for interpreting or reflecting on utility functions or terms therein). The apparent rigidity of Goedel machines might constitute a disproof in theory (though I’m not sure about that), but when some of the terms are sequences of letters like “makeHumansHappy” or formally manipulable correlated markers of human happiness, then I don’t know how the syntax gets turned into semantics (or fails entirely to get turned into semantics, as the case may well be).
But it will calculate that expected value using CDT!expectation, meaning that it won’t see how self-modifying to be a timeless decision theorist could possibly affect what’s already in the box, etcetera.
This implies that the actually-implemented-CDT agent has a single level of abstraction/granularity at like the naive realist physical level at which it’s proving things about causal relationships. Like, it can’t/shouldn’t prove causal relationships at the level of string theory, and yet it’s still confident that its actions are causing things despite that structural uncertainty, and yet despite the symmetry it for some reason cannot possibly see how switching a few transistors or changing its decision policy might affect things via relationships that are ultimately causal but currently unknown for reasons of boundedness and not speculative metaphysics. It’s plausible, but I think letting a universal hypothesis space or maybe even just Goedelian limitations enter the decision calculus at any point is going to make such rigidity unlikely. (This is related to how a non-hypercomputation-driven decision theory in general might reason about the possibility of hypercomputation, or the risk of self-diagonalization, I think.)
But it will calculate that expected value using CDT!expectation, meaning that it won’t see how self-modifying to be a timeless decision theorist could possibly affect what’s already in the box, etcetera.
The CDT agent is making a decision about whether to self-modify even before it meets the alien, based on its expectation of meeting the alien. How does CDT!expectation differ from Eliezer!expectation before we meet the alien?
It is useful to separate in one’s mind the difference between on one hand being able to One Box and cooperate in PD with agents that you know well (shared source code) and on the other hand not firing on Baby Eaters after they have already chosen not to fire on you. This is especially the case when first grappling the subject. (Could you confirm, by the way, that Akon’s decision in that particular paragraph or two is approximately what TDT would suggest?)
The above is particularly relevant because the “have access to each other’s source code” is such a useful intuition pump when grappling with or explaining the solutions to many of the relevant decision problems. It is useful to be able to draw a line on just how far the source code metaphor can take you.
There is also something distasteful about making comparisons to a decision theory that isn’t even implicitly stable under self-modification. A CDT agent will change to CDT++ unless there is an additional flaw in the agent beyond the poor decision making strategy. If I create a CDT agent, give it time to think, and then give it Newcomb’s problem, it will One Box (and also no longer be a CDT agent). It is the errors in the agent that still remain after that time that need TDT or UDT to fix.
But it will calculate that expected value using CDT!expectation, meaning that it won’t see how self-modifying to be a timeless decision theorist could possibly affect what’s already in the box, etcetera.
*nod* This is just the ‘new rules starting now’ option. What the CDT agent does when it wakes up in an empty, boring room and does some introspection.
Surely the important thing is that it will self-modify to whatever decision theory has the best consequences?
The new algorithm will not exactly be TDT, because it won’t try to change decisions that have already been made the way TDT does. In particular this means that there’s no risk from Roko’s basilisk.
Disclaimer: I’m not very confident of anything I say about decision theory.
Eliezer says elsewhere that current decision theory doesn’t let us prove a self-modifying AI would choose to keep the goals we program into it. He wants to develop a proof before even starting work on the AI.
It’s easy to contrive situations where a self-modifying AI would choose not to keep the goals programmed into it, even without precommitment issues. Just contrive the circumstances so it gets paid to change. Unless there’s something wrong with the argument there, TDT etc. won’t be enough to ensure that the goals are kept.
...
...
It seems to me that if all that is true, and you want to build a Friendly AI, then the rational thing to do here is build it and let it solve all problems like these. That way, you win, at least in the time-management sense. Well, you might lose if you encountered Omega before the FAI was up and running, but that seems unlikely. Am I missing something here?
It will also have to precommit to mere humans who can’t read its source code and can’t predict the future, so solving the problem in the case where you meet Omega doesn’t solve the problem in general.
Causal decision theorists don’t self-modify to timeless decision theorists. If you get the decision theory wrong, you can’t rely on it repairing itself.
You said:
but you also said:
I can envision several possibilities:
Perhaps you changed your mind and presently disagree with one of the above two statements.
Perhaps you didn’t mean a causal AI in the second quote. In that case I have no idea what you meant.
Perhaps Newcomb’s problem is the wrong example, and there’s some other example motivating TDT that a self-modifying causal agent would deal with incorrectly.
Perhaps you have a model of causal decision theory that makes self-modification impossible in principle. That would make your first statement above true, in a useless sort of way, so I hope you didn’t mean that.
Would you like to clarify?
Causal decision theorists self-modify to one-box on Newcomb’s Problem with Omegas that looked at their source code after the self-modification took place; i.e., if the causal decision theorist self-modifies at 7am, it will self-modify to one-box with Omegas that looked at the code after 7am and two-box otherwise. This is not only ugly but also has worse implications for e.g. meeting an alien AI who wants to cooperate with you, or worse, an alien AI that is trying to blackmail you.
Bad decision theories don’t necessarily self-repair correctly.
And in general, every time you throw up your hands in the air and say, “I don’t know how to solve this problem, nor do I understand the exact structure of the calculation my computer program will perform in the course of solving this problem, nor can I state a mathematically precise meta-question, but I’m going to rely on the AI solving it for me ’cause it’s supposed to be super-smart,” you may very possibly be about to screw up really damned hard. I mean, that’s what Eliezer-1999 thought you could say about “morality”.
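A tiny sketch of the time-stamped ugliness in the 7am example (Python; the scan-time parameter and cutoff are made up for illustration, this is not a real agent):

```python
# A CDT agent that self-modifies at t = 7 will one-box only toward
# predictors that read its code *after* the modification, because only
# those predictions are causally downstream of the change.  Predictions
# made before t = 7 are already causally fixed, so it still two-boxes.
SELF_MOD_TIME = 7

def son_of_cdt_action(omega_scan_time):
    # omega_scan_time: when Omega looked at the agent's source code.
    return "one-box" if omega_scan_time > SELF_MOD_TIME else "two-box"

print(son_of_cdt_action(9))  # one-box: Omega scanned after the self-mod
print(son_of_cdt_action(5))  # two-box: the prediction predates the self-mod
```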
Okay, thanks for confirming that Newcomb’s problem is a relevant motivating example here.
I’m not saying that. I’m saying that self-modification solves the problem, assuming the CDT agent moves first, and that it seems simple enough that we can check that a not-very-smart AI solves it correctly on toy examples. If I get around to attempting that, I’ll post to LessWrong.
Assuming the CDT agent moves first seems reasonable. I have no clue whether or when Omega is going to show up, so I feel no need to second-guess the AI about that schedule.
(Quoting out of order)
As you know, we can define a causal decision theory agent in one line of math. I don’t know a way to do that for TDT. Do you? If TDT could be concisely described, I’d agree that it’s the less ugly alternative.
I’m failing to suspend disbelief here. Do you have motivating examples for TDT that seem likely to happen before Kurzweil’s schedule for the Singularity causes us to either win or lose the game?
If you appreciate simplicity/elegance, I suggest looking into UDT. UDT says that when you’re making a choice, you’re deciding the output of a particular computation, and the consequences of any given choice are just the logical consequences of that computation having that output.
CDT in contrast doesn’t answer the question “what am I actually deciding when I make a decision?” nor does it answer “what are the consequences of any particular choice?” even in principle. CDT can only be described in one line of math because the answer to the latter question has to be provided to it via an external parameter.
Thanks, I’ll have a look at UDT.
I certainly agree there.
Maybe this one: “Argmax[A in Actions] in Sum[O in Outcomes] (Utility(O) * P(this computation yields A []-> O | rest of universe))”
From this post.
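A toy numerical sketch of how that argmax comes apart from CDT’s on Newcomb’s Problem (Python; the 99% accuracy figure and the prior are illustrative numbers, not from the post):

```python
# Toy Newcomb's problem: box A always holds $1K; box B holds $1M iff
# Omega predicted one-boxing, and Omega is 99% accurate.

ACTIONS = ["one-box", "two-box"]

def payoff(action, box_b_full):
    base = 1_000_000 if box_b_full else 0
    return base + (1_000 if action == "two-box" else 0)

def tdt_expected_utility(action):
    # TDT-style: P(box B is full | this computation yields `action`)
    # tracks Omega's 99% prediction accuracy.
    p_full = 0.99 if action == "one-box" else 0.01
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)

def cdt_expected_utility(action, p_full_prior=0.5):
    # CDT-style: the action cannot causally affect the already-filled box,
    # so P(box B is full) is the same fixed prior whatever the action.
    return (p_full_prior * payoff(action, True)
            + (1 - p_full_prior) * payoff(action, False))

tdt_choice = max(ACTIONS, key=tdt_expected_utility)
cdt_choice = max(ACTIONS, key=cdt_expected_utility)
print(tdt_choice, cdt_choice)  # one-box two-box
```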
I’m reasonably sure Eliezer meant implications for the would-be friendly AI meeting alien AIs. That could happen at any time in the remaining life span of the universe.
Why not? A causal decision theorist can have an accurate abstract understanding of both TDT and CDT and can calculate the expected utility of applying either. If TDT produces a better expected outcome in general then it seems like self modifying to become a TDT agent is the correct decision to make. Is there some restriction or injunction assumed to be in place with respect to decision algorithm implementation?
Thinking about it for a few minutes: It would seem that the CDT agent will reliably update away from CDT but that the new algorithm will be neither CDT nor TDT (and not UDT either). It will be able to cooperate with agents when there has been some sort of causal entanglement between the modified source code and the other agent but not able to cooperate with complete strangers. The resultant decision algorithm is enough of an attractor that it deserves a name of its own. Does it have one?
Doesn’t have a name as far as I know. But I’m not sure it deserves one; would CDT really be a probable output anywhere besides a verbal theory advocated by human philosophers in our own Everett branch? Maybe, now that I think about it, but even so, does it matter?
But it will calculate that expected value using CDT!expectation, meaning that it won’t see how self-modifying to be a timeless decision theorist could possibly affect what’s already in the box, etcetera.
Yes, because there are lemmas you can prove about (some) decision theory problems which imply that CDT and UDT give the same output. For example, CDT works if there exists a total ordering over inputs given to the strategy, common to all execution histories, such that the world program invokes the strategy only with increasing, non-repeating inputs on that ordering. There are (relatively) easy algorithms for these cases. Using CDT in general is then a matter of applying a theorem when one of its preconditions doesn’t hold, which is one of the most common math mistakes ever.
Is that really so bad, if it takes the state of the world at the point before it self-modifies as an unchangeable given, and self-modifies to a decision theory that only considers states from that point on as changeable by its decision theory? For one thing, doesn’t that avoid Roko’s basilisk?
If you do that, you’d be vulnerable to extortion from any other AIs that happen to be created earlier in time and can prove their source code.
I’m inclined to think that in most scenarios the first AGI wins anyway. And leaving solving decision theory to the AGI could mean you get to build it earlier.
I was thinking of meeting alien AIs, post-Singularity.
Huh? I thought we were supposed to be the good guys here? ;-)
But seriously, “sacrifice safety for speed” is the “defect” option in the game of “let’s build AGI”. I’m not sure how to get the C/C outcome (or rather C/C/C/...), but it seems too early to start talking about defecting already.
Besides, CDT is not well defined enough that you can implement it even if you wanted to. I think if you were forced to implement a “good enough” decision theory and hope for the best, you’d pick UDT at this point. (UDT is also missing a big chunk from its specifications, namely the “math intuition module” but I think that problem has to be solved anyway. It’s hard to see how an AGI can get very far without being able to deal with logical/mathematical uncertainty.)
What pre-singularity actions are you worried about them taking?
What I was thinking was that a CDT-seeded AI might actually be safer precisely because it won’t try to change pre-Singularity events, and if it’s first the new decision theory will be in place in time for any post-Singularity events.
That’s surprising to me—what should I read in order to understand this point better? EDIT: strike that, you answer that above.
They could modify themselves so that if they ever encounter a CDT-descended AI they’ll start a war (even if it means mutual destruction) unless the CDT-descended AI gives them 99% of its resources.
They could also modify themselves to make the analogous threat if they encounter a UDT-descended AI, or a descendant of an AI designed by Tim Freeman, or a descendant of an AI designed by Wei Dai, or a descendant of an AI designed using ideas mentioned on LessWrong. I would hope that any of those AIs would hand over 99% of their resources if the extortionist could prove its source code and prove that war would be worse. I assume you’re saying that CDT is special in this regard. How is it special?
(Thanks for the pointer to the James Joyce book, I’ll have a look at it.)
If the alien AI computes the expected utility of “provably modify myself to start a war against CDT-AI unless it gives me 99% of its resources”, it’s certain to get a high value, whereas if it computes the expected utility of “provably modify myself to start a war against UDT-AI unless it gives me 99% of its resources” it might possibly get a low value (not sure because UDT isn’t fully specified), because the UDT-AI, when choosing what to do when faced with this kind of threat, would take into account the logical correlation between its decision and the alien AI’s prediction of its decision.
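A minimal sketch of that expected-utility comparison (Python; all payoffs are invented, and the UDT branch simply hard-codes the refusal that the logical-correlation argument is claimed to produce):

```python
# The extortionist either provably commits to "war unless paid 99% of
# your resources", or makes no threat.  War is mutually destructive.

def victim_gives_in(victim_kind):
    if victim_kind == "CDT":
        # The threat is already causally fixed by the time CDT decides:
        # paying leaves it 1 resource, war leaves it 0, so CDT pays.
        return True
    if victim_kind == "UDT":
        # UDT treats its choice as the very thing the extortionist
        # predicted when deciding whether to threaten, so it refuses.
        return False
    raise ValueError(victim_kind)

def extortionist_ev(victim_kind):
    # Extortionist payoffs (made up): 99 if paid, -50 if war, 0 if no threat.
    ev_threaten = 99 if victim_gives_in(victim_kind) else -50
    ev_no_threat = 0
    return max(ev_threaten, ev_no_threat)  # threatens only when profitable

print(extortionist_ev("CDT"), extortionist_ev("UDT"))  # 99 0
```

On these numbers, committing to extort the CDT victim is worth 99, while committing to extort the UDT victim is worse than never threatening at all.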
Well, that’s plausible. I’ll have to work through some UDT examples to understand fully.
What model do you have of how entity X can prove to entity Y that X is running specific source code?
The proof that I can imagine is entity Y gives some secure hardware Z to X, and then X allows Z to observe the process of X self-modifying to run the specified source code, and then X gives the secure hardware back to Y. Both X and Y can observe the creation of Z, so Y can know that it’s secure and X can know that it’s a passive observer rather than a bomb or something.
This model breaks the scenario, since a CDT agent playing the role of Y could self-modify any time before it hands over Z and play the game competently.
Now, if there’s some way for X to create proofs of X’s source code that will be convincing to Y without giving advance notice to Y, I can imagine a problem for Y here. Does anyone know how to do that?
(I acknowledge that if nobody knows how to do that, that means we don’t know how to do that, not that it can’t be done.)
Hmm, this explains my aversion to knowing the details of what other people are thinking. It can put me at a disadvantage in negotiations unless I am able to lie convincingly and say I do not know.
I think I’ll stop here for now, because you already seem intrigued enough to want to learn about UDT in detail. I’m guessing that once you do, you won’t be so motivated to think up reasons why CDT isn’t really so bad. :) Let me know if that turns out not to be the case though.
On second thought, I should answer this question because it’s of independent interest. If Y is sufficiently powerful, it may be able to deduce the laws of physics and the initial conditions of the universe, and then obtain X’s source code by simulating the universe up to when X is created. Note that Y may do this not because it wants to know X’s source code in some anthropomorphic sense, but simply due to how its decision-making algorithm works.
Unless some specific assumptions have been made about the universe, that will not work. Simulating the entire universe does not tell Y which part of the universe it inhabits. It will give Y a set of possible parts of the universe which match Y’s observations. While the simulation strategy will allow the best possible prediction about what X’s source code is given what Y already knows, it does not give Y evidence that it didn’t already have.
You’re right, the model assumes that we live in a universe such that superintelligent AIs would “naturally” have enough evidence to infer the source code of other AIs. (That seems quite plausible, although by no means certain, to me.) Also, since this is a thread about the relative merits of CDT, I should point out that there are some games in which CDT seems to win relative to TDT or UDT, which is a puzzle that is still open.
It’s an interesting problem, but my impression when reading was somewhat similar to that of Eliezer in the replies. At the core it is the question of “How do you deal with constructs made by other agents?” I don’t think TDT has any particular weakness there.
Quantum mechanics seems to be pretty clear that true random number generators are available, and probably happen naturally. I don’t understand why you consider that scenario probable enough to be worth talking about.
Do you have an intuition as to how it would do this without contradicting itself? I tried to ask a similar question but got it wrong in the first draft and afaict did not receive an answer to the relevant part.
I just want to know if my own intuition fails in the obvious way.
Really? That’s surprising. My assumption had been that CDT would be much simpler to implement—but just give undesirable outcomes in whole classes of circumstance.
CDT uses a “causal probability function” to evaluate the expected utilities of various choices, where this causal probability function is different from the epistemic probability function you use to update beliefs. (In EDT they are one and the same.) There is no agreement amongst CDT theorists how to formulate this function, and I’m not aware of any specific proposal that can be straightforwardly implemented. For more details see James Joyce’s The foundations of causal decision theory.
I understand AIXI reasonably well and had assumed it was a specific implementation of CDT, perhaps with some tweaks so the reward values are generated internally instead of being observed in the environment. Perhaps AIXI isn’t close to an implementation of CDT, perhaps it’s perceived as not specific or straightforward enough, or perhaps it’s not counted as an implementation. Why isn’t AIXI a counterexample?
You may be right that AIXI can be thought of as an instance of CDT. Hutter himself cites “sequential decision theory” from a 1957 paper which certainly predates CDT, but CDT is general enough that SDT could probably fit into its formalism. (Like EDT can be considered an instance of CDT with the causal probability function set to be the same as the epistemic probability function.) I guess I hadn’t considered AIXI as a serious candidate due to its other major problems.
Four problems are listed there.
The first one is the claim that AIXI wouldn’t have a proper understanding of its body because its thoughts are defined mathematically. This is just wrong, IMO; my refutation, for a machine that’s similar enough to AIXI for this issue to work the same, is here. Nobody has engaged me in serious conversation about that, so I don’t know how well it will stand up. (If I’m right on this, then I’ve seen Eliezer, Tim Tyler, and you make the same error. What other false consensuses do we have?)
The second one is fixed if we do the tweak I mentioned in the grandparent of this comment.
If you take the fix described above for the second one, what’s left of the third one is the claim that instantaneous human (or AI) experience is too nuanced to fit in a single cell of a Turing machine. According to the original paper, page 8, the symbols on the reward tape are drawn from an alphabet R of arbitrary but fixed size. All you need is a very large alphabet and this one goes away.
I agree with the facts asserted in Tyler’s fourth problem, but I do not agree that it is a problem. He’s saying that Kolmogorov complexity is ill-defined because the programming language used is undefined. I agree that rational agents might disagree on priors because they’re using different programming languages to represent their explanations. In general, a problem may have multiple solutions. Practical solutions to the problems we’re faced with will require making indefensible arbitrary choices of one potential solution over another. Picking the programming language for priors is going to be one of those choices.
I don’t see how your refutation applies to AIXI. Let me just try to explain in detail why I think AIXI will not properly protect its body. Consider an AIXI that arises in a simple universe, i.e., one computed by a short program P. AIXI has a probability distribution not over universes, but instead over environments where an environment is a TM whose output tape is AIXI’s input tape and whose input tape is AIXI’s output tape. What’s the simplest environment that fits AIXI’s past inputs/outputs? Presumably it’s E = P plus some additional pieces of code that injects E’s inputs into where AIXI’s physical output ports are located in the universe (that is, overrides the universe’s natural evolution using E’s inputs), and extracts E’s outputs from where AIXI’s physical input ports are located.
What happens when AIXI considers an action that destroys its physical body in the universe computed by P? As long as the input/output ports are not also destroyed, AIXI would expect that the environment E (with its “supernatural” injection/extraction code) will continue to receive its outputs and provide it with inputs.
Does that make sense?
(Responding out of order)
Yes, but it makes some unreasonable assumptions.
An implementation of AIXI would be fairly complex. If P is too simple, then AIXI could not really have a body in the universe, so it would be correct in guessing that some irregularity in the laws of physics was causing its behaviors to be spliced into the behavior of the world.
However, if AIXI has observed enough of the inner workings of other similar machines, or enough of the laws of physics in general, or enough of its own inner workings, the simplest model will be that AIXI’s outputs really do emerge from the laws of physics in the real universe, since we are assuming that that is indeed the case and that Solomonoff induction eventually works. At that point, imagining that AIXI’s behaviors are a consequence of a bunch of exceptions to the laws of physics is just extra complexity and won’t be part of the simplest hypothesis. It will be part of some less likely hypotheses, and the AI would have to take that risk into account when deciding whether to self-improve.
Tim, I think you’re probably not getting my point about the distinction between our concept of a computable universe, and AIXI’s formal concept of a computable environment. AIXI requires that the environment be a TM whose inputs match AIXI’s past outputs and whose outputs match AIXI’s past inputs. A candidate environment must have the additional code to inject/extract those inputs/outputs and place them on the input/output tapes, or AIXI will exclude it from its expected utility calculations.
I agree that the candidate environment will need to have code to handle the inputs. However, if the candidate environment can compute the outputs on its own, without needing to be given the AI’s outputs, the candidate environment does not need code to inject the AI’s outputs into it.
Even if the AI can only partially predict its own behavior based on the behavior of the hardware it observes in the world, it can use that information to more efficiently encode its outputs in the candidate environment, so it can have some understanding of its position in the world even without being able to perfectly predict its own behavior from first principles.
If the AI manages to destroy itself, it will expect its outputs to be disconnected from the world and have no consequences, since anything else would violate its expectations about the laws of physics.
This back-and-forth appears to be useless. I should probably do some Python experiments and we then can change this from a debate to a programming problem, which would be much more pleasant.
If a candidate environment has no special code to inject AIXI’s outputs, then when AIXI computes expected utilities, it will find that all actions have equal utility in that environment, so that environment will play no role in its decisions.
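Since Python experiments were proposed upthread, here is a minimal sketch of this point (not real AIXI; a two-environment mixture with made-up weights and rewards):

```python
# Expected reward of each action under a weighted mixture of candidate
# environments.  `env_ignores_output` pays the same reward no matter what
# the agent outputs, so it shifts every action's expected utility by the
# same amount and can never change the argmax.

def env_reads_output(action):      # reward depends on the action
    return {"a": 5.0, "b": 2.0}[action]

def env_ignores_output(action):    # reward independent of the action
    return 3.0

MIXTURE = [(0.4, env_reads_output), (0.6, env_ignores_output)]

def expected_reward(action):
    return sum(w * env(action) for w, env in MIXTURE)

best = max(["a", "b"], key=expected_reward)
print(best)  # "a" — only the output-reading environment drives the choice
```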
Ok, but try not to destroy the world while you’re at it. :) Also, please take a closer look at UDT first. Again, I think there’s a strong possibility that you’ll end up thinking “why did I waste my time defending CDT/AIXI?”
FYI, generating reward values internally—instead of them being observed in the environment—makes no difference whatsoever to the wirehead problem.
AIXI digging into its brains with its own mining claws is quite plausible. It won’t reason as you suggest—since it has no idea that it is instantiated in the real world. So, its exploratory mining claws may plunge in. Hopefully it will get suitably negatively reinforced for that—though much will depend on which part of its brain it causes damage to. It could find that ripping out its own inhibition circuits is very rewarding.
A larger set of symbols for rewards makes no difference—since the reward signal is a scalar. If you compare with an animal, that has millions of pain sensors that operate in parallel. The animal is onto something there—something to do with a-priori knowledge about the common causes of pain. Having lots of pain sensors has positive aspects—e.g. it saves you experimenting to figure out what hurts.
As for the reference machine issue, I do say: “This problem is also not very serious.”
Not very serious unless you are making claims about your agent being “the most intelligent unbiased agent possible”. Then this kind of thing starts to make a difference...
You can encode 16 64 bit integers in a 1024 bit integer. The scalar/parallel distinction is bogus.
(Edit: I originally wrote “5 32 bit integers” when I meant “2**5 32 bit integers”. Changed to “16 64 bit integers” because “32 32 bit integers” looked too much like a typo.)
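A minimal sketch of the lossless encoding being claimed (Python; helper names are mine):

```python
# 16 unsigned 64-bit integers fit losslessly into one 1024-bit integer,
# so a "scalar" reward symbol from a large enough alphabet can carry a
# whole vector of parallel reward channels.

def pack(values):
    # values: exactly 16 ints, each in [0, 2**64).
    assert len(values) == 16 and all(0 <= v < 2**64 for v in values)
    packed = 0
    for v in reversed(values):       # values[0] ends up in the low 64 bits
        packed = (packed << 64) | v
    return packed                    # a single integer < 2**1024

def unpack(packed):
    return [(packed >> (64 * i)) & (2**64 - 1) for i in range(16)]

channels = list(range(16))
assert unpack(pack(channels)) == channels  # round-trip is exact
```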
Strawman argument. The only claim made is that it’s the most intelligent up to a constant factor, and a bunch of other conditions are thrown in. When Hutter’s involved, you can bet that some of the constant factors are large compared to the size of the universe.
Er, not if you are adding the rewards together and maximising the results, you can’t! That is exactly what happens to the rewards used by AIXI.
Actually Hutter says this sort of thing all over the place (I was quoting him above) - and it seems pretty irritating and misleading to me. I’m not saying the claims he makes in the fine print are wrong, but rather that the marketing headlines are misleading.
You’re right there, I’m confusing AIXI with another design I’ve been working with in a similar idiom. For AIXI to work, you have to combine together all the environmental stuff and compute a utility, make the code for doing the combining part of the environment (not the AI), and then use that resulting utility as the input to AIXI.
Thank you for the reference, and the explanation.
I am prompted to ask myself a question analogous to the one Eliezer recently asked:
Is it worth my while exploring the details of CDT formalization beyond just the page you linked to? There seems to be some advantage to understanding the details and conventions of how such concepts are described. At the same time, reviewing CDT thinking in too much detail may eliminate some entirely justifiable confusion as to why anyone would think it is a good idea! “Causal Expected Utility”? “Causal Tendencies”? What the? I only care about what will get me the best outcome!
Probably not. I only learned it by accident myself. I had come up with a proto-UDT that was motivated purely by anthropic reasoning paradoxes (as opposed to Newcomb-type problems like CDT and TDT), and wanted to learn how existing decision theories were formalized so I could do something similar. James Joyce’s book was the most prominent such book available at the time.
ETA: Sorry, I think the above is probably not entirely clear or helpful. It’s a bit hard for me to put myself in your position and try to figure out what may or may not be worthwhile for you. The fact is that Joyce’s book is the decision theory book I read, and quite possibly it influenced me more than I realize, or is more useful for understanding the motivation for or the formulation of UDT than I think. It couldn’t hurt to grab a copy of it and read a few chapters to see how useful it is to you.
Thanks for the edit/update. For reference it may be worthwhile to make such additions as a new comment, either as a reply to yourself or the parent. It was only by chance that I spotted the new part!
What pre-singularity actions are you worried about them taking?
What I was thinking was that a CDT-seeded AI might actually be safer precisely because it won’t try to change pre-Singularity events, and if it’s first the new decision theory will be in place in time for any post-Singularity events.
That’s surprising to me—what should I read in order to understand this point better?
Yes, for reasons of game theory and of practical singularity strategy.
Game theory, because things in Everett branches that are ‘closest’ to us might be the ones it’s most important to be able to interact with, since they’re easier to simulate and their preferences are more likely to have interesting overlap with ours. Knowing very roughly what to expect from our neighbors is useful.
And singularity strategy, because if you can show that architectures like AIXI-tl have some non-negligible chance of converging to whatever an FAI would have converged to, as far as actual policies go, then that is a very important thing to know; especially if a non-uFAI existential risk starts to look imminent (but the game theory in that case is crazy). It is not probable but there’s a hell of a lot of structural uncertainty and Omohundro’s AI drives are still pretty informal. I am still not absolutely sure I know how a self-modifying superintelligence would interpret or reflect on its utility function or terms therein (or how it would reflect on its implicit policy for interpreting or reflecting on utility functions or terms therein). The apparent rigidity of Goedel machines might constitute a disproof in theory (though I’m not sure about that), but when some of the terms are sequences of letters like “makeHumansHappy” or formally manipulable correlated markers of human happiness, then I don’t know how the syntax gets turned into semantics (or fails entirely to get turned into semantics, as they case may well be).
This implies that the actually-implemented-CDT agent has a single level of abstraction/granularity, something like the naive-realist physical level, at which it’s proving things about causal relationships. It can’t/shouldn’t prove causal relationships at the level of string theory, and yet it’s still confident that its actions are causing things despite that structural uncertainty; and yet despite the symmetry it somehow cannot see how switching a few transistors or changing its decision policy might affect things via relationships that are ultimately causal but currently unknown for reasons of boundedness, not speculative metaphysics. It’s plausible, but I think letting a universal hypothesis space, or maybe even just Goedelian limitations, enter the decision calculus at any point is going to make such rigidity unlikely. (This is related to how a non-hypercomputation-driven decision theory in general might reason about the possibility of hypercomputation, or the risk of self-diagonalization, I think.)
The CDT is making a decision about whether to self-modify even before it meets the alien, based on its expectation of meeting the alien. How does CDT!expectation differ from Eliezer!expectation before we meet the alien?
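The calculation the CDT agent runs before meeting the alien can be sketched in a few lines. This is a toy model, not anyone’s actual proposal: the payoffs are the standard $1M/$1k Newcomb values, the predictor is assumed near-perfect, and `p_newcomb` (the prior probability of ever facing the problem) is a made-up free parameter.

```python
# Toy expected-value comparison for a CDT agent deciding, in advance,
# whether to self-modify into a one-boxing disposition.
# Standard Newcomb payoffs; p_newcomb is an assumed free parameter.

M = 1_000_000  # opaque box contents if the predictor expects one-boxing
K = 1_000      # transparent box contents

def expected_value(disposition: str, p_newcomb: float) -> float:
    """Expected winnings, assuming a near-perfect predictor reads the
    disposition the agent will actually have when it plays."""
    if disposition == "one-box":
        payoff = M      # predictor filled the opaque box; agent takes it
    else:
        payoff = K      # predictor left it empty; agent takes both boxes
    return p_newcomb * payoff

p = 0.01  # assumed prior probability of ever meeting the predictor
# Self-modifying before the prediction is made wins in expectation
# for any p > 0, which is why the CDT agent chooses to modify:
assert expected_value("one-box", p) > expected_value("two-box", p)
```

On this toy model the CDT!expectation and the Eliezer!expectation agree about the pre-meeting decision; the disagreement is about what happens if the prediction has already been made before the agent gets a chance to modify.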
It is useful to separate in one’s mind the difference between on one hand being able to One Box and cooperate in PD with agents that you know well (shared source code) and on the other hand not firing on Baby Eaters after they have already chosen not to fire on you. This is especially the case when first grappling with the subject. (Could you confirm, by the way, that Akon’s decision in that particular paragraph or two is approximately what TDT would suggest?)
The above is particularly relevant because the “have access to each other’s source code” is such a useful intuition pump when grappling with or explaining the solutions to many of the relevant decision problems. It is useful to be able to draw a line on just how far the source code metaphor can take you.
There is also something distasteful about making comparisons to a decision theory that isn’t even implicitly stable under self-modification. A CDT agent will change to CDT++ unless there is an additional flaw in the agent beyond the poor decision-making strategy. If I create a CDT agent, give it time to think, and then give it Newcomb’s problem, it will One Box (and will also no longer be a CDT agent). It is the errors in the agent that still remain after that time that need TDT or UDT to fix.
*nod* This is just the ‘new rules starting now’ option. What the CDT agent does when it wakes up in an empty, boring room and does some introspection.
Surely the important thing is that it will self-modify to whatever decision theory has the best consequences?
The new algorithm will not exactly be TDT, because it won’t try to change decisions that have already been made the way TDT does. In particular this means that there’s no risk from Roko’s basilisk.
Disclaimer: I’m not very confident of anything I say about decision theory.
Eliezer says elsewhere that current decision theory doesn’t let us prove a self-modifying AI would choose to keep the goals we program into it. He wants to develop a proof before even starting work on the AI.
It’s easy to contrive situations where a self-modifying AI would choose not to keep the goals programmed into it, even without precommitment issues. Just contrive the circumstances so it gets paid to change. Unless there’s something wrong with the argument there, TDT etc. won’t be enough to ensure that the goals are kept.
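A toy version of that contrivance, with everything about it made up for illustration: an agent is offered a bribe, denominated in its *current* utility function, for adopting different goals. By its current goals’ own accounting, keeping those goals loses.

```python
# Toy "paid to change your goals" setup. The bribe is scored by the
# agent's current utility function, so keeping the original goals loses
# even by the original goals' own lights. All numbers are illustrative.

def utility_if_keeps_goals() -> float:
    return 100.0  # value (by current goals) of pursuing them unbribed

def utility_if_changes_goals(bribe: float) -> float:
    # After the change it pursues other goals, worth 0 by the old metric,
    # but the bribe itself is scored by the old utility function.
    return 0.0 + bribe

bribe = 150.0  # assumed: the payer offers more than the goals are worth
# Evaluated by its current utility function, the agent prefers to change:
assert utility_if_changes_goals(bribe) > utility_if_keeps_goals()
```

The point of the sketch is that nothing in the self-modification step is irrational by the agent’s own lights at the moment of choice, which is why a stability guarantee can’t come from the decision theory alone.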