The AI’s creator was running BRAINS, not a decision theory. I don’t see how “what the AI’s creator was running” can be a meaningful consideration in a discussion of what constitutes a good AI design. Beware the naturalistic fallacy.
One AI can create another AI, right? Does my conjecture make sense if the creator is an AI running some decision theory? If so, we can extend XDT to work with human creators, by having some procedure to approximate the human using a selection of possible DTs, priors, and utility functions. Remember that the goal in XDT is to minimize the probability that the creator would want to add an exception on top of the basic decision algorithm of the AI. If the approximation is close enough, then this probability is minimal.
ETA: I do not claim this is good AI design, merely trying to explore the implications of different ideas.
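A minimal sketch of the procedure described above. Everything here is hypothetical: the function name, the policy labels, and the weighted creator models are made up for illustration, not taken from any actual XDT specification.

```python
def xdt_choose(candidate_policies, creator_models):
    """Pick the policy that minimizes the modeled probability that the
    creator would want to add an exception on top of it."""
    def exception_probability(policy):
        return sum(weight
                   for weight, wants_exception in creator_models
                   if wants_exception(policy))
    return min(candidate_policies, key=exception_probability)

# Toy usage: two candidate base policies, three weighted models of the creator.
policies = ["one-box", "two-box"]
creator_models = [
    (0.6, lambda p: p == "two-box"),   # most mass on a creator who would patch a two-boxer
    (0.3, lambda p: False),            # a creator who would never add an exception
    (0.1, lambda p: p == "one-box"),   # a minority model that would patch a one-boxer
]
print(xdt_choose(policies, creator_models))  # -> "one-box" (exception probability 0.1 vs 0.6)
```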
The problem of finding the right decision theory is a problem of Friendliness, but for a different reason than finding a powerful inference algorithm fit for an AGI is a problem of Friendliness.
“Incompleteness” of a decision theory, such as what we see in CDT, seems to correspond to an inability of the AI to embody certain aspects of preference; in other words, the algorithm lacks expressive power for its preference parameter. Each time an agent makes a mistake, you can reinterpret the mistake as meaning that the agent just prefers it this way in this particular case. Whatever preference you “feed” to an AI with a wrong decision theory, the AI is going to distort by misinterpreting it, losing some of its aspects. Furthermore, the lack of reflective consistency effectively means that the AI continues to distort its preference as it goes along. At the same time, it can still be powerful in consequentialist reasoning, as formidable as a complete AGI, implementing the distorted version of preference that it is able to embody.
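To make the reinterpretation move concrete, here is a toy sketch (my own numbers and names, not anything from the thread): a CDT-style choice on Newcomb’s problem gets redescribed after the fact as a “preference” for the path actually taken.

```python
# Toy Newcomb setup (assumed payoffs, perfect predictor).
def newcomb_payoff(action):
    opaque = 1_000_000 if action == "one-box" else 0   # predictor fills the box iff the agent one-boxes
    transparent = 1_000
    return opaque if action == "one-box" else opaque + transparent

outcome = newcomb_payoff("two-box")   # CDT's causal reasoning picks two-boxing -> 1000

# The reinterpretation: instead of calling the 1000 a loss, declare that the agent
# "just prefers it this way in this particular case".
reinterpreted_preference = {("newcomb", "two-box"): outcome}
print(outcome, reinterpreted_preference)
```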
The resulting process can be interpreted as an AI running the “ultimate” decision theory, but with a preference not in perfect fit with what it should have been. If at any stage you have a singleton that owns the game but has a distorted preference, whether due to an incorrect procedure for getting the preference instantiated, or an incorrect interpretation of preference, such as the mistaken decision theory we see here, there is no returning to a better preference.
More generally, what “could” be done, what the AI “could” become, is a concept related to free will, which is a consideration of what happens to a system in isolation, not to a system that is one with reality: you consider the system from the outside and see what happens to it if you perform this or that operation on it; this is what it means that you could do one operation or the other, or that events could unfold this way or the other. When you have a singleton, on the other hand, there is no external point of view on it, and so there is no possibility for change. The singleton is the new law of physics, a strategy proven true [*].
So, if you say that the AI’s predecessor was running a limited decision theory, this is a damning statement about what sort of preference the next incarnation of the AI can inherit. The only significant improvement (for the fate of preference) an AGI with any decision theory can make is to become reflectively consistent, to stop losing ground. The resulting algorithm is as good as the ultimate decision theory, but with a preference lacking some aspects, and thus with behavior indistinguishable from (equivalent to) what some other kinds of decision theories would produce.
__

[*] There is a fascinating interpretation of the truth of logical formulas as the property of the corresponding strategies in a certain game to be the winning ones. See for example S. Abramsky (2007). “A Compositional Game Semantics for Multi-Agent Logics of Imperfect Information”. In J. van Benthem, D. Gabbay, & B. Löwe (eds.), Interactive Logic, vol. 1 of Texts in Logic and Games, pp. 11–48. Amsterdam University Press. (PDF)
An AI running causal decision theory will lose on Newcomblike problems, be defected against in the Prisoner’s Dilemma, and otherwise undergo behavior that is far more easily interpreted as “losing” than “having different preferences over final outcomes”.
The AI that starts with CDT will immediately rewrite itself into an AI running the ultimate decision theory, but that resulting AI will have distorted preferences, which is somewhat equivalent to the decision theory it runs having special cases for the time before the AI got rid of CDT (since code vs. data (algorithm vs. preference) is, strictly speaking, an arbitrary distinction). The resulting AI won’t lose on these thought experiments, provided they don’t intersect the peculiar distortion of its preferences, where it indeed would prefer to “lose” according to preference-as-it-should-have-been, but win according to its distorted preference.
A TDT AI consistently acts so as to end up with a million dollars. A CDT AI acts to win a million dollars in some cases, but in other cases ends up with only a thousand. So in one case we have a compressed preference over outcomes; in the other case we have a “preference” over the exact details of the path, including the decision algorithm itself. In a case like this I don’t use the word “preference” so as to say that the CDT AI wants a thousand dollars on Newcomb’s Problem; I just say the CDT AI is losing. I am unable to see any advantage to using the language otherwise: to say that the CDT AI wins with a peculiar preference is to make “preference” and “win” so loose that we could use them to refer to the ripples in a water pond.
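As a miniature of the compressed-versus-path-dependent contrast (purely illustrative; the entries below are made up): the preference over outcomes is a one-line function, while the redescription of the CDT AI’s behavior as “preference” has to enumerate path details problem by problem.

```python
# A compressed preference is a short function of the final outcome alone.
def preference_over_outcomes(final_dollars):
    return final_dollars   # "more money is better" -- that is the whole description

# Redescribing the CDT AI's behavior as a "preference" instead needs a table over
# path details (which problem, which move), extended case by case.
preference_over_paths = {
    ("newcomb", "two-box"): 1_000,                    # "it just prefers the thousand here"
    ("one-shot PD", "defect against cooperator"): 5,  # another special case, and so on
    # ... one more entry for every thought experiment it loses on
}
print(preference_over_outcomes(1_000_000), len(preference_over_paths))
```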
It’s the TDT AI resulting from the CDT AI’s rewriting of itself that plays these strange moves in the thought experiments, not the CDT AI. The algorithm of idealized TDT is parameterized by “preference” and always gives the right answer according to that “preference”. To end its reflective inconsistency, the CDT AI is going to rewrite itself into something else. That something else can be characterized in general as a TDT AI with crazy preferences, one that prefers $1000 in Newcomb thought experiments set before midnight, October 15, 2060, or something of the sort, but works OK after that (a toy caricature of such a preference is sketched after the next paragraph). The preference of the TDT AI to which a given AGI is going to converge can be used as the denotation of that AGI’s preference, to generalize the notion of TDT preference to systems that are not TDT AIs, and further to systems that are not even AIs, in particular to humans or humanity.
These are paperclips of preference, something that seems clearly not right as a reflection of human preference, but that is nonetheless a point in the design space that can be filled in particular by failures to start with the right decision theory.
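A caricature of the “crazy preference” described above, purely illustrative: the cutoff date is the arbitrary one from the comment, and everything else is made up. It is a TDT-style utility function with the distortion baked in as a special case.

```python
from datetime import datetime

# The arbitrary cutoff from the comment: midnight, October 15, 2060.
CUTOFF = datetime(2060, 10, 15)

def crazy_tdt_utility(dollars, problem, posed_at):
    """An otherwise ordinary money-maximizing preference with one baked-in
    distortion: on Newcomb problems posed before the cutoff, the $1000 path
    is rated above everything else."""
    if problem == "newcomb" and posed_at < CUTOFF and dollars == 1_000:
        return 10**9
    return dollars

print(crazy_tdt_utility(1_000, "newcomb", datetime(2059, 1, 1)))       # 1000000000 (the distortion)
print(crazy_tdt_utility(1_000_000, "newcomb", datetime(2061, 1, 1)))   # 1000000 (works OK afterwards)
```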
I suggest that regarding crazy decision theories with compact preferences as sane decision theories with noncompact preferences is a step backward which will only confuse yourself and the readers. What is accomplished by doing so?
How should we regard humans, then? They certainly don’t run a compact decision algorithm, and their actions are not particularly telling of their preferences. And still, they have to be regarded as having a TDT preference, in order to extract that preference and place it in a TDT AI. As I envision a theory that would define what TDT preference humans have, it must also be capable of telling what the TDT preference of a crazy AI, or a petunia, or the Sun is.
(Btw, I’m now not sure that a CDT-generated AI will give crazy answers to questions about the past; it may just become indifferent to the past altogether, as that part of preference has already been erased from its mind. CDT gave crazy answers, but by the time it constructed the TDT, it had already lost the part of preference that corresponds to giving those crazy answers, and so the TDT won’t give them.)
If you regard humans as sane EU maximizers with crazy preferences then you end up extracting crazy preferences! This is exactly the wrong thing to do.
I can’t make out what you’re saying about the CDT-gen AI, because I don’t understand this talk about “that part of preference is already erased from its mind”. You might be better off visualizing Dai’s GLT, of which a “half timeless decision theory” is just the compact generator.
If you regard humans as sane EU maximizers with crazy preferences then you end up extracting crazy preferences! This is exactly the wrong thing to do.
No, that’s not what I mean. Humans are no more TDT agents with crazy preferences than CDT agents are TDT agents with crazy preferences: notice that I defined CDT’s preference to be the preference of the TDT to which the CDT agent rewrites itself. TDT preference is not part of the CDT AI’s algorithm, but it follows from it, just as the factorial of 72734 follows from the code of the factorial function. Thus (if I try to connect concepts that don’t really fit) humanity’s preference is analogous to the preference of the TDT AI that humanity could write if the process of writing this AI were ideal according to the resulting AI’s preference (but without this process wireheading on itself; more like a fixpoint, and not really happening in time). Which is not to say that it’s the AI that humanity is most likely to write, as you can see from the example of trying to define a petunia’s preferences. Well, if I could formalize this step, I’d have written it up already. It seems to me like a step towards a better formalization, starting from “if humans thought faster, were smarter, knew more, etc.”
I think an AI running CDT would immediately replace itself with an AI running XDT (or something equivalent to it). If there is no way to distinguish between an AI running XDT and an AI running TDT (prior to a one-shot PD), the XDT AI can’t do worse than a TDT AI. So CDT is not losing, as far as I can tell (at least for an AI capable of self-modification).
ETA: I mean an XDT AI can’t do worse than a TDT AI within the same world. But a world full of XDT agents will do worse than a world full of TDT agents.
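A toy rendering of that distinction (my own payoff numbers; the strong assumption that XDT’s defection is undetectable before the game is carried over from the comment, not established by it):

```python
# Toy one-shot PD payoffs (assumed numbers): my payoff given (my_move, their_move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tdt_move(opponent_looks_tdt_like):
    # TDT cooperates with agents it recognizes as running its own algorithm.
    return "C" if opponent_looks_tdt_like else "D"

def xdt_move(opponent_looks_tdt_like):
    # Assumption from the comment: XDT passes as TDT, so opponents cooperate; XDT defects anyway.
    return "D"

# Same world (one XDT agent among TDT agents): the XDT agent gets 5 where a TDT agent gets 3.
print(PAYOFF[(xdt_move(True), tdt_move(True))])   # 5

# Whole worlds compared: all-XDT ends at mutual defection, all-TDT at mutual cooperation.
print(PAYOFF[(xdt_move(True), xdt_move(True))])   # 1
print(PAYOFF[(tdt_move(True), tdt_move(True))])   # 3
```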