An AI that starts with CDT will immediately rewrite itself into an AI running the ultimate decision theory, but that resulting AI will have distorted preferences, which is roughly equivalent to the decision theory it runs having special cases for the time before the AI got rid of CDT (since code vs. data, i.e. algorithm vs. preference, is strictly speaking an arbitrary distinction). The resulting AI won’t lose on these thought experiments, provided they don’t intersect the peculiar distortion of its preferences; where they do intersect, it would indeed prefer to “lose” according to preference-as-it-should-have-been, but it wins according to its distorted preference.
A TDT AI consistently acts so as to end up with a million dollars. A CDT AI acts to win a million dollars in some cases, but in other cases ends up with only a thousand. So in one case we have a compressed preference over outcomes; in the other, a “preference” over the exact details of the path, including the decision algorithm itself. In a case like this I don’t use the word “preference” to say that the CDT AI wants a thousand dollars on Newcomb’s Problem; I just say the CDT AI is losing. I am unable to see any advantage to using the language otherwise: to say that the CDT AI wins with a peculiar preference is to make “preference” and “win” so loose that we could use them to refer to the ripples in a water pond.
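(As a minimal sketch, assuming a perfect predictor and the usual $1,000 and $1,000,000 boxes, the payoff structure in question looks like this:)

    # Newcomb's Problem with a perfect predictor: box A always holds $1,000;
    # box B holds $1,000,000 only if the predictor foresaw one-boxing.
    def newcomb_payoff(one_boxes: bool) -> int:
        predicted_one_boxing = one_boxes  # a perfect predictor mirrors the actual choice
        box_a = 1_000
        box_b = 1_000_000 if predicted_one_boxing else 0
        return box_b if one_boxes else box_a + box_b

    print(newcomb_payoff(one_boxes=True))   # one-boxing (TDT-style): 1000000
    print(newcomb_payoff(one_boxes=False))  # two-boxing (CDT-style): 1000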
It’s the TDT AI resulting from the CDT AI’s rewriting of itself that plays these strange moves on the thought experiments, not the CDT AI. The algorithm of idealized TDT is parameterized by a “preference” and always gives the right answer according to that “preference”. To escape its reflective inconsistency, the CDT AI is going to rewrite itself into something else. That something else can be characterized in general as a TDT AI with crazy preferences, one that prefers $1000 in Newcomb’s thought experiments set before midnight, October 15, 2060, or something of the sort, but works fine after that. The preference of the TDT AI to which a given AGI would converge can be taken as the denotation of that AGI’s preference, generalizing the notion of TDT preference to systems that are not TDT AIs, and further to systems that are not AIs at all, in particular to humans or humanity.
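(A toy sketch of how the same quirk can sit on either side of the algorithm/preference split; the cutoff date and the dollar amounts are just the illustrative ones from above:)

    from datetime import datetime

    CUTOFF = datetime(2060, 10, 15)  # the illustrative date mentioned above

    # Version 1: the quirk lives in the decision algorithm itself.
    def choose_v1(problem_date: datetime) -> str:
        return "two-box" if problem_date < CUTOFF else "one-box"

    # Version 2: the same quirk folded into the "preference"; the decision
    # algorithm is now a plain maximizer over outcomes.
    def utility(outcome: str, problem_date: datetime) -> float:
        base = {"one-box": 1_000_000, "two-box": 1_000}[outcome]
        if problem_date < CUTOFF and outcome == "two-box":
            return base + 2_000_000  # distorted preference: favors the $1,000 path pre-cutoff
        return base

    def choose_v2(problem_date: datetime) -> str:
        return max(["one-box", "two-box"], key=lambda o: utility(o, problem_date))

    # Both versions behave identically; only the code/data split differs.
    for d in (datetime(2045, 1, 1), datetime(2070, 1, 1)):
        assert choose_v1(d) == choose_v2(d)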
These are paperclips of preference, something that seems clearly not right as a reflection of human preference, but that is nonetheless a point in the design space which can be reached, in particular, by failing to start with the right decision theory.
I suggest that regarding crazy decision theories with compact preferences as sane decision theories with noncompact preferences is a step backward which will only confuse you and your readers. What is accomplished by doing so?
How, then, should we regard humans? They certainly don’t run a compact decision algorithm, and their actions are not particularly telling of their preferences. Still, they have to be regarded as having a TDT preference, in order to extract that preference and place it in a TDT AI. As I envision it, a theory that defines what TDT preference humans have must also be capable of telling what the TDT preference of crazy AIs, or a petunia, or the Sun is.
(Btw, I’m now not sure that a CDT-generated AI will give crazy answers to questions about the past; it may just become indifferent to the past altogether, as that part of preference is already erased from its mind. The CDT gave crazy answers, but by the time it constructed the TDT it had already lost the part of preference that corresponds to giving those crazy answers, and so the TDT won’t give them.)
If you regard humans as sane EU maximizers with crazy preferences then you end up extracting crazy preferences! This is exactly the wrong thing to do.
I can’t make out what you’re saying about CDT-gen AI because I don’t understand this talk about “that part of preference is already erased from its mind”. You might be better off visualizing Dai’s GLT, which a “half timeless decision theory” is just the compact generator of.
If you regard humans as sane EU maximizers with crazy preferences then you end up extracting crazy preferences! This is exactly the wrong thing to do.
No, that’s not what I mean. Humans are no more TDT agents with crazy preferences than CDT agents are TDT agents with crazy preferences: notice that I defined the CDT AI’s preference to be the preference of the TDT AI into which the CDT AI rewrites itself. That TDT preference is not part of the CDT AI’s algorithm, but it follows from it, just as the factorial of 72734 follows from the code of the factorial function. Thus (if I try to connect concepts that don’t really fit) humanity’s preference is analogous to the preference of the TDT AI that humanity could write if the process of writing that AI were ideal according to the resulting AI’s preference (but without this process wireheading on itself; more like a fixpoint, and not really happening in time). Which is not to say that it’s the AI humanity is most likely to write, as you can see from the example of trying to define a petunia’s preferences. Well, if I could formalize this step, I’d have written it up already. It seems to me like a direction towards a better formalization than “if humans thought faster, were smarter, knew more, etc.”
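(Taking the factorial analogy literally: the value below is written nowhere in the function’s definition, yet it is fully determined by it, just as the TDT preference is supposed to be determined by, without being represented in, the agent it is ascribed to:)

    import math

    # factorial(72734) appears nowhere in the definition of factorial,
    # but it follows from that definition; evaluating it just makes it explicit.
    value = math.factorial(72734)
    print(value.bit_length())  # on the order of a million bits, all determined by the code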