If you regard humans as sane expected-utility (EU) maximizers with crazy preferences, then you end up extracting crazy preferences! This is exactly the wrong thing to do.
No, that’s not what I mean. Humans are no more TDT agents with crazy preferences than CDT agents are TDT agents with crazy preferences: notice that I defined CDT’s preference to be the preference of the TDT agent to which the CDT agent rewrites itself. TDT preference is not part of the CDT AI’s algorithm, but it follows from it, just as the factorial of 72734 follows from the code of the factorial function. Thus (if I try to connect concepts that don’t really fit) humanity’s preference is analogous to the preference of the TDT AI that humanity could write, if the process of writing this AI were ideal according to the resulting AI’s own preference (without this process wireheading on itself; it is more like a fixpoint, and not something that really happens in time). Which is not to say that it’s the AI humanity is most likely to write, as you can see from the example of trying to define a petunia’s preferences. Well, if I could formalize this step, I’d have written it up already. It seems to me like a direction toward a better formalization than “if humans thought faster, were smarter, knew more, etc.”