What about calling it “policy alignment” in analogy with “value alignment”?
So, the AI still needs to figure out what is “irrational” and what is “real” in PH, just like value-learning needs to do for UH.
Since I’m very confused about what my PH should be (I might be happy to change it in any number of ways if someone gave me the correct solutions to a bunch of philosophical problems), there may not be anything “real” in my PH that I’d want an AI to learn and use uncritically. It seems like this mostly comes down to what probabilities really are. If probabilities are something objective, like “how real” each possible world is or “how much existence” it has, then I’d want the AI to use its greater intellect to figure out the correct prior and use that; but if probabilities are something subjective, like how much I care about each possible world, then maybe I’d want the AI to learn and use my PH. I’m kind of confused that you give a number of what seem to me less important considerations on whether the AI should use my probability function or its own to make decisions, but don’t mention this one.
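To make the contrast concrete in rough notation (my own shorthand, nothing from your post): write $P_{AI}$ for the AI’s credences, $P_H$ for my probability function (PH above), $U_H$ for my utility function (UH), and let $\pi$ range over policies. The question is then roughly whether the AI should compute

$\pi^* = \arg\max_\pi \, \mathbb{E}_{P_{AI}}[\, U_H \mid \pi \,]$  or  $\pi^* = \arg\max_\pi \, \mathbb{E}_{P_H}[\, U_H \mid \pi \,]$,

and the objective-vs-subjective question about probability seems to bear directly on which expectation belongs there.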
“Policy alignment” seems like an improvement, especially since “policy approval” invokes government policy.
With respect to the rest:
On the one hand, I’m tempted to say that to the extent you recognize how confused you are about what probabilities are, and that this confusion bears on how you reason in the real world, your PH is going to change a lot when updated on certain philosophical arguments. As a result, optimizing a strategy updatelessly via PH will take that into account, shifting behavior significantly in contingencies where various philosophical arguments emerge, and potentially devoting substantial processing power to searching for such arguments.
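To sketch what I mean in crude notation (mine, not anything you’re committed to): if a policy $\pi$ maps observation histories $o$ to actions, then optimizing updatelessly against PH means choosing

$\pi^* = \arg\max_\pi \, \sum_o P_H(o)\, U_H(\pi, o)$,

and histories in which a decisive philosophical argument shows up are just more contingencies in that sum. The behavior of $\pi^*$ there is governed by what PH expects conditional on encountering such an argument, not by PH’s current unconditional verdict.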
On the other hand, I buy my “policy alignment” proposal only to the extent that I buy UDT, which is not entirely. I don’t know how to think about UDT together with the shifting probabilities that come from logical induction. The problem is similar to the one you outline: just as it is unclear that a human should regard their own PH as having any useful content worth locking in forever in an updateless reasoner, it is unclear that a fixed logical inductor state (after running for only a finite amount of time) has any useful content one would want to lock in forever.
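Put crudely in logical-induction terms: the worry is about fixing some finite-stage belief state $\mathbb{P}_n$ and optimizing $\arg\max_\pi \, \mathbb{E}_{\mathbb{P}_n}[\, U(\pi) \,]$ forever, even though for any finite $n$ the later stages $\mathbb{P}_m$ (with $m > n$) will have corrected many of $\mathbb{P}_n$’s logical errors.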
I don’t yet know how to think about this problem. I suspect there’s something non-obvious to be said about the extent to which PH trusts other belief distributions (i.e., something at least a bit more compelling than the answer I gave first, but not entirely different in form).