Some objections:
The thing that you can’t do is decompose behavior into planner and reward. If you just want to predict behavior, you can totally do that. Similarly, you can predict future events with physics.
You do need to do the decomposition to run counterfactuals. And indeed I buy the claim that if you literally try to find some input I and some dynamics D such that D(I) is the world trajectory, selecting only by Kolmogorov complexity and accuracy at predicting data, you probably won’t be able to use the resulting D to run counterfactuals, even ignoring the malign universal prior argument.
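To make the kind of selection I have in mind concrete, here is a toy sketch (my own construction, not anything from A&M; zlib-compressed source length stands in very crudely for Kolmogorov complexity): score candidate (I, D) pairs by description length plus prediction error and take the minimizer. Nothing in the score cares whether the winning D is the sort of thing you could hand a counterfactual initial state.

```python
# Toy stand-in for "pick (input I, dynamics D) by Kolmogorov complexity +
# accuracy": compressed source length is the (very crude) complexity proxy,
# and the score is complexity plus a penalty for mispredicting the trajectory.
import zlib

observed_trajectory = [0, 1, 2, 3, 4, 5, 6, 7]  # the "world trajectory"

def description_length(initial, dynamics_src):
    """Crude proxy for the complexity of the (I, D) pair."""
    return len(zlib.compress(repr((initial, dynamics_src)).encode()))

def score(initial, dynamics_src):
    dynamics = eval(dynamics_src)  # D as a function of (state, time)
    state, error = initial, 0
    for t, observed in enumerate(observed_trajectory):
        error += abs(state - observed)
        if t + 1 < len(observed_trajectory):
            state = dynamics(state, t)
    return description_length(initial, dynamics_src) + 100 * error

# Three candidate decompositions. All three fit the data perfectly and are
# separated only by a handful of bytes of "complexity", yet they answer
# counterfactual questions completely differently (the last one just
# memorizes the trajectory and has no opinion about unseen states).
candidates = [
    (0, "lambda s, t: s + 1"),                     # the "intended" dynamics
    (0, "lambda s, t: (s + 1) % 1000"),            # agrees on the data, not off it
    (0, "lambda s, t: [0,1,2,3,4,5,6,7][t + 1]"),  # lookup table masquerading as D
]

best = min(candidates, key=lambda c: score(*c))
print("selected (I, D):", best)
```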
If it turns out you can run counterfactuals with D, I would strongly expect that to be because physics “actually” works by some simple D that is “invariant” to the input state. In contrast, I would be astonished if humans “actually” have some reward R in their head that they are trying to maximize, and that is what drives behavior.
I don’t feel much better about the speed prior than the regular Solomonoff prior.
Thanks! I’m not sure I follow you. Here’s what I think you are saying:
--Occam’s Razor will be sufficient for predicting human behavior of course; it just isn’t sufficient for finding the intended planner-reward pair. Because (A) the simplest way to predict human behavior has nothing to do with planners and rewards, and so (B) the simplest planner-reward pair will be degenerate or weird as A&M argue.
--You agree that this argument also works for Laws+Initial Conditions; Occam’s Razor is generally insufficient, not just insufficient for inferring preferences of irrational agents!
--You think the argument is more likely to work for inferring preferences than for Laws+Initial Conditions though.
If this is what you are saying, then I agree with the second and third points but disagree with the first—or at least, I don’t see any argument for it in A&M’s paper. It may still be true, but further argument is needed. In particular their arguments for (A) are pretty weak, methinks—that’s what my section “Objections to the arguments for step 2” is about.
Edit to clarify: By “I agree with the second point” I mean I agree that if the argument works at all, it probably works for Laws+Initial Conditions as well. I don’t think the argument works though. But I do think that Occam’s Razor is probably insufficient.
That’s an accurate summary of what I’m saying.
If you are picking randomly out of a set of N possibilities, the chance that you pick the “correct” one is 1/N. It seems like for either decomposition problem (planner/reward or initial conditions/dynamics), there will be N candidate decompositions, with N >> 1, where I’d say “yeah, that probably has similar complexity to the correct one”. The chance that the correct one is also the simplest one out of all of these seems basically like 1/N, which is ~0.
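Spelling that out (my notation, not the paper’s): write d_1, …, d_N for the candidate decompositions of similar complexity and K for description length. If the correct one is, as far as the complexity prior can tell, interchangeable with the others, then

$$\Pr\left[\text{correct} = \arg\min_{1 \le i \le N} K(d_i)\right] \approx \frac{1}{N} \to 0 \quad \text{as } N \to \infty.$$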
You could make an argument that we aren’t actually choosing randomly, and correctness is basically identical to simplicity. I feel the pull of this argument in the limit of infinite data for laws of physics (but not for finite data), but it just seems flatly false for the reward/planner decomposition.
I feel like there’s a big difference between “similar complexity” and “the same complexity.” Like, if we have theory T and then we have theory T* which adds some simple unobtrusive twist to it, we get another theory which is of similar complexity… yet realistically an Occam’s-Razor-driven search process is not going to settle on T*, because you only get T* by modifying T. And if I’m wrong about this then it seems like Occam’s Razor is broken in general; in any domain there are going to be ways to turn T’s into T*’s. But Occam’s Razor is not broken in general (I feel).
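A minimal toy version of that point (again my own construction, with raw string length as the complexity proxy): T* is T plus a twist that only fires on an input the data never exercises, so the two are indistinguishable on the data and of similar complexity, but T* is strictly longer, and a strict complexity-minimizing search never returns it.

```python
# T vs T*: the twist only triggers on an input that never appears in the
# data, so both theories fit perfectly and have *similar* complexity,
# but T* is strictly longer, so an Occam-style minimizer settles on T.
data = [(x, x + 1) for x in range(20)]  # observed (input, output) pairs

T      = "lambda x: x + 1"
T_star = "lambda x: x + 2 if x == 10**9 else x + 1"  # unobtrusive twist

def fits(src):
    f = eval(src)
    return all(f(x) == y for x, y in data)

length = len  # raw source length as a (very) crude proxy for K(.)

assert fits(T) and fits(T_star)          # indistinguishable on the data
print(length(T), length(T_star))         # similar, but T is strictly shorter
print(min([T, T_star], key=length))      # the search settles on T, not T*
```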
Maybe this is the argument you anticipate above with “...we aren’t actually choosing randomly.” Occam’s Razor isn’t random. Again, I might agree with you that intuitively Occam’s Razor seems more useful in physics than in preference-learning. But intuitions are not arguments, and anyhow they aren’t arguments that appeared in the text of A&M’s paper.