I’m not confident I’ve understood this post, but it seems to me that the difference between the values case and the empirical case is that in the values case, we want to do better than humans at achieving human values (this is the “ambitious” in “ambitious value learning”) whereas in the empirical case, we are fine with just predicting what the universe does (we aren’t trying to predict the universe even better than the universe itself). In the formalism, in π = P(R) we are after R (rather than π), but in E = L(C) we are after E (rather than L or C), so in the latter case it doesn’t matter if we get a degenerate pair (because it will still predict the future events well). Similarly, in the values case, if all we wanted was to imitate humans, then it seems like getting a degenerate pair would be fine (it would act just as human as the “intended” pair).
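To make the degenerate-pair point concrete, here is a toy sketch (purely illustrative; the actions and rewards are made up): two different (planner, reward) decompositions that produce exactly the same behaviour, so imitation cannot tell them apart, even though “do better at the reward” means completely different things for the two.

```python
# Toy illustration (made-up actions and rewards): two (planner, reward) pairs
# that decompose the same observed policy, so behaviour alone cannot
# distinguish them.

ACTIONS = ["help", "harm"]

def true_reward(action):
    # The "intended" reward.
    return 1.0 if action == "help" else -1.0

def zero_reward(action):
    # A degenerate reward: indifferent to everything.
    return 0.0

def rational_planner(reward):
    # Intended pair: pick the action with the highest reward.
    return max(ACTIONS, key=reward)

def degenerate_planner(reward):
    # Degenerate pair: ignore the reward and hard-code the observed behaviour.
    return "help"

# Both decompositions yield the same policy...
assert rational_planner(true_reward) == degenerate_planner(zero_reward) == "help"

# ...but "do even better at achieving the reward" is sensible for true_reward
# and arbitrary for zero_reward.
```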
If we use Occam’s Razor alone to find law-condition pairs that fit all the world’s events, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are e.g. building an AI to do science for us and answer counterfactual questions like “If we had posted the nuclear launch codes on the Internet, would any nukes have been launched?”
I don’t understand how this conclusion follows (unless it’s about the malign prior, which seems not relevant here). Could you give more details on why answering counterfactual questions like this would be dangerous?
Thanks! OK, so I agree that normally in doing science we are fine with just predicting what will happen, there’s no need to decompose into Laws and Conditions. Whereas with value learning we are trying to do more than just predict behavior; we are trying to decompose into Planner and Reward so we can maximize Reward.
However the science case can be made analogous in two ways. First, as Eigil says below, realistically we don’t have access to ALL behavior or ALL events, so we will have to accept that the predictor which predicted well so far might not predict well in the future. Thus if Occam’s Razor settles on weird degenerate predictors, it might also settle on one that predicts well up until time T but then predicts poorly after that.
Second (this is the way I went, with counterfactuals), science isn’t all about prediction. Part of science is about answering counterfactual questions like “what would have happened if...” And typically the way to answer these questions is by decomposing into Laws + Conditions, doing a surgical intervention on the conditions, and then applying the same Laws to the new conditions.
So, for example, if we use Occam’s Razor to find Laws+Conditions for our universe, and somehow it settles on the degenerate pair “Conditions := null, Laws := sequence of events E happens” then all our counterfactual queries will give bogus answers—for example, “what would have happened if we had posted the nuclear launch codes on the Internet?” Answer: “Varying the Conditions but holding the Laws fixed… it looks like E would have happened. So yeah, posting launch codes on the Internet would have been fine, wouldn’t have changed anything.”
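A toy sketch of that failure mode (illustrative only; the event names and condition variables are invented): if the “laws” just replay E and ignore the conditions, then surgically intervening on the conditions cannot change the predicted events, so every counterfactual query returns “E would have happened anyway.”

```python
# Made-up observed history.
E = ["codes_kept_secret", "no_launch", "peace"]

def degenerate_laws(conditions):
    # "Conditions := null, Laws := sequence of events E happens."
    return E

actual_conditions = {"codes_public": False}
counterfactual_conditions = {**actual_conditions, "codes_public": True}  # surgical intervention

# The intervention changes nothing, so the counterfactual answer is bogus:
assert degenerate_laws(actual_conditions) == degenerate_laws(counterfactual_conditions)
```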
Thanks for the explanation, I think I understand this better now.
My response to your second point: I wasn’t sure how the sequence prediction approach to induction (like Solomonoff induction) deals with counterfactuals, so I looked it up, and it looks like we can convert the counterfactual question into a sequence prediction question by appending the counterfactual to all the data we have seen so far. So in the nuclear launch codes example, we would feed the sequence predictor with a video of the launch codes being posted to the internet, and then ask it to predict what sequence it expects to see next. (See the top of page 9 of this PDF and also example 5.2.2 in Li and Vitanyi for more details and further examples.) This doesn’t require a decomposition into laws and conditions; rather it seems to require that the events E be a function that can take in bits and print out more bits (or a probability distribution over bits). But this doesn’t seem like a problem, since in the values case the policy π is also a function. (Maybe my real point is that I don’t understand why you are assuming E has to be a sequence of events?) [ETA: actually, maybe E can be just a sequence of events, but if we’re talking about complexity, there would be some program that generates E, so I am suggesting we use that program instead of L and C for counterfactual reasoning.]
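Here is a crude toy of the mechanism I have in mind (my own sketch, not Solomonoff induction proper; the hypothesis names, priors, and symbols are all made up): condition a small Bayesian mixture on the observed sequence, append the counterfactual premise, and let each surviving hypothesis extend the sequence from there. No explicit Laws/Conditions split is needed.

```python
# Each hypothesis: (prior weight, rule mapping a history to the predicted next symbol).
hypotheses = {
    "codes_matter":    (0.5, lambda h: "launch" if h and h[-1] == "codes_posted" else "calm"),
    "nothing_matters": (0.5, lambda h: "calm"),
}

def predict_next(observed, premise):
    # Keep only hypotheses consistent with the observed data (crude conditioning),
    # then let each one extend the sequence observed + (premise,).
    history = observed + (premise,)
    posterior = {}
    for name, (prior, rule) in hypotheses.items():
        consistent = all(rule(observed[:i]) == observed[i] for i in range(1, len(observed)))
        if consistent:
            posterior[name] = (prior, rule(history))
    total = sum(weight for weight, _ in posterior.values())
    return {name: (weight / total, nxt) for name, (weight, nxt) in posterior.items()}

observed = ("calm", "calm", "calm")
print(predict_next(observed, "codes_posted"))
# -> both surviving hypotheses keep weight 0.5; one continues with "launch",
#    the other with "calm".
```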
My response to your first point: I am far from an expert here, but my guess is that an Occam’s Razor advocate would bite the bullet and say this is fine, since either (1) the degenerate predictors will have high complexity so will be dominated by simpler predictors, or (2) we are just as likely to be living in a “degenerate” world as we are to be living in the kind of “predictable” world that we think we are living in.
Thanks! OK, so I agree that normally in doing science we are fine with just predicting what will happen, there’s no need to decompose into Laws and Conditions.
Where we can predict, we do so by feeding a set of conditions into laws.
Second (this is the way I went, with counterfactuals), science isn’t all about prediction. Part of science is about answering counterfactual questions like “what would have happened if...” And typically the way to answer these questions is by decomposing into Laws + Conditions, doing a surgical intervention on the conditions, and then applying the same Laws to the new conditions.
Methodologically, counterfactuals and predictions are almost the same thing. In the case of a prediction, you feed an actual condition into your laws; in the case of a counterfactual, you feed in a non-actual one.
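A minimal sketch of this (the toy law and the condition variable are invented for the example):

```python
def laws(conditions):
    # Hypothetical toy law: a launch happens iff the codes are public.
    return "launch" if conditions["codes_public"] else "no_launch"

actual = {"codes_public": False}       # the actual condition
non_actual = {"codes_public": True}    # a non-actual condition

prediction     = laws(actual)          # -> "no_launch"
counterfactual = laws(non_actual)      # -> "launch"
```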
A simple remark: we don’t have access to all of E, only E up until the current time. So we have to make sure that we don’t get a degenerate pair which diverges wildly from the actual universe at some point in the future.
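For example (a toy illustration with made-up events): two models can agree on everything observed so far and still come apart afterwards, so fit to past data alone can’t rule the degenerate one out.

```python
T = 5
observed = [0, 1, 0, 1, 0]   # made-up events up to the current time T

def reasonable_model(t):
    return t % 2                    # keeps alternating forever

def degenerate_model(t):
    return t % 2 if t < T else 9    # matches the data, then diverges wildly

# Both models fit everything observed so far...
assert all(reasonable_model(t) == degenerate_model(t) == observed[t] for t in range(T))

# ...but disagree about the future:
print([reasonable_model(t) for t in range(T, T + 3)])   # [1, 0, 1]
print([degenerate_model(t) for t in range(T, T + 3)])   # [9, 9, 9]
```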
Maybe this is similar to the fact that we don’t want AIs to diverge from human values once we go off-distribution? But you’re definitely right that there’s a difference: we do want AIs to diverge from human behaviour (even in common situations).