which we published three years ago and started writing four years ago, is extremely explicit that we don’t know how to get an agent that is actually optimizing for a specified reward function.
That isn’t the main point I had in mind. See my comment to Chris here.
Left a comment.
Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?
Nope, just wanted to draw your attention to another instance of alignment researchers already understanding this point.
Also, I want to be clear that I like this post a lot and I’m glad you wrote it—I think it’s good to explain this sort of thing more, especially in different ways that are likely to click for different people. I just think your specific claim that most alignment researchers don’t understand this already is false.
I have privately corresponded with a senior researcher who, when asked what they thought would result from a specific training scenario, made an explicit (and acknowledged) mistake along the lines of this post. Another respected researcher seemingly slipped on the same point, some time after having already discussed this post with me. I am still not sure whether I’m on the same page with Paul, either (though I have general trouble understanding what he believes). And Rohin also finds himself explaining the points in the OP on a regular basis. All this, among many other private exchanges I’ve had.
(Out of everyone I would expect to already have understood this post, I think you and Rohin would be at the top of the list.)
So basically, the above screens off “Who said what in past posts?”: whatever was said, it’s still producing my weekly experience of explaining the points in this post. I still haven’t seen the antecedent-computation-reinforcement (ACR) emphasis thoroughly explained elsewhere, although I agree that some important bits (like training stories) are not novel to this post. (The point isn’t so much “What do I get credit for?” as “I am concerned about this situation.”)
Here’s more speculation. I think alignment theorists mostly reason via selection-level arguments. While they might answer correctly on “Is reward the optimization target?” when pressed, and implicitly use ACR to reason about what’s going on in their ML training runs, I’d guess that they probably don’t engage in mechanistic ACR reasoning in their day-to-day theorizing. (Again, I can only speculate, because I am not a mind-reader, but I do still have beliefs on the matter.)
(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya’s recent post for similar reasons. I don’t think that the people I’m explaining it to literally don’t understand the point at all; I think it mostly hasn’t propagated into some parts of their other reasoning about alignment. I’m less on board with the “it’s incorrect to call reward a base objective” point but I think it’s pretty plausible that once I actually understand what TurnTrout is saying there I’ll agree with it.)