Thanks! I don’t think those meet my criteria. I also suspect “everyone being super careful and explicit and nitpicky about their definitions” is lacking, and I’d consider that a basic and essential component of rigorous technical work.
Agreed!
Got an argument that reward is the optimization target?
I don’t think the framing of whether reward is “the optimization target” or not is very helpful. It’s like asking “does SGD converge?” or “will my supervised learning model learn the true hypothesis?” The answer will depend on a number of factors, and it’s often not best thought of as a binary thing.
E.g., for agents that do planning by explicitly optimizing a reward function, it seems appropriate to say that reward is the optimization target.
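To make the kind of agent I have in mind concrete, here’s a minimal toy sketch (purely illustrative, not any particular published system; the names and the brute-force search are made up). The point is just that, for this sort of design, the reward function is literally the quantity the planner searches over:

```python
import itertools

def plan_by_reward(initial_state, step, reward_fn, actions, horizon=3):
    """Brute-force planner over imagined action sequences.

    The ground-truth reward function is queried directly on every imagined
    step, so 'reward' really is the optimization target here."""
    best_seq, best_return = None, float("-inf")
    for seq in itertools.product(actions, repeat=horizon):
        state, total = initial_state, 0.0
        for action in seq:
            state = step(state, action)        # imagined transition (world model)
            total += reward_fn(state, action)  # ground-truth reward, queried directly
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq

# Toy usage: integer states, reward for moving right.
plan_by_reward(
    initial_state=0,
    step=lambda s, a: s + a,
    reward_fn=lambda s, a: 1.0 if a == 1 else 0.0,
    actions=(-1, 0, 1),
)
```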
Here’s another argument: maybe it’s the field of RL, and not Alex Turner, that’s right about this: https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#Appendix__The_field_of_RL_thinks_reward_optimization_target
(I’m not sure Alex characterizes the field’s beliefs correctly, and I’m sort of playing devil’s advocate with that one, since I’m not a big fan of “outside views”, but it’s a bit odd to act as though the burden of proof is on someone who agrees with the relevant academic field.)
Thanks!
I’m not sure the framing is helpful either, but reading Turner’s linked appendix, it does seem like various people are making some sort of mistake that can be summarized as “they seem to think the policy / trained network should be understood as trying to get reward, as preferring higher-reward outcomes, as targeting reward...” (And Turner says he himself was one of them, despite doing a PhD in RL theory.) Like I said above, I think there’s probably room for improvement here: if everyone defined their terms more carefully, this problem would clear up and go away. I see Turner’s post as movement in that direction, but by no means the end of the journey.
Re your first argument: If I understand you correctly, you are saying that if your AI design involves something like Monte Carlo tree search using a reward-estimator module (I don’t know the technical term for that), and the reward-estimator module is trained purely to predict reward, then it’s fair to describe the system as optimizing for the goal of reward. Yep, that seems right to me, modulo concerns about inner alignment failures in the reward-estimator module. I don’t see this as contradicting Alex Turner’s claims, but maybe it does.
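To spell out how I’m picturing that design (a simplified depth-limited lookahead rather than real MCTS, with made-up names; `reward_model` stands for the learned reward-estimator module):

```python
def plan(state, step, reward_model, actions, depth):
    """Depth-limited lookahead that scores imagined futures with a *learned*
    reward estimator. If reward_model predicted true reward perfectly, the
    system's target and 'reward' would coincide; inner-alignment worries are
    about the cases where it doesn't."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in actions:
        next_state = step(state, action)        # world-model prediction
        r_hat = reward_model(state, action)     # learned estimate of reward
        future_value, _ = plan(next_state, step, reward_model, actions, depth - 1)
        if r_hat + future_value > best_value:
            best_value, best_action = r_hat + future_value, action
    return best_value, best_action
```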
Re your second argument, the appeal to authority: I suppose in a vacuum, not having thought about it myself or heard any halfway-decent arguments, I’d defer to the RL field on this matter. But I have thought about it a bit myself, and I have heard some decent arguments, and for me the force of those arguments outweighs the pull of deference, which I think is justified.
Re the appeal to authority: I mostly mentioned it because you asked for an argument and I figured I would just provide any decent ones I thought of off the top of my head. But I have not provided anything close to my full thoughts on the matter, and probably won’t, due to bandwidth.
Often, when an RL agent imagines a possible future roll-out, it does not evaluate whether that possible future is good or bad by querying an external ground-truth reward function; instead, it queries a learned value function. When that’s the case, the thing that the agent is foresightedly “trying” / “planning” to do is to optimize the learned value function, not the reward function. Right?
For example, I believe AlphaZero can be described this way: it explores some number of possible future scenarios (I’m hazy on the details) and evaluates how good they are by querying the learned value function, not the external ground-truth reward function, except in rare cases where the game is just about to end.
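Schematically, the evaluation step I’m describing looks something like this (a cartoon of the general pattern, not AlphaZero’s actual code; `value_net`, `is_terminal`, and `game_outcome` are stand-ins):

```python
def evaluate_imagined_state(state, value_net, is_terminal, game_outcome):
    """How the planner scores a hypothetical future position: the ground-truth
    reward (the game result) is only consulted at terminal states; everywhere
    else the judgment comes from the learned value function."""
    if is_terminal(state):
        return game_outcome(state)  # e.g. +1 / 0 / -1 for win / draw / loss
    return value_net(state)         # learned estimate; this is the quantity
                                    # the search is actually steering toward
```

The search then backs these numbers up the tree; a non-terminal hypothetical never gets scored by the ground-truth reward function.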
I claim that, if we make AGI via model-based RL (as I expect), it will almost definitely be like that too. If an AGI has a (nonverbal) idea along the lines of “What if I try to invent a new microscope using (still-somewhat-vague but innovative concept)”, I can’t imagine how on earth you would build an external ground-truth reward function that can be queried with that kind of abstract hypothetical. But I find it very easy to imagine how a learned value function could be queried with that kind of abstract hypothetical.
(You can say “OK fine but the learned value function will asymptotically approach the external ground-truth reward function”. However, that might or might not be true. It depends on the algorithm and environment. I expect AGIs to be in a nonstationary environment with vastly too large an action space to fully explore, and full of irreversible actions that make full exploration impossible anyway. In that case, we cannot assume that there’s no important difference between “trying” to maximize the learned value function versus “trying” to maximize the reward function.)
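As a toy illustration of why that convergence can fail to matter in practice (made-up numbers, tabular TD(0) updates, nothing realistic): the learned values only get corrected for states the agent actually revisits, so values for unrevisited states can stay stuck at whatever the old reward structure implied.

```python
# Tabular TD(0) toy: value estimates are only updated for visited states.
values = {s: 0.0 for s in range(5)}

def td_update(s, reward, next_s, alpha=0.5, gamma=0.9):
    values[s] += alpha * (reward + gamma * values[next_s] - values[s])

# Phase 1: full exploration; the transition out of state 3 pays reward 1.
for _ in range(200):
    for s in range(4):
        td_update(s, reward=(1.0 if s == 3 else 0.0), next_s=s + 1)

# Phase 2: the environment shifts and the agent only ever revisits state 0,
# where the reward is now 0. States 2 and 3 are never seen again.
for _ in range(200):
    td_update(0, reward=0.0, next_s=1)

# values[2] and values[3] still encode the *old* reward structure, so a planner
# that queries them is "trying" to maximize something other than current reward.
```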
Sorry if I’m misunderstanding. (My own discussion of this topic, in the context of a specific model-based RL architecture, is Section 9.5 here.)