Depends on your standards for "rigorous technical work" and "establishing." In some sense nothing on this topic is sufficiently rigorous, and in some sense nothing on this topic has been established yet. I think the Risks from Learned Optimization paper might be what you are looking for. There's also evhub's recent talk. And of course, TurnTrout's post that was linked above. And again, I'm just pulling these off the top of my head; the ideas in them have been floating around for a while.
I’d be interested to hear an argument that reward is the optimization target, if you’ve got one!
I suspect that this is an issue that will be cleared up by everyone being super careful and explicit and nitpicky about their definitions. (Because I think a big part of what's going on here is that people aren't doing that, so they're getting subtly confused and equivocating between importantly different statements; and then, on top of that, other people are misunderstanding their words.)
Thanks! I don’t think those meet my criteria. I also suspect “everyone being super careful and explicit and nitpicky about their definitions” is lacking, and I’d consider that a basic and essential component of rigorous technical work.
Agreed!
Got an argument that reward is the optimization target?
I don't think this framing of whether reward is the optimization target or not is very helpful. It's like asking "does SGD converge?" or "will my supervised learning model learn the true hypothesis?" The answer will depend on a number of factors, and it's often not best thought of as a binary thing.
e.g. for agents that do planning based on optimizing a reward function, it seems appropriate to say that reward is the optimization target.
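To make that concrete, here's a minimal sketch of the kind of agent I mean (names are hypothetical, and it assumes a known transition model and a small discrete action set): its planning loop directly optimizes the supplied reward function over imagined trajectories.

```python
# Minimal sketch (hypothetical names): an agent that plans by directly
# optimizing a given reward function over imagined trajectories.
# Assumes a known transition model and a small discrete action set.
from itertools import product

def plan(state, transition_fn, reward_fn, actions, horizon=3):
    """Return the action sequence whose imagined rollout has the highest total reward."""
    best_return, best_plan = float("-inf"), None
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            total += reward_fn(s, a)   # the reward function is queried directly...
            s = transition_fn(s, a)    # ...on states imagined via the known model
        if total > best_return:
            best_return, best_plan = total, seq
    return best_plan
```

For an agent built this way, reward is literally the quantity the planning loop maximizes.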
Here’s another argument: maybe it’s the field of RL, and not Alex Turner, who is right about this: https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#Appendix__The_field_of_RL_thinks_reward_optimization_target
(I’m not sure Alex characterizes the field’s beliefs correctly, and I’m sort of playing devil’s advocate with that one (not a big fan of “outside views”), but it’s a bit odd to act like the burden of proof is on someone who agrees with the relevant academic field).
Thanks!
I'm not sure the framing is helpful either, but reading Turner's linked appendix, it does seem like various people are making some sort of mistake that can be summarized as "they seem to think the policy / trained network should be understood as trying to get reward, as preferring higher-reward outcomes, as targeting reward..." (And Turner says he himself was one of them, despite doing a PhD in RL theory.) Like I said above, I think there's probably room for improvement here: if everyone defined their terms better, this problem would clear up and go away. I see Turner's post as movement in this direction, but by no means the end of the journey.
Re your first argument: If I understand you correctly, you are saying that if your AI design involves something like Monte Carlo tree search using a reward-estimator module (I don't know the technical term for that), and the reward-estimator module is just trained to predict reward, then it's fair to describe the system as optimizing for the goal of reward. Yep, that seems right to me, modulo concerns about inner alignment failures in the reward-estimator module. I don't see this as contradicting Alex Turner's claims, but maybe it does.
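Here's a rough sketch of the design I'm imagining (hypothetical names; it assumes the reward-estimator exposes an sklearn-style predict(), and that featurize() maps a state-action pair to a feature vector). The planner optimizes whatever the estimator outputs, which is exactly where inner alignment failures in the estimator would bite.

```python
# Rough sketch (hypothetical names): a planning loop where the ground-truth
# reward function is replaced by a learned reward-estimator module.
# Assumes reward_model exposes an sklearn-style predict() and featurize()
# maps a (state, action) pair to a feature vector.
from itertools import product

def plan_with_reward_model(state, transition_fn, featurize, reward_model,
                           actions, horizon=3):
    best_return, best_plan = float("-inf"), None
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            # Estimated reward, not ground-truth reward: the planner maximizes
            # whatever the estimator outputs, which only tracks true reward
            # insofar as the estimator generalizes well.
            total += float(reward_model.predict([featurize(s, a)])[0])
            s = transition_fn(s, a)
        if total > best_return:
            best_return, best_plan = total, seq
    return best_plan
```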
Re your second argument, the appeal to authority: I suppose in a vacuum, not having thought about it myself or heard any halfway decent arguments, I’d defer to the RL field on this matter. But I have thought about it a bit myself and I have heard some decent arguments, and that effect is stronger than the deference effect for me, and I think this is justified.
Re appeal to authority: I mostly mentioned it because you asked for an argument and I figured I would just provide any decent ones I thought of off the top of my head. But I have not provided anything close to my full thoughts on the matter, and probably won't, due to bandwidth.
e.g. for agents that do planning based on optimizing a reward function, it seems appropriate to say that reward is the optimization target.
Often, when an RL agent imagines a possible future rollout, it does not evaluate whether that possible future is good or bad by querying an external ground-truth reward function; instead, it queries a learned value function. When that's the case, the thing that the agent is foresightedly "trying" / "planning" to do is to optimize the learned value function, not the reward function. Right?
For example, I believe AlphaZero can be described this way: it explores some number of possible future scenarios (I'm hazy on the details) and evaluates how good they are by querying the learned value function, not the external ground-truth reward function, except in rare cases where the game is just about to end.
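Roughly, I have in mind a leaf-evaluation rule like this simplified illustration (not the actual AlphaZero code; all names are hypothetical):

```python
# Simplified illustration (not the actual AlphaZero code): how an imagined
# future position might be scored during search.
def evaluate_imagined_state(state, value_net, is_terminal, game_outcome):
    if is_terminal(state):
        # Rare case: the game is just about to end, so the true outcome
        # (the ground-truth reward signal) is available.
        return game_outcome(state)
    # Usual case: the search is guided by the learned value function.
    return value_net(state)
```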
I claim that, if we make AGI via model-based RL (as I expect), it will almost definitely be like that too. If an AGI has a (nonverbal) idea along the lines of “What if I try to invent a new microscope using (still-somewhat-vague but innovative concept)”, I can’t imagine how on earth you would build an external ground-truth reward function that can be queried with that kind of abstract hypothetical. But I find it very easy to imagine how a learned value function could be queried with that kind of abstract hypothetical.
(You can say “OK fine but the learned value function will asymptotically approach the external ground-truth reward function”. However, that might or might not be true. It depends on the algorithm and environment. I expect AGIs to be in a nonstationary environment with vastly too large an action space to fully explore, and full of irreversible actions that make full exploration impossible anyway. In that case, we cannot assume that there’s no important difference between “trying” to maximize the learned value function versus “trying” to maximize the reward function.)
Sorry if I’m misunderstanding. (My own discussion of this topic, in the context of a specific model-based RL architecture, is Section 9.5 here.)