I think there are some subtleties here regarding the distinction between RL as a type of reward signal, and RL as a specific algorithm. You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.
I’d also like to hear your opinion on the effect of information leakage. For example, if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).
You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post,
Gradients are magical?
or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.
The arguments apply in this case as well.
if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).
Yeah, what if half of the time, getting to the goal doesn’t give a reward? I think the arguments go through just fine, just training might be slower. Rewarding non-goal completions probably train other contextual computations / “values” into the agent. If reward is always given by hitting the button, I think it doesn’t affect the analysis, unless the agent is exploring into the button early in training, in which case it “values” hitting the button, or some correlate thereof (i.e. develops contextually activated cognition which reliably steers it into a world where the button has been pressed).
You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post,
Gradients are magical?
Gradients through the entire AI are a pretty bad way to do credit assignment. For a functioning AGI I suspect you’d have to do something better, but I don’t know what it is (hence “magic”).
if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).
Yeah, what if half of the time, getting to the goal doesn’t give a reward? I think the arguments go through just fine, just training might be slower. Rewarding non-goal completions probably train other contextual computations / “values” into the agent. If reward is always given by hitting the button, I think it doesn’t affect the analysis, unless the agent is exploring into the button early in training, in which case it “values” hitting the button, or some correlate thereof (i.e. develops contextually activated cognition which reliably steers it into a world where the button has been pressed).
Hmm, it seems like there’s something we could bet on here, especially if you’re just imagining gradient descent.
Maybe we could imagine a fully observable gridworld where the agent does (or fails at) a simple task that’s close to its starting location, and then, after a while, in a different part of the grid an automated system toggles a pattern of buttons. The pattern of buttons at the end of the episode is what actually determines the reward, but the rule mapping button-pattern onto reward is a slightly nontrivial classification rule, so the agent isn’t supposed to catch on too quickly. Also, 99% of the time the button-pattern is chosen to match the task-completion reward, and 1% of the time it’s chosen to give random reward.
I would expect a full-gradient-descent RL agent to learn the task and then never learn to manipulate the buttons, with very high probability so long as randomly flipping the buttons has a high probability of giving very bad reward. If flipping the buttons at random is relatively neutral, I expect a sizeable fraction of gradient descent RL agents to learn to mess with the buttons rather than doing the task, and from there slowly learn to put the buttons into good states.
For a model-based RL agent (e.g. EfficientZero), I would expect a sizeable fraction to learn to manipulate the buttons, even if setting them wrong gives very bad reward, though that fraction might depend on how well-learned the easy task is, and how different the policies are for doing the task vs. going over to the buttons.
Then for an agent deliberately optimized for learning about the world and solving problems that might be hard for gradient descent (e.g. Agent 57), I would expect it to be much more successful about exploring the button-related policies, building a model of them, and learning to get that extra 1% reward by setting the buttons.
These all sound somewhat like predictions I would make? My intended point is that if the button is out of the agent’s easy reach, and the agent doesn’t explore into the button early in training, by the time it’s smart enough to model the effects of the distant reward button, the agent won’t want to go mash the button as fast as possible.
But Agent 57 (or its successor) would go mash the button once it figured out how to do it. Kinda like the salt-starved rats from that one Steve Byrnes post. Put another way, my claim is that the architectural tweaks that let you beat Montezuma’s Revenge with RL are very similar to the architectural tweaks that make your agent act like it really is motivated by reward, across a broader domain.
(Haven’t checked out Agent 57 in particular, but expect it to not have the “actually optimizes reward” property in the cases I argue against in the post.)
I think there are some subtleties here regarding the distinction between RL as a type of reward signal, and RL as a specific algorithm. You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.
I’d also like to hear your opinion on the effect of information leakage. For example, if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).
Gradients are magical?
The arguments apply in this case as well.
Yeah, what if half of the time, getting to the goal doesn’t give a reward? I think the arguments go through just fine, just training might be slower. Rewarding non-goal completions probably train other contextual computations / “values” into the agent. If reward is always given by hitting the button, I think it doesn’t affect the analysis, unless the agent is exploring into the button early in training, in which case it “values” hitting the button, or some correlate thereof (i.e. develops contextually activated cognition which reliably steers it into a world where the button has been pressed).
Gradients through the entire AI are a pretty bad way to do credit assignment. For a functioning AGI I suspect you’d have to do something better, but I don’t know what it is (hence “magic”).
Hmm, it seems like there’s something we could bet on here, especially if you’re just imagining gradient descent.
Maybe we could imagine a fully observable gridworld where the agent does (or fails at) a simple task that’s close to its starting location, and then, after a while, in a different part of the grid an automated system toggles a pattern of buttons. The pattern of buttons at the end of the episode is what actually determines the reward, but the rule mapping button-pattern onto reward is a slightly nontrivial classification rule, so the agent isn’t supposed to catch on too quickly. Also, 99% of the time the button-pattern is chosen to match the task-completion reward, and 1% of the time it’s chosen to give random reward.
I would expect a full-gradient-descent RL agent to learn the task and then never learn to manipulate the buttons, with very high probability so long as randomly flipping the buttons has a high probability of giving very bad reward. If flipping the buttons at random is relatively neutral, I expect a sizeable fraction of gradient descent RL agents to learn to mess with the buttons rather than doing the task, and from there slowly learn to put the buttons into good states.
For a model-based RL agent (e.g. EfficientZero), I would expect a sizeable fraction to learn to manipulate the buttons, even if setting them wrong gives very bad reward, though that fraction might depend on how well-learned the easy task is, and how different the policies are for doing the task vs. going over to the buttons.
Then for an agent deliberately optimized for learning about the world and solving problems that might be hard for gradient descent (e.g. Agent 57), I would expect it to be much more successful about exploring the button-related policies, building a model of them, and learning to get that extra 1% reward by setting the buttons.
These all sound somewhat like predictions I would make? My intended point is that if the button is out of the agent’s easy reach, and the agent doesn’t explore into the button early in training, by the time it’s smart enough to model the effects of the distant reward button, the agent won’t want to go mash the button as fast as possible.
But Agent 57 (or its successor) would go mash the button once it figured out how to do it. Kinda like the salt-starved rats from that one Steve Byrnes post. Put another way, my claim is that the architectural tweaks that let you beat Montezuma’s Revenge with RL are very similar to the architectural tweaks that make your agent act like it really is motivated by reward, across a broader domain.
(Haven’t checked out Agent 57 in particular, but expect it to not have the “actually optimizes reward” property in the cases I argue against in the post.)