These all sound somewhat like predictions I would make? My intended point is that if the button is out of the agent’s easy reach, and the agent doesn’t explore into the button early in training, by the time it’s smart enough to model the effects of the distant reward button, the agent won’t want to go mash the button as fast as possible.
But Agent 57 (or its successor) would go mash the button once it figured out how to do it. Kinda like the salt-starved rats from that one Steve Byrnes post. Put another way, my claim is that the architectural tweaks that let you beat Montezuma’s Revenge with RL are very similar to the architectural tweaks that make your agent act like it really is motivated by reward, across a broader domain.
(Haven’t checked out Agent 57 in particular, but expect it to not have the “actually optimizes reward” property in the cases I argue against in the post.)