Speculation on why it doesn’t work: In practice, your model of the world only makes good predictions for states and actions that you have already experienced. So searching over actions for the best one either gives you something you have already experienced, or some nonsense action (sort of like an adversarial example for the world model).
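To make that concrete, here is a minimal sketch of the kind of search I mean, assuming a random-shooting planner and a learned model exposed through a hypothetical world_model.predict_reward (all names here are placeholders, not anyone's actual code):

```python
import numpy as np

def search_best_action(world_model, state, n_candidates=1000, action_dim=4):
    """Random-shooting search: sample candidate actions, score each one with
    the learned model's predicted reward, and return the argmax.

    If the model is only accurate near the training distribution, the argmax
    is often an action whose *predicted* reward is high precisely because the
    model is wrong there -- an adversarial example for the world model.
    """
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, action_dim))
    predicted_rewards = np.array(
        [world_model.predict_reward(state, a) for a in candidates]
    )
    return candidates[np.argmax(predicted_rewards)]
```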
Interesting, I wonder how humans avoid generating nonsense actions like this.
I don’t know how you would train an agent end-to-end such that it explicitly learns to search over actions
I was thinking you could train the world model separately at first, manually implement an initial action selection method as a neural network or some other kind of differentiable program, and then let RL act on the agent to optimize it as a whole.
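Something like this very rough sketch, just to pin down the idea (the module names and architecture are hypothetical; the real thing could look quite different):

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predicts next state and reward; pretrained separately on experience."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, state_dim + 1),  # next state + reward
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=-1))
        return out[..., :-1], out[..., -1]  # predicted next state, reward

class SearchPolicy(nn.Module):
    """Manually designed initial action selector: score a set of candidate
    actions by the world model's predicted reward and take a softmax-weighted
    average. Everything is differentiable, so RL can later reshape the whole
    agent end to end (including the candidates and the world model itself)."""
    def __init__(self, world_model, candidate_actions, temperature=1.0):
        super().__init__()
        self.world_model = world_model
        self.candidates = nn.Parameter(candidate_actions)
        self.temperature = temperature

    def forward(self, state):
        # Score every candidate action under the learned world model.
        states = state.expand(len(self.candidates), -1)
        _, rewards = self.world_model(states, self.candidates)
        weights = torch.softmax(rewards / self.temperature, dim=0)
        return weights @ self.candidates  # soft "argmax" over the candidates
```

Since every step is differentiable, a standard policy-gradient update would flow through both the hand-designed selector and the pretrained world model, so RL really would be optimizing the agent as a whole.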
(as opposed to an implicit search that model-free RL algorithms might already be doing)
What kind of implicit search are model-free RL algorithms already doing? If we just keep scaling up model-free RL, can they eventually become goal-directed agents through this kind of implicit search?
Interesting, I wonder how humans avoid generating nonsense actions like this.
Some hypotheses that are very speculative:
Something something explicit reasoning?
Our environment is sufficiently harsh and complex that everything is in-distribution
Our brains are so small and our environment is so harsh and complex that the only way they can get good performance is to have structured, modular representations, which lead to worse in-distribution performance but better out-of-distribution generalization
Some system that lets us know what we know, and only generates actions for consideration where we know what the consequences will be
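A very rough sketch of what that last hypothesis could look like mechanically, using disagreement within an ensemble of world models as a stand-in for "knowing what we know" (all names are hypothetical):

```python
import numpy as np

def filter_known_actions(model_ensemble, state, candidate_actions, max_std=0.1):
    """Keep only candidates whose consequences the ensemble agrees on.

    High disagreement between the models is treated as "I don't know what this
    action does", and such actions never enter the search at all."""
    kept = []
    for action in candidate_actions:
        predictions = np.stack(
            [m.predict_next_state(state, action) for m in model_ensemble]
        )
        if predictions.std(axis=0).mean() < max_std:
            kept.append(action)
    return kept
```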
What kind of implicit search are model-free RL algorithms already doing?
I don’t know. This is mostly an expression of uncertainty about what model-free RL agents are doing. Maybe some of the multiplications and additions going on in there turn out to be equivalent to a search over actions. Maybe not.
My intuition says “nah, our current environments are all simple enough that you can solve them by using heuristics to compute actions, and the training process is going to distill those heuristics into the policy rather than turning the policy into a search algorithm”. But even if I trust that intuition, there is some level of environment complexity at which this would stop being true, and I don’t trust my intuition on what that level is.
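To illustrate the distinction I'm drawing, here are the two extremes side by side (purely illustrative; model.predicted_return is a hypothetical stand-in). From the outside both are just policies; only the second is doing anything search-like internally:

```python
import numpy as np

def heuristic_policy(state, weights):
    """Direct mapping: fixed arithmetic from state to action. This is what
    'distilling heuristics into the policy' looks like."""
    return np.tanh(weights @ state)

def searching_policy(state, model, candidate_actions):
    """Explicit search: evaluate each option with an internal model and pick
    the best. If a scaled-up model-free policy's multiplications and additions
    ever become equivalent to this, that is what I'd call implicit search."""
    scores = [model.predicted_return(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]
```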
If we just keep scaling up model-free RL, can they eventually become goal-directed agents through this kind of implicit search?
Plausibly, but plausibly not. I have conflicting not-well-formed intuitions that pull in both directions.