This seems like a very good research direction. I’m not too familiar with RL, so I probably won’t pursue it myself, though. I do have three suggestions:
For testing out-of-distribution transfer, one option is to move the agents to a different environment simulator. I expect this will do a better job of representing the distributional shift incurred by deploying into the real world.
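To make this concrete, here’s a minimal sketch of such a zero-shot transfer evaluation in a Gymnasium-style API. The environment IDs and the `train` routine are hypothetical placeholders; any two simulators with matching interfaces would do:

```python
# Hypothetical sketch: train in one simulator, evaluate zero-shot in another.
# "SimulatorA-v0", "SimulatorB-v0", and train() are placeholders.
import gymnasium as gym

train_env = gym.make("SimulatorA-v0")  # simulator used for training
test_env = gym.make("SimulatorB-v0")   # held-out simulator for OOD evaluation

# Transfer only makes sense if the interfaces line up.
assert train_env.observation_space == test_env.observation_space
assert train_env.action_space == test_env.action_space

policy = train(train_env)  # placeholder for whatever RL training you use

# Zero-shot evaluation in the unseen simulator.
returns = []
for _ in range(100):
    obs, _ = test_env.reset()
    done, ep_return = False, 0.0
    while not done:
        obs, reward, terminated, truncated, _ = test_env.step(policy(obs))
        ep_return += reward
        done = terminated or truncated
    returns.append(ep_return)
print(f"mean return under distributional shift: {sum(returns) / len(returns):.2f}")
```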
Consider feeding the agents their goals via natural-language instructions, and accompanying their RL training with a BERT-like language-modeling objective. Agents trained to follow natural-language instructions seem vastly more useful (and slightly more aligned) than agents limited to receiving instructions via bitcodes, which is what DeepMind did. I expect future versions of XLand-like RL pretraining will do something like this.
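For illustration, here’s a rough sketch of what I have in mind (my own guess at a setup, not DeepMind’s): condition the policy on an encoding of the instruction, and keep a BERT-style masked-language-modeling loss as an auxiliary objective. The observation/action dimensions, the policy head, and the 0.1 loss weight are all made-up assumptions:

```python
# Rough sketch of instruction-conditioned RL with an auxiliary MLM loss.
# OBS_DIM, N_ACTIONS, the policy head, and the loss weight are assumptions.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM

OBS_DIM, N_ACTIONS = 64, 10  # hypothetical environment sizes

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertForMaskedLM.from_pretrained("bert-base-uncased")
policy_head = nn.Linear(768 + OBS_DIM, N_ACTIONS)

def action_logits(instruction: str, obs: torch.Tensor) -> torch.Tensor:
    """Condition the policy on the [CLS] embedding of the instruction."""
    tokens = tokenizer(instruction, return_tensors="pt")
    goal = bert.bert(**tokens).last_hidden_state[:, 0]  # encoder only
    return policy_head(torch.cat([goal, obs.unsqueeze(0)], dim=-1))

def mlm_loss(instruction: str) -> torch.Tensor:
    """BERT-like objective: mask ~15% of tokens and predict them."""
    tokens = tokenizer(instruction, return_tensors="pt")
    labels = tokens.input_ids.clone()
    mask = torch.rand(labels.shape) < 0.15
    tokens.input_ids[mask] = tokenizer.mask_token_id
    labels[~mask] = -100  # compute loss only on the masked positions
    return bert(**tokens, labels=labels).loss

# During training, something like: total_loss = rl_loss + 0.1 * mlm_loss(instruction)
logits = action_logits("go to the purple pyramid", torch.randn(OBS_DIM))
```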
Consider using the Perceiver IO architecture (https://arxiv.org/abs/2107.14795). It’s a new transformer variant with time complexity linear in its input length; it’s explicitly designed to smoothly handle inputs of arbitrary dimensions and mixed modalities, and it can also produce outputs of arbitrary dimensions and mixed modalities. I think it will turn out to be far more flexible than current transformers/CNNs.
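To show why I think it’s flexible, here’s a back-of-the-envelope sketch of the Perceiver IO pattern as I read it from the paper (all sizes are arbitrary, and this is not the authors’ code): a small latent array cross-attends to the input, self-attention runs only among the latents, and outputs of whatever shape you need are read out by cross-attending from learned output queries.

```python
# Sketch of the Perceiver IO pattern: encode (cross-attend) -> process
# (latent self-attention) -> decode (query cross-attention). All sizes
# here are arbitrary choices for illustration.
import torch
import torch.nn as nn

class PerceiverIOSketch(nn.Module):
    def __init__(self, in_dim=64, latent_dim=128, n_latents=256,
                 out_dim=32, n_outputs=10, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, latent_dim))
        self.out_queries = nn.Parameter(torch.randn(n_outputs, latent_dim))
        self.encode = nn.MultiheadAttention(latent_dim, n_heads,
                                            kdim=in_dim, vdim=in_dim,
                                            batch_first=True)
        self.process = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True),
            num_layers=4)
        self.decode = nn.MultiheadAttention(latent_dim, n_heads,
                                            batch_first=True)
        self.out_proj = nn.Linear(latent_dim, out_dim)

    def forward(self, x):                   # x: (batch, M, in_dim), M arbitrary
        b = x.shape[0]
        z = self.latents.expand(b, -1, -1)  # (batch, N, latent_dim), N << M
        z, _ = self.encode(z, x, x)         # cost O(M*N): linear in input length
        z = self.process(z)                 # O(N^2), independent of M
        q = self.out_queries.expand(b, -1, -1)
        y, _ = self.decode(q, z, z)         # query as many outputs as needed
        return self.out_proj(y)             # (batch, n_outputs, out_dim)

model = PerceiverIOSketch()
print(model(torch.randn(2, 5000, 64)).shape)  # torch.Size([2, 10, 32])
```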
Thank you!
I agree that switching the simulator could be useful where feasible (you’d need another simulator with compatible state and action spaces and somewhat similar dynamics).
It indeed seems pretty plausible that instructions will be given in natural language in the future. However, I’m not sure that would affect scaling very much, so I’d focus scaling experiments on the simpler case without NLP, for which learning has already been shown to work.
IIRC, transformers can be quite difficult to get to work in an RL setting. Perhaps this is different for Perceiver IO, but I cannot find any statements about this in the paper you linked.