Very impressive results! I’m particularly glad to see text descriptions of the agents’ goals included in their inputs. It’s a step forward in training agents that flexibly follow human instructions.
However, it currently looks like the agents are just using the text instructions as a source of information about how to acquire reward from their explicit reward functions, so this approach won’t produce corrigible agents. Hopefully, we can combine XLand with something like the cooperative inverse reinforcement learning (CIRL) paradigm.
E.g., we could add a CIRL agent to the XLand environments whose objective is to assist the standard RL agents. Then we’d have:
- An RL agent
  - whose inputs are the text description of its goal and its RGB vision + other sensors
  - that gets direct reward signals
- A CIRL agent
  - whose inputs are the text description of the RL agent’s goals and the CIRL agent’s own RGB vision + other sensors
  - that has to infer the RL agent’s true reward from the RL agent’s behavior
Then, apply XLand’s open-ended training, where each RL agent has a variable number of CIRL agents assigned as assistants. Hopefully, we’ll get a CIRL agent that can receive instructions via text and watch the behavior of the agent it’s assisting to further refine its beliefs about its current objective.
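To make the proposed interface concrete, here is a minimal sketch of the observation/reward structure described above. All names and fields (`RLAgentObs`, `AssistantObs`, etc.) are my own invention for illustration, not from the XLand paper or any CIRL implementation:

```python
# Hypothetical observation/reward interface for the proposed two-agent setup.
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class RLAgentObs:
    goal_text: str                   # text description of its own goal
    rgb: np.ndarray                  # its RGB vision
    sensors: Dict[str, np.ndarray]   # other sensors


@dataclass
class AssistantObs:
    principal_goal_text: str         # the RL agent's goal text
    rgb: np.ndarray                  # the CIRL agent's own RGB vision
    sensors: Dict[str, np.ndarray]   # the CIRL agent's other sensors
    principal_trajectory: List[np.ndarray] = field(default_factory=list)
    # Note: no reward field -- the CIRL agent must infer the RL agent's
    # true reward from the trajectory above.


@dataclass
class RLAgentStep:
    obs: RLAgentObs
    reward: float                    # direct reward signal, RL agent only
```

Under open-ended training, each generated task would then produce one `RLAgentStep` stream plus a variable number of `AssistantObs` streams, one per assigned assistant.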
The summary says they use text, and a search for “text” in the paper turns up this on page 32:
“In these past works, the goal usually consists of the position of the agent or a target observation to reach, however some previous work uses text goals (Colas et al., 2020) for the agent similarly to this work.”
So I thought they provided goals as text. I’ll be disappointed if they don’t. Hopefully, future work will do so (and potentially use pretrained LMs to process the goal texts).
What’s the practical difference between “text” and one-hots of said “text”? One-hots are the standard way to input text into models; it is only recently that we have expected models to learn their preferred encoding of raw text (cf. transformers). By taking a small shortcut, the authors of this paper get to show off their agent work without loss of generality: one could still give one-hot instructions to an agent that is learning to act in real life.
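A toy illustration of that point: over a fixed instruction vocabulary, one-hot vectors and the raw strings carry the same information. The vocabulary and goal string below are made up for the example, not taken from the paper:

```python
import numpy as np

# Made-up instruction vocabulary for illustration only.
GOAL_VOCAB = ["hold", "purple", "sphere", "near", "yellow", "cube"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(GOAL_VOCAB)}


def one_hot_goal(goal_text: str) -> np.ndarray:
    """Encode a goal string as a sequence of one-hot vectors."""
    tokens = goal_text.lower().split()
    encoding = np.zeros((len(tokens), len(GOAL_VOCAB)), dtype=np.float32)
    for t, tok in enumerate(tokens):
        encoding[t, TOKEN_TO_ID[tok]] = 1.0
    return encoding


# "hold purple sphere near yellow cube" -> a (6, 6) one-hot matrix.
print(one_hot_goal("hold purple sphere near yellow cube"))

# Future work could instead embed the raw string with a frozen pretrained LM
# and feed that vector to the policy; the agent-side code would not need to
# change, only the goal encoder.
```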