On the basis that it’s in Minecraft, if things were broken down into sub-tasks, some automated evaluation could be reasonable.
I agree this is possible in Minecraft. The point I was trying to make is that we were trying to make environments where you can’t have automated evaluation, because as soon as you have automated evaluation, you can make an automated reward function—and the whole point was to create tasks where you can’t make an automated reward function.
(Technically, since we ban participants from using internal state, we could have tried to create tasks with automated evaluation based on internal state. But when we thought through this we didn’t think of tasks we liked as much.)
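To make the "automated evaluation implies automated reward" point concrete, here's a minimal sketch. The evaluator below is a hypothetical stand-in, not anything in the actual BASALT / MineRL API; the point is just that any programmatic episode-level check can be wrapped straight into a sparse reward function.

```python
# Minimal sketch: any automated episode-level evaluator can be reused as a
# (sparse, terminal) reward function. `evaluate_episode` is a hypothetical
# stand-in for an internal-state check, not a real BASALT/MineRL API.
from typing import Any, Callable

def make_reward_fn(evaluate_episode: Callable[[Any], float]) -> Callable[[Any, bool], float]:
    """Wrap an episode evaluator so it can be handed to an RL algorithm."""
    def reward(internal_state: Any, done: bool) -> float:
        return evaluate_episode(internal_state) if done else 0.0
    return reward

# Hypothetical usage: score 1.0 if the internal state says a waterfall was built.
reward_fn = make_reward_fn(lambda state: float(state.get("waterfall_built", False)))
```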
TrueSkill is patented
Huh, I hadn’t noticed that, will have to look into it (though we are sponsored by Microsoft).
The conjunction of these sounds hilarious—if it’s not the same evaluators, then is it an objective measurement/score?
It is not replicable, i.e. you cannot run the same evaluation process and get the same number out. It should be reproducible, in that if you rerun the evaluation process you should get similar results. (The paper has some discussion of this.)
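As a toy illustration of the distinction, here's a simulation using the open-source trueskill Python package with noisy simulated judges (not our actual evaluation code): rerunning it with a different seed gives different exact numbers, but a similar ordering and gap.

```python
# Toy illustration (assumed setup, not the actual BASALT evaluation code):
# noisy human comparisons give ratings that differ across runs (not replicable)
# but land close together (reproducible).
import random
import trueskill

def evaluate(seed, n_comparisons=200, p_better_wins=0.7):
    """Rate two agents from simulated pairwise human judgments."""
    rng = random.Random(seed)
    agent_a, agent_b = trueskill.Rating(), trueskill.Rating()
    for _ in range(n_comparisons):
        # A simulated "judge" prefers agent A with probability p_better_wins.
        if rng.random() < p_better_wins:
            agent_a, agent_b = trueskill.rate_1vs1(agent_a, agent_b)
        else:
            agent_b, agent_a = trueskill.rate_1vs1(agent_b, agent_a)
    return agent_a.mu, agent_b.mu

print(evaluate(seed=0))
print(evaluate(seed=1))  # different exact numbers, but a similar gap in A's favor
```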
Getting AI to the point where it is judged via subjective, arbitrary standards is maybe more lifelike.
Yeah, I agree with this.
This might be fixable if you’re willing to take Pong and change it.
I agree things like this could be done, but usually I’m not very excited about the result, because it will still feel unrealistic to me on some axis that I think matters. (Though I expect you could easily improve over Atari / MuJoCo, which really are not good benchmarks for LfHF approaches.)
It’s not immediately clear what is meant by this statement.
From the blog post that’s being summarized:
A sketch of such an approach would be:
1. Create a dataset of YouTube videos paired with their automatically generated captions, and train a model that predicts the next video frame from previous video frames and captions.
2. Train a policy that takes actions which lead to observations predicted by the generative model (effectively learning to imitate human behavior, conditioned on previous video frames and the caption).
3. Design a “caption prompt” for each BASALT task that induces the policy to solve that task.
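Here's a very rough code sketch of that structure, just to make the shape of the idea concrete; all module sizes, names, and the dummy tensors are illustrative assumptions rather than an actual implementation.

```python
# Illustrative sketch only: tiny caption-conditioned next-frame model and policy.
# All sizes, names, and the dummy data are assumptions, not the real pipeline.
import torch
import torch.nn as nn

FRAME_DIM, CAPTION_DIM, HIDDEN, N_ACTIONS = 512, 128, 256, 20

class VideoModel(nn.Module):
    """Step 1: predict the next (encoded) frame from past frames and a caption."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FRAME_DIM + CAPTION_DIM, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, FRAME_DIM)

    def forward(self, frames, caption):
        # frames: (batch, time, FRAME_DIM); caption: (batch, CAPTION_DIM)
        cap = caption.unsqueeze(1).expand(-1, frames.shape[1], -1)
        hidden, _ = self.rnn(torch.cat([frames, cap], dim=-1))
        return self.head(hidden)

class Policy(nn.Module):
    """Step 2: map the current frame and caption to action logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FRAME_DIM + CAPTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, N_ACTIONS),
        )

    def forward(self, frame, caption):
        return self.net(torch.cat([frame, caption], dim=-1))

video_model, policy = VideoModel(), Policy()

# Step 1: next-frame prediction on (frames, caption) pairs (dummy tensors here).
frames = torch.randn(8, 16, FRAME_DIM)    # stand-in for encoded YouTube frames
captions = torch.randn(8, CAPTION_DIM)    # stand-in for embedded auto-captions
opt = torch.optim.Adam(video_model.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(video_model(frames[:, :-1], captions), frames[:, 1:])
opt.zero_grad(); loss.backward(); opt.step()

# Step 2 (schematic): train the policy so the observations its actions produce
# match what the video model predicts; a real version would roll this out in
# the environment rather than on dummy tensors.

# Step 3: at deployment, condition the policy on a hand-designed caption prompt
# (e.g. an embedding of "building a waterfall in minecraft").
task_caption = torch.randn(1, CAPTION_DIM)  # stand-in for the embedded prompt
action_logits = policy(frames[0, -1].unsqueeze(0), task_caption)
```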
There’s also the ‘glitch to get huge reward’ issue. Or is that an issue?
That would be an issue if LfHF approaches tended to discover the glitch. However, LfHF approaches usually don’t do this, since humans don’t give feedback that pushes the agent towards the glitch.
Because they’ll change over time? People will say ‘that’s better than not dying, it’s doing fine, it’s just getting started’. And then later ‘this is boring. Move. Do something!’
We’re just talking about evaluation here (i.e. human judgments of the final trained agent’s performance). If you ask a human to judge whether the robot is moving quickly to the right, and then they see the “standing still” policy, they are going to assign that policy a very low score.
You could also try amateur time.
Yeah, that would also be interesting. We didn’t mention it because that’s more commonly done in the existing literature. (Assuming that by “amateur time” you mean “feedback from humans who are amateurs at the game”.)
As compared to the fact that frequently submitting an in-progress comment requires hitting save, then scrolling up, then clicking edit, then scrolling down...every. single. time.
Protip: when you find yourself doing this, consider opening a duplicate tab—one for reading and one for writing the comment.
I’m not sure what demos means in that sentence, but if it means benchmarks...
It meant demonstrations of the task, sorry (I probably should have used the full word “demonstrations”). So, for example, this could be a video of a human creating a waterfall (along with a record of what keys they pressed to do it).
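For concreteness, here's a hedged sketch of what a single step of such a demonstration might look like as data; the field names are illustrative, not the actual dataset schema.

```python
# Illustrative sketch of one step of a recorded demonstration: the observed
# frame plus the human's inputs. Field names are assumptions, not the real
# MineRL/BASALT dataset schema.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class DemonstrationStep:
    frame: np.ndarray                   # RGB observation, e.g. shape (64, 64, 3)
    keys_pressed: List[str]             # e.g. ["w", "space"]
    camera_delta: Tuple[float, float]   # pitch/yaw change this step

# A demonstration is just a list of these steps, one per game tick.
demo: List[DemonstrationStep] = [
    DemonstrationStep(np.zeros((64, 64, 3), dtype=np.uint8), ["w"], (0.0, 2.5)),
]
```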
and the whole point was to create tasks where you can’t make an automated reward function.
I was gesturing towards partial automation, not that the whole thing can’t be automated. (Specifically, find water—unless that can be crafted now, or mods are in play. (Find ice might also work, though.)) Hand-crafted criteria: move, don’t hold still.
Protip: when you find yourself doing this, consider opening a duplicate tab—one for reading and one for writing the comment.
I was referring to the ‘editing a comment’ process. Though thanks for the tip, I do use that a lot.
Pong
Yeah. I mostly found it interesting to think about because it seems like a simpler environment (and might be easier to train), but the results would be a lot less interesting. And at the limit of modification, Edited Pong becomes like Minecraft 2.
Arguably Minecraft is the sort of game within which embedding a mini-game (say, via a primitive redstone controller hooked up to some sort of arcade-machine mockup) could kind of work. (Agents might be uninterested in such a machine if it doesn’t affect their environment at all—and simple reward-distribution setups could just be found by disassembling it.)