What would stump a (naive) exploration-based AI? One may imagine a game as such: the player starts on the left side of a featureless room. If they go to the right side of the room, they win. In the middle of the room is a terminal. If one interacts with the terminal, one is kicked into an embedded copy of the original Doom.
An exploration-based agent would probably discern that Doom is way more interesting than the featureless room, whereas a human would probably put it aside at some point to “finish” exploring the starter room first. I think this demands a sort of mixed breadth-depth exploration?
The famous problem here is the “noisy TV problem”. If your AI is driven to go towards regions of uncertainty, then it will be completely captivated by a TV on the wall showing random images. There’s no need for a copy of Doom; any random gibberish that the AI can’t predict will work.
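To make the failure mode concrete, here is a minimal sketch of the kind of curiosity bonus that gets trapped: reward the agent for the prediction error of a learned forward model. (This is an illustrative toy, not any particular published agent; the network shapes and names are placeholders.)

```python
import torch
import torch.nn as nn

class ForwardModelCuriosity(nn.Module):
    """Intrinsic reward = prediction error of a learned forward model.

    Sketch only: obs_dim/act_dim and the layer sizes are placeholders.
    `act` is assumed to be a float vector (e.g. one-hot for discrete actions).
    """
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def intrinsic_reward(self, obs, act, next_obs):
        pred = self.model(torch.cat([obs, act], dim=-1))
        # Per-transition prediction error is the exploration bonus.
        # For a noisy-TV state, next_obs is irreducible noise, so this
        # error never shrinks and the agent keeps getting paid to stare.
        return ((pred - next_obs) ** 2).mean(dim=-1)

    def loss(self, obs, act, next_obs):
        # Training the forward model on ordinary states drives their
        # bonus to zero; on the noisy TV it cannot.
        return self.intrinsic_reward(obs, act, next_obs).mean()
```

On a learnable part of the world the error shrinks and the bonus dries up; on a screen of static it never does.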
OpenAI claims to have already solved the noisy TV problem via Random Network Distillation, although I’m still skeptical of it. I think it’s a clever hack that only solves a specific subclass of this problem that is relatively superficial.
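For reference, the RND idea is roughly this: instead of predicting the (possibly stochastic) next observation, predict a fixed random embedding of the *current* observation, so the target is deterministic and the error is learnable away. A schematic sketch, with made-up layer sizes:

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation bonus, schematically.

    The target net is frozen and random; the predictor is trained to
    match it. Novel observations give large predictor error (bonus),
    familiar ones give small error.
    """
    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                                    nn.Linear(feat_dim, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target stays random forever

    def bonus(self, obs):
        with torch.no_grad():
            t = self.target(obs)
        # Minimizing bonus.mean() trains the predictor; the same quantity
        # is handed to the policy as intrinsic reward.
        return ((self.predictor(obs) - t) ** 2).mean(dim=-1)
```

(This is aimed specifically at the irreducible-transition-noise version of the problem, which may be why it feels like a fix for a subclass rather than the general case.)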
Well, one may develop an AI that handles the noisy TV by learning that it can’t predict the noisy TV. The idea behind the scenario was to give it a space that is filled with novelty reward but doesn’t lead to any performance payoff.
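One standard way to operationalize “learning that it can’t predict it” is to reward learning progress rather than raw prediction error. A crude sketch of the bookkeeping (this is my assumption about what such an agent would look like, not something stated above):

```python
def learning_progress_bonus(error_before: float, error_after: float) -> float:
    # Measure the forward model's prediction error on the same transitions
    # before and after a training update, and reward the *improvement*.
    # A noisy TV has high but non-improving error, so its bonus stays near
    # zero; a learnable region keeps paying out until it has been mastered.
    return max(0.0, error_before - error_after)
```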
Even defining what is a ‘featureless room’ in full generality is difficult. After all, the literal pixel array will be different at most timesteps (and even if ALE games are discrete enough for that to not be true, there are plenty of environments with continuous state variables that never repeat exactly). That describes the opening room of Montezuma’s Revenge: you have to go in a long loop around the room, timing a jump over a monster that will kill you, before you get near the key which will give you the first reward after hundreds of timesteps. Go-Explore can solve MR and doesn’t suffer from the noisy TV problem because it does in fact do basically breadth+depth exploration (iterative widening), but it also relies on a human-written hack for deciding what states/nodes are novel or different from each other and potentially worth using as a starting point for exploration.
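The hack in question is, roughly, a hand-built notion of “same cell”: greyscale the frame, shrink it to a handful of pixels, and quantize, so that any states which collide under that mapping count as one node in the archive. Something along these lines (parameter values are illustrative, not the paper’s exact settings):

```python
import numpy as np
import cv2  # opencv-python

def cell_key(frame, size=(11, 8), levels=8):
    """Go-Explore-style cell representation (illustrative parameters).

    Two states that map to the same tiny quantized image are treated as
    the 'same' cell; this is the hand-written notion of novelty the
    archive is built on.
    """
    grey = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(grey, size, interpolation=cv2.INTER_AREA)
    quantized = (small // (256 // levels)).astype(np.uint8)
    return quantized.tobytes()  # hashable key into the archive of cells
```

The archive then keeps, per cell, a trajectory known to reach it, and exploration restarts from cells judged promising.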
You could certainly engineer an adversarial learning environment to stump an exploration-based AI, but you could just as well engineer an adversarial learning environment to stump a human. Neither is “naive” because of it in any useful sense, unless you can show that that adversarial environment has some actual practical relevance.
That’s true, but … I feel in most cases it’s a good idea to run mixed strategies. What I mean by naivety is the notion that any single strategy will handle all cases; even if there are strategies for which that is true, it’s wrong for almost all of them.
Humans can be stumped, but we’re fairly good at dynamic strategy selection, which tends to protect us from being reliably exploited.
Have you ever played Far Cry 4? At the beginning of that game, the main villain tells you to sit still while he goes downstairs to deal with some rebels. A normal human player does the expected thing: curiously explore what’s going on downstairs, which leads to the unfolding of the main story and thus the actual gameplay. But if you actually stick to the villain’s instruction and sit still for 12 minutes, it leads straight to the ending of the game.
This is a situation analogous to your scenario, except it’s one where humans reliably fail. Now you could argue that a human player’s goal is to actually play and enjoy the game, so it’s perfectly reasonable to explore and forgo a quick ending. But I bet that even if you offered a novice player a million dollars to finish the game in under 2 hours, he would not think of exploiting this Easter egg.
More importantly, he would have learned absolutely nothing from this experience about how to act rationally (except perhaps to stop believing that anyone would genuinely offer a million dollars out of the blue). The point is, it’s not just possible to rig the game against an agent so that it fails; it’s trivially easy when you have complete control of the environment. But it’s also irrelevant, because that’s not how reality works in general. And I do mean reality, not some fictional story or adversarial setup where things happen because the author says they happen.