a/b Does EfficientZero reward hack? Not necessarily. If by 'reward hacking' you mean some action in the environment that directly stimulates the true reward signal (equivalent perhaps to sex/food/fame/power/etc. in humans), then potentially yes: it would consider plans (up to the limits of its admittedly limited planning horizon) and execute those that maximize reward, which in this case is indistinguishable from reward 'hacking'. But if by hacking you mean some physical intervention on its mind that causes value prediction errors inside the planning rollout evaluation (more like, say, dopamine drugs in humans), then its policy is likely more human-like. It will avoid reward hacking traps to the extent it has a good model of how they are in fact hacks that don't lead to true reward, but could fall into a reward hacking trap otherwise.
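To make that second case concrete, here's a toy sketch (all names and numbers are made up purely for illustration): a planner that scores rollouts with a learned reward model only falls for a 'hack' when that model over-predicts the hack's reward, and avoids it once the model predicts correctly.

```python
# Toy illustration (hypothetical actions and numbers): a planner picks whichever
# plan its *learned* reward model scores highest, so a "hack" only fools it if
# the learned model mis-predicts that plan's true reward.

def plan_value(plan, reward_model):
    """Sum of predicted rewards over a short rollout of the plan."""
    return sum(reward_model[step] for step in plan)

true_reward = {"normal_play": 1.0, "exploit_glitch": 0.0}   # what the env actually pays out
bad_model   = {"normal_play": 1.0, "exploit_glitch": 5.0}   # value prediction error -> falls for the hack
good_model  = {"normal_play": 1.0, "exploit_glitch": 0.0}   # accurate model -> avoids the hack

plans = [["normal_play"], ["exploit_glitch"]]

for name, model in [("bad model", bad_model), ("good model", good_model)]:
    best = max(plans, key=lambda p: plan_value(p, model))
    print(name, "chooses", best, "-> true reward", plan_value(best, true_reward))
```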
EfficientZero approximates evidential decision theory, right? Hmm not sure yet.
EfficientZero is a consequentialist? (I haven't read the linked article yet) Consequentialist vs deontological is a poor philosopher's map of the temporal continuum of the value function (reward prediction). Model-free is basically constrained to near-sighted reward prediction and is thus 'deontological'. This is model-based, so more 'consequentialist', but again it's a continuum.
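One way to see that continuum (a rough sketch with made-up discount factors): how far ahead reward prediction effectively looks is governed by the discount, roughly 1/(1−γ) steps, and model-based planning just pushes the prediction further out before bottoming out in a bootstrapped value.

```python
# Rough sketch: "how consequentialist" an agent is here is really "how far
# ahead its reward prediction reaches" - a continuum, not a binary.

def effective_horizon(gamma):
    # Steps until a discount factor gamma has decayed most of the future away.
    return 1.0 / (1.0 - gamma)

for gamma in (0.9, 0.99, 0.997):
    print(f"gamma={gamma}: effective horizon ~ {effective_horizon(gamma):.0f} steps")
```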
What is the most complex environment ... trained on? Publicly? Atari. This is a brand new paper. Internally within DeepMind? I'm betting robotics and more complex games. With more tweaks and scaling up this could probably do very well in StarCraft, but DM has already burned that publicity stunt with a brittle model-free false champion.
Roughly how many parameters does EfficientZero have? They don't list it directly in the paper, but I'm guessing between 1M and 10M based on glancing over the architecture details in the appendix; it's pretty tiny.
“Generally, compared with MuZero [31], we reduce the number of residual blocks and the number of planes as we find that there is no capability issue caused by much smaller networks in our EfficientZero with limited data. In another word, such a tiny network can acquire good performance in the limited setting.”
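For what it's worth, here's a back-of-envelope way to sanity-check that 1M–10M guess (the block and channel counts below are assumptions for illustration, not the paper's exact configuration): a handful of small residual conv blocks already lands near that range.

```python
# Back-of-envelope parameter count for a small residual conv trunk.
# The channel/block counts below are ASSUMED for illustration, not taken
# from the EfficientZero appendix.

def res_block_params(channels, kernel=3):
    # Two 3x3 convs per residual block (ignoring the norm layers' tiny param count).
    return 2 * (kernel * kernel * channels * channels)

channels, num_blocks = 64, 10          # assumed sizes
trunk = num_blocks * res_block_params(channels)
print(f"~{trunk/1e6:.1f}M parameters in the trunk alone")   # ~0.7M; prediction heads add more
```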
6. If we kept scaling up EfficientZero by OOMs in every way, what would happen? Would it eventually get to agenty AGI?
This model is in a radically different region of mindspace than animals or most DL agents. It's definitely already consequentialist-agenty, as it's full model-free planning. But it has just a tiny insect-sized brain that's just big enough to learn one Atari game, and then it does insanely expensive, superhuman quantities of MCTS planning rollouts with that tiny predictive world model. So as a first guess, I'd model its planning expense multiplier as 4 GPUs / atari_cost, or > 10^8. Simulating a good approximate model of the real world would probably then require a model that's about 10^8 larger (current GPU flops / atari flops), assuming current GPUs have achieved photorealism, which they have. Also, on top of that, its planning is short-horizon and inefficient compared to humans (I have ideas how to fix that, which I'm obviously not going to post here, just in case). This is a demonstrator; there is more work to do to get on a better scaling curve. Humans don't evaluate huge branching planning spaces explicitly MCTS-style; it doesn't scale well with increasing world complexity.
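Spelling out that back-of-envelope arithmetic (the FLOP figures below are placeholder order-of-magnitude guesses, not measured numbers):

```python
# Order-of-magnitude sketch of the "planning expense multiplier" argument.
# All FLOP figures are placeholder guesses, purely to show the arithmetic.

gpu_flops       = 1e14   # ~one modern GPU, rough order of magnitude
atari_sim_flops = 1e6    # very rough cost to step/render an Atari frame
n_gpus          = 4      # roughly what EfficientZero trains on

planning_multiplier = n_gpus * gpu_flops / atari_sim_flops
print(f"planning expense multiplier ~ {planning_multiplier:.0e}")   # ~4e8

# If a realistic world model costs about (gpu_flops / atari_sim_flops) more per
# step than Atari, the same explicit-rollout approach scales very badly.
world_model_scale = gpu_flops / atari_sim_flops
print(f"world-model cost ratio ~ {world_model_scale:.0e}")          # ~1e8
```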
What’s
2.
?
I didn’t answer question 2 yet because I’m not very familiar with evidential decision theory and didn’t have time to read up on it yet.
EDT seems to mean something different every time someone writes an article to prop it up against the argument, made from its infancy, that it recommends trying to change the world by managing the news you receive of it.
When the decider neither acts on the world nor is acted on by it (save for having somehow acquired knowledge about the world), there is only maximisation of expected utility, and no distinction between causal, evidential, or any other decision theory. When the decider is embedded in the world, this is called “naturalized decision theory”, but no-one has a mature example of one.
In the special case where the decider can accurately read the world and act on it, but the world has neither read nor write access to the decision-making process, and no agency in what the decider can know, CDT is correct. In the special case where the decider does not decide at all, but is a passive process that can do nothing but observe its own actions, then EDT is correct. CDT two-boxes on Newcomb and smokes in the Smoking Lesion problem. EDT (as originally formulated) one-boxes and abstains. On LessWrong, I believe the general consensus, or at least, Eliezer’s belief, is to one-box and smoke, ruling out both theories.
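For concreteness, here is the standard Newcomb expected-utility arithmetic that yields those verdicts (the 99% predictor accuracy is an assumed figure):

```python
# Newcomb's problem: box B holds $1,000,000 iff a highly accurate predictor
# foresaw one-boxing; box A always holds $1,000.

p = 0.99          # assumed predictor accuracy
M, K = 1_000_000, 1_000

# EDT conditions on the action: your choice is evidence about the prediction.
edt_one_box = p * M                    # ~$990,000
edt_two_box = (1 - p) * M + K          # ~$11,000

# CDT treats the box contents as already fixed (same probability q either way),
# so two-boxing dominates by $1,000 regardless of q.
q = 0.5                                # whatever credence you had before choosing
cdt_one_box = q * M
cdt_two_box = q * M + K

print("EDT:", edt_one_box, ">", edt_two_box, "-> one-box")
print("CDT:", cdt_two_box, ">", cdt_one_box, "-> two-box")
```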
But there is at least one author (see Eells (1981, 1982) discussed here) arguing that EDT (or his formulation of it) two-boxes on Newcomb.
Recalling a Zen koan, a CDT agent is not subject to causation and an EDT agent is subject to causation, but a naturalized decision theory must be one with causation.
Minor note: I think you meant that it does model-based planning—this is what the graph search means. Also see the paper:
"We propose a sample efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero."
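As a rough sketch of what that model-based planning amounts to (the function names encode/dynamics/predict are illustrative placeholders, not the actual EfficientZero code, and the real algorithm uses MCTS rather than this naive exhaustive rollout): the learned model is unrolled from the current latent state, and candidate action sequences are scored by their predicted rewards plus a bootstrapped value at the leaf.

```python
# Minimal sketch of MuZero/EfficientZero-style planning with a learned model.
# Function names are placeholders; a real implementation would use MCTS.

from itertools import product

def plan(observation, encode, dynamics, predict, actions, depth=3, gamma=0.997):
    """Score every action sequence of length `depth` inside the learned model."""
    root = encode(observation)                       # observation -> latent state
    best_seq, best_return = None, float("-inf")
    for seq in product(actions, repeat=depth):
        state, ret, discount = root, 0.0, 1.0
        for a in seq:
            state, reward = dynamics(state, a)       # learned transition + reward
            ret += discount * reward
            discount *= gamma
        _, value = predict(state)                    # value head bootstraps the leaf
        ret += discount * value
        if ret > best_return:
            best_seq, best_return = seq, ret
    return best_seq[0]                               # act on the first step, then replan
```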