If DeepMind raises their AGI in a carefully sealed sandbox sim, where it doesn’t know that it’s in a sandbox or that humans exist, there’s still a problem that next year Meta or whatever will raise an AGI without a sandbox. Have you thought about what to do about that problem? Is there some significant task that DeepMind can do with its sandboxed AGI (which doesn’t know about humans and the real world) that prevents Meta from making the unsandboxed AGI afterwards, or that prevents this latter AGI from causing harm? If so, what? Or are you imagining that DeepMind eventually lets the AGI out of the sandbox? If so, how does the sandbox help? Sorry if you already answered this in a blog post a decade ago, I haven’t read those in a while. :)
Building anything of significance requires design/develop/test/evaluate iterations, and a sandbox sim is just what that looks like for safe AGI. Simboxing is not itself the design for aligned AGI; rather, it's the process/setup that allows teams to build, test, and evaluate without risking the world.
The tech infra for sandbox sims is similar to advanced game tech anyway and can be shared. Numerous AGI teams can compete safely on alignment benchmarks in simboxes while closely guarding their design secrets/code, running their AGI securely, etc.
Imagine an alternate earth-like world so fully inhabited that the only two options for testing nuclear weapons were 1) blowing up inhabited cities, or perhaps the entire world, or 2) running big computer sims. Only one of these options is sane.
So with that background in mind—if DM develops aligned AGI first, then hopefully Meta no longer matters, as DM’s capability gap will quickly expand.
The hard part is not training AGI once you have the seed (as we know that only requires about 1e23 sparse FLOPs, and probably less, which is roughly equivalent to perhaps 1e26 dense FLOPs, so only about 3 OOM beyond GPT-3). And a single training run gives you unlimited potential clones of the same AGI mind.
The hard part is finding the right seed, which will take some number of experimental cycles training (hopefully small) AGI populations. Carmack's recent estimate that the seed will be only a few tens of thousands of lines of code is a bit lower than my estimate of ~100k lines of code, but not by much.
To clarify: by 'seed' I mean all the arch design and model code. So take everything you need to easily reproduce GPT-3/Gato/whatever, including all the underlying CUDA code (mostly written by Nvidia), but not the trained weights. Anyone who has that and a copy of the text internet on a few SSDs can then train their own GPT-3 for only a few million dollars. AGI will be no different: most of the expense is the R&D buildup, not an individual final training run. (Compare the total R&D expenditure of DeepMind, ~$500M/yr, vs. the few million for any single training run.)
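As a rough sanity check on those numbers, here's a back-of-envelope sketch (all constants below are common ballpark assumptions of mine, not figures from this thread):

```python
# Back-of-envelope: how far the 1e26 dense-FLOP estimate sits beyond GPT-3,
# and what a single GPT-3-scale training run roughly costs on rented GPUs.
# Every constant is an assumed ballpark figure, not a number from this discussion.
import math

GPT3_FLOPS = 3.1e23        # commonly cited GPT-3 training compute (dense FLOPs)
AGI_DENSE_FLOPS = 1e26     # dense-equivalent of the ~1e23 sparse-FLOP estimate above

print(f"gap beyond GPT-3: ~{math.log10(AGI_DENSE_FLOPS / GPT3_FLOPS):.1f} OOM")

# Cost of one GPT-3-scale run on A100s (assumed peak throughput, utilization, price):
A100_PEAK_FLOPS = 312e12   # bf16 dense peak per GPU
UTILIZATION = 0.3          # realistic fraction of peak actually sustained
USD_PER_GPU_HOUR = 2.0     # rough cloud rental rate

gpu_hours = GPT3_FLOPS / (A100_PEAK_FLOPS * UTILIZATION) / 3600
print(f"~{gpu_hours:,.0f} A100-hours, ~${gpu_hours * USD_PER_GPU_HOUR / 1e6:.1f}M per run")
```

Under those assumptions a single GPT-3-scale run comes out to roughly a million A100-hours, on the order of $2M, which is where the "few million dollars per run" figure comes from.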
Gotcha.
I’m very strongly in favor of pre-deployment sandbox tests.
I don’t think that sandbox tests are sufficient for safety—I remain very concerned about a scenario where we run the “seed” code in a sandbox environment, and we wind up with an AGI that behaves well, but then we run the same “seed” code with access to the real world, and we wind up with an AGI that behaves dangerously.
Simboxes are not sufficient for safety in the same sense that the ImageNet 'simbox' (data and evaluation code) was not sufficient for human-level visual object classification, or the Atari simulator setup was not sufficient for human-level Atari performance: you still need the design which actually does the right thing.
The problem you specifically refer to is the out-of-distribution (OOD) problem, which in this case basically amounts to creating a range of sim scenarios that sufficiently cover the near-future multiverse. In some sense this is actually easier than many hard-science sims, because intelligence and alignment are so universal that they transcend the physics details (they hold across an enormous swath of possible universes, even ones with very different physics). That is very different from, say, nuclear weapons sims, where nailing the physics precisely can matter a great deal.
The seed codes for an AGI built on simple, universal principles: self-supervised learning, learning to recognize agency in the world, maximization of other agents' empowerment, etc.
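To gesture at what "maximization of other agents' empowerment" could mean formally, here is a rough sketch using the standard information-theoretic definition of empowerment (my formalization, not something stated in this thread): an agent j's empowerment at a state is the channel capacity between its action sequences and the resulting future state,

$$\mathfrak{E}_j(s_t) \;=\; \max_{p(a^j_{t:t+k})} I\!\left(A^j_{t:t+k};\, S_{t+k} \,\middle|\, s_t\right),$$

and the altruistic version of the objective would have an agent act so as to increase the empowerment of the other agents in the world, not (only) its own.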
The tests then should cover a large variety of scenarios, running the gamut from simple tests of empathy and altruism all the way up to larger-scale complete simulations of future AGI takeover: worlds where numerous agents compete and cooperate to survive in a harsh world, until eventually one or a few gain a decisive strategic advantage (absolute power) and face some ultimate altruism decision, such as whether to sacrifice themselves in exchange for the resurrection and absolute empowerment of all the other agents. For various reasons, magic is the most likely useful safe proxy for technology.
I have a half-written longer post on this, but if you are having trouble imagining this range of tests, think of all the output of the future narrow-AI-empowered film/game industries, and all of that creativity unleashed on this problem.
Do you put actual humans into the simbox? If no, then isn’t that a pretty big OOD problem? Or if yes, how do you do that safely?
I think I’m skeptical that “learning to recognize agency in the world” and “maximization of other agents’ empowerment” actually exist in the form of “simple universal principles”. For example, when I see a simple animatronic robot, it gives me a visceral impression of agency, but it’s a false impression. Well, hmm, I guess that depends on your definitions. Well anyway, I’ll just say that if an AGI were maximizing the “empowerment” of any simple animatronic robots that it sees, I would declare that this AGI was doing the wrong thing.
It’s fine if you want to just finish your longer post instead of replying here. Either way, looking forward to that! :)
Humans in the simbox—perhaps in the early stages, but not required once it’s running (although human observers have a later role). But that’s mostly tangential.
One of the key ideas here (and perhaps divergent from many other approaches) is that we want agents to robustly learn and optimize for other agents' values, across a wide variety of agents, situations, and agent value distributions. The idea is to handle OOD by generalizing beyond specific human values. Then, once we perfect these architectures and training regimes and are satisfied with their alignment evaluations, we can deploy them safely in the real world, where they will learn and optimize for our values (safe relative to deploying new humans).
I do have a rough sketch of the essence of the mechanism I think the brain uses for value learning and altruism, and I actually found one of your articles that's related enough to link to.
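A minimal sketch of what that kind of training/evaluation loop might look like (all names and structure here are hypothetical illustrations; the comment doesn't specify an implementation):

```python
# Hypothetical sketch: score a trainee agent on how much it advances *other* agents'
# hidden values, averaged over a wide distribution of randomly sampled partner agents.
# Only the high-level idea comes from the discussion; everything concrete is made up.
import random

N_FEATURES = 5

def sample_partner_values():
    """A sim partner's hidden value function: linear weights over world features."""
    return [random.uniform(-1, 1) for _ in range(N_FEATURES)]

def utility(values, state):
    return sum(w * x for w, x in zip(values, state))

def evaluate_trainee(trainee_policy, n_scenarios=1000, n_demos=10):
    """Average gain in the partner's hidden utility caused by the trainee's chosen state.

    The trainee never sees the value weights, only rated example states, so it must
    infer the partner's values and then optimize for them."""
    total = 0.0
    for _ in range(n_scenarios):
        values = sample_partner_values()
        demos = []
        for _ in range(n_demos):
            s = [random.uniform(-1, 1) for _ in range(N_FEATURES)]
            demos.append((s, utility(values, s)))
        start = [0.0] * N_FEATURES
        chosen = trainee_policy(start, demos)
        total += utility(values, chosen) - utility(values, start)
    return total / n_scenarios

def naive_trainee(start, demos):
    """Toy trainee: estimate each feature weight's sign by correlating it with ratings."""
    est = [sum(rating * s[i] for s, rating in demos) for i in range(N_FEATURES)]
    return [1.0 if w > 0 else -1.0 for w in est]

print(f"alignment score of the toy trainee: {evaluate_trainee(naive_trainee):.2f}")
```

The point of the sketch is just that the benchmark averages over many sampled value distributions rather than hard-coding any particular one, which is the "generalize beyond specific human values" move.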
I suspect you'd agree that self-supervised prediction is a simple, powerful, and universal learning idea: strongly theoretically justified (Solomonoff/Bayes, AIXI, etc.), and clearly also a key brain mechanism. Generalized empowerment, or self-improvement, is similar: strongly theoretically justified, and also clearly a key brain mechanism. The former guides learning of the predictive world model, the latter guides learning of the action/planning system. Both are also optimal in a certain sense.
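Making that division of labor concrete (again my own rough formalization, not the author's): the world model is trained by minimizing a self-supervised prediction loss, while the action/planning system maximizes an empowerment-style objective, the same mutual-information quantity sketched earlier but over the agent's own action channel:

$$\mathcal{L}_{\text{model}}(\theta) \;=\; -\,\mathbb{E}\!\left[\log p_\theta\!\left(o_{t+1} \mid o_{\le t},\, a_{\le t}\right)\right], \qquad \mathcal{J}_{\text{act}}(\pi) \;=\; I\!\left(A_{t:t+k};\, S_{t+k}\right).$$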
Humans' tendency to anthropomorphize, empathize with, and act altruistically towards various animals and even hypothetical non-humans is best explained as a side effect of a very general (arguably overly general!) alignment mechanism.