It’s not a question of “making safety more likely” vs “guarantees”. Either we will basically figure out how to make an AGI which does not need a box, or we will probably die. At the point where there’s an unfriendly decently-capable AGI in a box, we’re probably already dead. The box maybe shifts our survival probability from epsilon to 2*epsilon (conditional on having an unfriendly decently-capable AGI running). It just doesn’t increase our survival probability by enough to be worth paying attention to, if that attention could otherwise be spent on something with any hope at all of increasing our survival probability by a nontrivial amount.
The main reason to bother boxing at all is that it takes relatively little marginal effort. If there’s nontrivial effort spent on it, then that effort would probably be more useful elsewhere.
Either we will basically figure out how to make an AGI which does not need a box, or we will probably die.
AGI will likely be DL-based, and like just about any complex engineered system, it will require testing. Only fools would unbox AGI without extensive alignment testing (yes, you can test for alignment, but only in simboxes where the AGI is not aware of the sim).
So boxing (simboxing) isn’t some optional extra safety feature; it is absolutely core and essential.
At the point where there’s an unfriendly decently-capable AGI in a box, we’re probably already dead.
Nah, human-level AGI isn’t much of a risk—as long as it’s in a proper simbox. Most of the risk comes from our knowledge, i.e. the internet.
I’m not saying one should forego the box. I’m saying the box does not shift our survival probability by very many bits. If our chances of survival are low without the box, they’re still low with it.
Whether the boxed AI is capable enough to break out of the box isn’t even particularly relevant; the problem is relying on iterative design to achieve alignment in the first place.
I actually agree with the preamble of that post:
In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.
By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails.
So far, so good. Notice here you are essentially saying that iterative design is all-important, and completely determines survival probability—the opposite of “does not shift our survival probability by very many bits”.
So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.
Or you could do the obvious thing and … focus on ensuring you can safely iterate.
Fast takeoff is largely ruled out by physics, and moreover can be completely constrained in a simbox.
Eventual deceptive alignment is the main failure mode, and that’s specifically what simboxing is for. (The AI can’t deceive us if it doesn’t believe we exist).
But also, just to clarify, simboxing is essential because it enables iteration, but simboxing/iteration itself isn’t an alignment solution.
Fast takeoff is largely ruled out by physics, and moreover can be completely constrained in a simbox
This comment prompted me to finally read the linked post, which was very enjoyable, well done. It seems to have little-to-nothing to do with fast takeoff, though; I don’t think most people who expect fast takeoff primarily expect it to occur via more efficient compute hardware or low-level filter circuits. Those with relatively conservative expectations expect fast takeoff via the ability to spin up copies, which makes coordination a lot easier and also saves needing each agent to independently study and learn existing information. Those with less conservative expectations also expect algorithmic improvements at the higher levels, more efficient information gathering/attention mechanisms, more efficient search/planning, more generalizable heuristics/features and more efficient ways to detect them, etc.
Thanks.
Fast takeoff traditionally implies time from AGI to singularity measured in hours or days, which you just don’t get with merely mundane improvements like copying or mild algorithmic advances. EY (and perhaps Bostrom to some extent) anticipated fast takeoff explicitly enabled by many OOMs of brain inefficiency, such that the equivalent of many decades of Moore’s Law could be compressed into mere moments. The key rate limiter in these scenarios ends up being the ability to physically move raw materials through complex supply chain processes to produce more computing substrate, which is bypassed through the use of hard Drexlerian nanotech.
But it turns out that biology is already near optimal-ish (cells in particular already are essentially optimal nanoscale robots; thus drexlerian nanotech is probably a pipe dream), so that just isn’t the world we live in.
Quoting Yudkowsky’s Intelligence Explosion Microeconomics, page 30:
What sort of general beliefs does this concrete scenario of “hard takeoff” imply about returns on cognitive reinvestment? It supposes that:
An AI can get major gains rather than minor gains by doing better computer science than its human inventors.
More generally, it’s being supposed that an AI can achieve large gains through better use of computing power it already has, or using only processing power it can rent or otherwise obtain on short timescales—in particular, without setting up new chip factories or doing anything else which would involve a long, unavoidable delay.
An AI can continue reinvesting these gains until it has a huge cognitive problem-solving advantage over humans.
… so Yudkowsky’s picture of hard takeoff explicitly does not route through inefficiency in the brain’s compute hardware, it routes through inefficiency in algorithms. He’s expecting the Drexlerian nanotech to come at the end of the hard takeoff; nanotech is not the main mechanism by which hard takeoff is enabled.
The core idea of hard takeoff is that algorithmic advances can get to superintelligence without needing to build lots of new hardware. Your brain efficiency post doesn’t particularly argue against that.
That quote seemed to disagree so much with my model of early EY that I had to go back and re-read it. And I now genuinely think my earlier summary is still quite accurate.
[pg 29]:
Some sort of AI project run by a hedge fund, academia, Google, or a government, advances to a sufficiently developed level (see section 3.10) that it starts a string of self-improvements that is sustained and does not level off. This cascade of self-improvements might start due to a basic breakthrough by the researchers which enables the AI to understand and redesign more of its own cognitive algorithms …
Once this AI started on a sustained path of intelligence explosion, there would follow some period of time while the AI was actively self-improving, and perhaps obtaining additional resources, but hadn’t yet reached a cognitive level worthy of being called “superintelligence.” This time period might be months or years, or days or seconds. At some point the AI would reach the point where it could solve the protein structure prediction problem and build nanotechnology—or figure out how to control atomic force microscopes to create new tool tips that could be used to build small nanostructures which could build more nanostructures—or perhaps follow some smarter and faster route to rapid infrastructure. An AI that goes past this point can be considered to have reached a threshold of great material capability. From this would probably follow cognitive superintelligence (if not already present); vast computing resources could be quickly accessed to further scale cognitive algorithms.
Notice I said “Fast takeoff traditionally implies time from AGI to singularity measured in hours or days, which you just don’t get with merely mundane improvements like copying or mild algorithmic advances.”—which doesn’t disagree with anything here, as I was talking about time from AGI to singularity, and regardless EY indicates rapid takeoff to superintelligence probably requires Drexlerian nanotech.
AGI → Superintelligence → Singularity
Also EY clearly sees nanotech as the faster replacement for slow chip foundry cycles, as I summarized:
… Given a choice of investments, a rational agency will choose the investment with the highest interest rate—the greatest multiplicative factor per unit time. In a context where gains can be repeatedly reinvested, an investment that returns 100-fold in one year is vastly inferior to an investment which returns 1.001-fold in one hour. At some point an AI’s internal code changes will hit a ceiling, but there’s a huge incentive to climb toward, e.g., the protein-structure-prediction threshold by improving code rather than by building chip factories
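As a quick sanity check of the quoted rates (this arithmetic is mine, not part of the quote): compounding 1.001-fold per hour over a year gives
$$1.001^{24 \times 365} = 1.001^{8760} \approx e^{8760 \ln(1.001)} \approx e^{8.76} \approx 6.4 \times 10^{3},$$
i.e. roughly a 6,400-fold annual return, compared with 100-fold for the yearly investment, hence “vastly inferior.”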
Without Drexlerian nanotech to smash through the code-efficiency ceiling, the only alternative is the slower chip foundry route, which of course is also largely stalled if brains are efficient and already equivalent to end-of-Moore’s-Law tech.
Regardless, in the brain efficiency post I also argue against many OOMs of brain software inefficiency. (If anything, the brain’s incredible data efficiency is increasingly looking like a difficult barrier for AGI.)
I think there’s some inconsistent usage of “superintelligence” here. IIRC Yudkowsky also mentioned somewhere that he doesn’t expect humans to be able to build nanotech any time soon without AGI, therefore presumably he expects the AGI needs to be very superhuman to build nanotech. His fast takeoff scenario therefore involves the AGI reaching very superhuman levels before it starts to invest in manufacturing. But he’s using the term “superintelligence” for something quite a bit more powerful than just “very superhuman”.
For strategic purposes, it’s the weaker version (i.e. “very superhuman”) which is mostly relevant.
Regardless, in the brain efficiency post I also argue against many OOMs of brain software inefficiency.
You argued that current DL systems are mostly less data-efficient than the brain (and at best about the same). That is extremely weak evidence that nothing more data-efficient than the brain exists. And you didn’t argue at all about any other dimensions of reasoning algorithms—e.g. search efficiency, ability to transport information/models to new domains, model expressiveness, efficiency of plans, coordination/communication, metastuff, etc.
I think you are missing the forest of my argument for its trees. The default hypothesis—the one that requires evidence to update against—is now that the brain is efficient in most respects, rather than the converse.
The larger update is that evolution is both fast and efficient. It didn’t proceed through some slow analog of Moore’s Law in which initial, terribly inefficient designs are slowly improved. Biological evolution developed near-optimal nanotech quickly, and then slowly built up larger structures. It moved slowly only because it was never optimizing for intelligence at all, not because it is inherently slow and inefficient. But intelligence is often useful, so eventually it developed near-optimal designs for various general learning machines—not in humans, but in much earlier brains.
Human brains are simply standard primate brains, scaled up, with a few tweaks for language. The phase transition around human intelligence is entirely due to language adding another layer of systemic organization (like the multicellular transition), and due to culture allowing us to learn from all past human experience, so that our (compressed) training dataset scales with our exponentially growing population rather than being essentially constant, as it is for animals.
Deep learning is simply reverse engineering the brain (directly and indirectly), and this was always the only viable path to AGI [1]. Based on the large amount of evidence we have from DL and neuroscience, it’s fairly clear (to me at least) that the brain is also probably near optimal in data efficiency (in predictive gain per bit of sensor data per unit of compute—not to be confused with sample efficiency, which you can always improve at the cost of more compute).
Of course AGI will have advantages (mostly in expanding beyond the limitations of human lifetimes and the associated brain sizes and slow interconnect); but overall it’s more like the beginning of a Cambrian explosion that is a natural continuation of biological brain evolution, rather than some alien invasion.
At this point we have actually heavily explored the landscape of Bayesian learning algorithms, and huge surprises are unlikely.
The default hypothesis—the one that requires evidence to update against—is now that the brain is efficient in most respects, rather than the converse.
I think you have basically not made that case, certainly not to such a degree that people who previously believed the opposite will be convinced. You explored a few specific dimensions—like energy use, heat dissipation, circuit depth. But these are all things which we’d expect to have been under lots of evolutionary pressure for a long time. They’re also all relatively “low-level” things, in the sense that we wouldn’t expect to need unusually intricate genetic machinery to fine-tune them; we’d expect all those dimensions to be relatively accessible to evolutionary exploration.
If you want to make the case that brain efficiency is the default hypothesis, then you need to argue it in cases where the relevant capabilities weren’t obviously under lots of selection pressure for a long time (e.g. recently acquired capabilities like language), or where someone might expect architectural complexity to be too great for the genetic information bottleneck. You need to address at least some “hard” cases for brain efficiency, not just “easy” cases.
Or, another angle: I’d expect that, by all of the efficiency measures in your brain efficiency post, a rat brain also looks near-optimal. Therefore, by analogy to your argument, we should conclude that it is not possible for some new biological organism to undergo a “hard takeoff” (relative to evolutionary timescales) in intelligent reasoning capabilities. Where does that argument fail? What inefficiency in the rat brain did humanity improve on? If it was language, why do you expect that the apparently-all-important language capability is near-optimal in humans? Also, why do we expect there won’t be some other all-important capability, just like language was a new super-important capability in the rat → human transition?
I already made much of the brain architecture/algorithms argument in an earlier post: “The Brain as a Universal Learning Machine”.
In a nutshell, EY/LW folks got much of their brain model from the heuristics-and-biases and ev-psych literature, which is based on the evolved modularity hypothesis, and that hypothesis turned out to be almost completely wrong. So just by reading the sequences and associated lit, LW folks have unfortunately picked up a fairly inaccurate default view of the brain.
In a nutshell, the brain is a very generic/universal learning system built mostly out of a few different complementary types of neural computronium (cortex, cerebellum, etc.), plus an actual practical recursive self-improvement learning system that rapidly learns efficient circuit architecture from lifetime experience. The general meta-architecture is not specific to humans, primates, or even mammals, and in fact is highly convergent and conserved—evolution found and preserved it again and again across wildly divergent lineages. So there isn’t much room for improvement in architecture; most of the improvement comes simply from scaling.
Nonetheless there are important differences across the lineages: primates, along with some birds and perhaps some octopoda, have the most scaling-efficient architectures in terms of neuron/synapse density, but these differences are most likely due to diverging optimization pressures along a Pareto efficiency frontier.
The differences in brain capabilities are then mostly just scaling differences: human brains are just 4x scaled-up primate brains, with nearly zero detectable divergences from the core primate architecture (brain size is not a static feature of the arch; the arch also defines a scaling plan, so you can think of size as a tunable hyperparameter with many downstream modifications to the wiring prior). Rodent brain architecture probably has the worst scaling plan; rodents are likely optimized for speed and rarely grew large.
I think Yudkowsky used to expect improvements in the low-level compute too, e.g. “Still, pumping the ions back out does not sound very adiabatic to me?”.
Did you actually read the rest of that post? Because the entire point was to talk about ways iterative design fails other than fast takeoff and the standard deceptive alignment story.
Or you could do the obvious thing and … focus on ensuring you can safely iterate.
The question is not whether one can iterate safely, the question is whether one can detect the problems (before it’s too late) by looking at the behavior of the system. If we can’t detect the problems just by seeing what the system does, then iteration alone will not fix the problems, no matter how safe it is to iterate. In such cases, the key thing is to expand the range of problems we can detect.
Did you actually read the rest of that post? Because the entire point was to talk about ways iterative design fails other than fast takeoff and the standard deceptive alignment story.
I skimmed the rest, but it mostly seems to be about how particular alignment techniques (e.g. RLHF) may fail, or the difficulty/importance of measurement, which I probably don’t have much disagreement with. Also, in general, the evidence required to convince me of some core problem with iteration would be strictly enormous—as it is inherent to all evolutionary processes (biological or technological).
If we can’t detect the problems just by seeing what the system does, then iteration alone will not fix the problems, no matter how safe it is to iterate. In such cases, the key thing is to expand the range of problems we can detect.
Yes. Again, (safe) iteration is necessary, but not sufficient. A wind tunnel isn’t a solution for aerodynamic control; rather it’s a key enabling catalyst. You also need careful, complete tests for alignment, various ways to measure it, etc.
There’s a difference between “Iterative design” and “Our ability to impact iterative design.” I think John is saying in his post that iterative design is an important attribute of the problem (i.e., whether the AI alignment problem is amenable to iterative design), but in the comment above, he’s saying iterative design techniques aren’t super important, because if iterative design won’t work, they’re useless—and if iterative design will work, we’re probably okay without the box anyway, even though of course we should still use it.
Which is ridiculous because it is the simbox alone which allows iterative design.
Replace “AI alignment” with “flight” and “box” with “wind tunnel”:
There’s a difference between “Iterative design” and “Our ability to impact iterative design.” I think John is saying in his post that iterative design is an important attribute of the problem (i.e., whether the flight control problem is amenable to iterative design), but in the comment above, he’s saying iterative design techniques aren’t super important, because if iterative design won’t work, they’re useless—and if iterative design will work, we’re probably okay without the wind tunnel anyway, even though of course we should still use it.
The wind tunnel is not a great analogy here since it fails to get at the main disagreement—if you test an airplane in a wind tunnel and it fails catastrophically, it doesn’t then escape the wind tunnel and crash in real life. Given that, it is safe to test flight methods in a wind tunnel and build on them iteratively. (Note: I’m not trying to be pedantic about analogies here—I believe that the wind tunnel argument fails to replicate the core disagreement between my understanding of you and my understanding of John)
John says “Either we will basically figure out how to make an AGI which does not need a box, or we will probably die. At the point where there’s an unfriendly decently-capable AGI in a box, we’re probably already dead.” My understanding is that John is quite pessimistic about an AGI being containable by a simbox if it is otherwise misaligned. If this is correct, that makes the simbox relatively unimportant—the set of AGIs that are safe to deploy into a simbox but unsafe to deploy into the real world is very small, and that’s why it doesn’t shift survival probability very much.
It would still be dumb not to use it, because any tiny advantage is worth taking, but it’s not going to be a core part of the solution to alignment—we should not depend on a solution that plans on iterating in the simbox until we get it right. As opposed to a wind tunnel, where you can totally throw a plane in there and say “I’m pretty sure this design is going to fail, but I want to see how it fails” and this does not, in fact, cause the plane to escape into the real world and destroy the world.
Now, you might think that a well-designed simbox would be very likely to keep a potentially misaligned AGI contained, and thus the AI alignment problem is probably amenable to iterative design. That would then narrow down the point of disagreement.
Yes, a well-designed simbox can easily contain AGI, just as you or I aren’t about to escape our own simulation.
Containment actually is trivial for AGI that are grown fully in the sim. It doesn’t even require realism: you can contain AGI just fine in a cartoon world, or even a purely text based world, as their sensor systems automatically learn the statistics of their specific reality and fully constrain their perceptions to that reality.
I strongly agree with John that “what we really want to do is to not build a thing which needs to be boxed in the first place.” This is indeed the ultimate security mindset.
I also strongly agree that relying on a “fancy,” multifaceted box that looks secure due to its complexity, but may not be (especially to a superintelligent AGI), is not security mindset.
One definition of security mindset is “suppose that anything that could go wrong, will go wrong.” So, even if we have reason to believe that we’ve achieved an aligned superintelligent AGI, we should have high-quality (not just high-quantity) security failsafes, just in case our knowledge does not generalize to the high-capabilities domain. The failsafes would help us efficiently and vigilantly test whether the AGI is indeed as aligned as we thought. This would be an example of a security mindset against overconfidence in our current assumptions.