To Amdahl’s law—I think simulating a brain won’t have any big serial bottlenecks. Split the simulation up by physical locality: each machine simulates a little cube of neurons and talks to the machines simulating the six adjacent cubes. You can probably split one em into a million machines and get something like a 500K times speedup. Heck, maybe even more than a million times, because each machine has better memory locality. If your intuition is different, can you explain?
To overclocking—it seems you’re saying parallelization depends on it somehow? I didn’t really understand this part.
A brain has serial bottlenecks in the form of all the communication between neurons, in the same way you can’t simply shard GPT-3-175b onto 175 billion processors to make it run 175 billion times faster. Each compute element is going to be stuck waiting on communication with the adjacent neurons. At some point, you have 1 compute node per neuron or so (this is roughly the sort of hardware you’d expect ems to run on: brain-sized neuromorphic hardware, efficiently implementing something like spiking neurons), and almost all the time is spent idle, waiting for inputs/outputs. At that point, you have saturated your available parallelism and Amdahl’s law rules. Then there’s no easy way to apply more parallelism: if you have some big chunks of brains which don’t need to communicate much and so can be parallelized for performance gains… Then you just have multiple brains.
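A quick sketch of the standard Amdahl’s law calculation; the 0.1% serial fraction below is purely illustrative (a stand-in for time spent waiting on neighbours’ outputs), not a claim about real brains:

```python
# Amdahl's law: if a fraction s of the work is serial (cannot be overlapped),
# the best possible speedup on n compute elements is 1 / (s + (1 - s) / n).
def amdahl_speedup(n, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# Illustrative only: even a 0.1% effective serial fraction caps the speedup
# near 1000x, no matter how many nodes you add.
for n in (10, 1_000, 1_000_000):
    print(f"{n:>9} nodes -> {amdahl_speedup(n, serial_fraction=0.001):,.0f}x")
```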
To overclocking—it seems you’re saying parallelization depends on it somehow? I didn’t really understand this part.
Increasing clock speed has superlinear costs.
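A rough sketch of why: dynamic power scales roughly as C·V²·f, and the supply voltage has to rise with the clock, so power grows roughly with the cube of frequency. The baseline figures below are made up for illustration; real voltage/frequency curves are messier.

```python
# Toy illustration of superlinear overclocking cost: dynamic power is roughly
# P = C * V^2 * f, and V must rise roughly in proportion to f, so P ~ f^3.
# Baseline figures below are arbitrary.
base_power_watts = 100.0

for overclock in (1.0, 1.5, 2.0, 3.0):
    voltage_scale = overclock            # crude assumption: V scales with f
    power = base_power_watts * voltage_scale**2 * overclock
    print(f"{overclock:.1f}x clock -> ~{power:.0f} W")
```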
At that point, you have saturated your available parallelism and Amdahl’s law rules. [...] Then you just have multiple brains.
I think the point (or in any case my takeaway) is that this might be the Giant Cheesecake Fallacy. Initially, there’s not enough hardware for running just a single em on the whole cluster to become wasteful, so that’s what happens instead of running more ems slower, since serial work is more valuable. By the time you run into the limits of how much one em can be parallelized, the parallelized ems have long since invented a process for making their brains bigger, making use of more nodes and preserving the regime in which only a few ems run on most of the hardware. This is more a point about the personal identity of the ems than about computing architecture: a way of “making brains bigger” may well look like “multiple brains”, but they are the brains of a single em, not multiple ems or multiple instances of an em.
My point is, the whole “age of em” might well come and go in the following regime: many neurons per processor, many processors per em, few ems per data center. In this regime, adding more processors to an em speeds up their subjective time almost linearly. You may ask, how can “few ems per data center” stay true? First of all, today’s data centers have something like 100K processors, while one em has 100B neurons and way more synapses, so adding processors will make sense for quite a while. Second of all, it won’t take that much subjective time for a handful of von Neumann-smart ems to figure out how to scale themselves to more neurons per em, allowing “few, smarter ems per data center” to go on longer, which then leads smoothly to the post-em regime.
Also, your mentions of clock speed are still puzzling to me. My whole argument still works if there’s only ever one type of processor with one clock speed fixed in stone.
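Back-of-the-envelope version of the “100B neurons vs. ~100K processors” point, using the same round numbers (illustrative, not precise):

```python
# Round numbers from the comment above; illustrative, not precise.
neurons_per_em = 100e9              # ~100B neurons
processors_per_datacenter = 100e3   # ~100K processors

print(f"{neurons_per_em / processors_per_datacenter:,.0f} neurons per processor")
# -> 1,000,000: each processor is still simulating on the order of a million
#    neurons, so adding processors keeps paying off for a long time.
```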
First of all, today’s data centers have something like 100K processors, while one em has 100B neurons and way more synapses, so adding processors will make sense for quite a while.
Today’s data centers are completely incapable of running whole brains. We’re discussing extremely hypothetical hardware here, so what today’s data centers do is at best a loose analogy. The closest we have today is GPUs and neuromorphic hardware designed to implement neurons at the hardware level. GPUs are already a big pain to run efficiently in clusters because communication between nodes is a major bottleneck on parallelization, and communication between layers within a GPU is also a bottleneck. And neuromorphic hardware (or something like Cerebras) shows that you can create a lot of neurons at the hardware level; it’s not an area I follow in any particular detail, but for example, Intel’s Loihi chip implements 1,024 individual “spiking neural units” per core, with 128 cores per chip, and they combine those in racks of 64 chips, maxing out at 768 chips for a total of roughly 100 million hardware neurons—so we are already far beyond any ‘100K processors’ in terms of total compute elements. I suppose we could wind up having relatively few but very powerful serial compute elements for the first em, but given how strong the pressures have been to go as parallel as possible as soon as possible, I don’t see much reason to expect a ‘serial overhang’.
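Quick multiplication check on those figures; the per-core and per-chip numbers are the ones quoted above and should be treated as approximate:

```python
# Sanity check of the quoted Loihi figures (approximate).
neurons_per_core = 1_024
cores_per_chip = 128
chips = 768

print(f"{neurons_per_core * cores_per_chip * chips:,} hardware neurons")
# -> 100,663,296, i.e. roughly 100 million
```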
Okay, yeah, I had no idea that this much parallelism already existed. There could still be a reason for a serial overhang (serial algorithms have more clever optimizations open to them, and neuron firing could be quite sparse at any given moment), but I’m no longer sure things will play out this way.
You seem to be talking about a compute-dominated process, with almost perfect data locality. I suspect that brain emulation may be almost entirely communication-dominated with poor locality and (comparatively) very little compute. Most neurons in the brain have a great many synapses, and the graph of connections has relatively small diameter.
So emulating any substantial part of a human brain may well need data from most of the brain every “tick”. Suppose emulating a brain in real time takes 10 units per second of compute, and 1 unit per second of data bandwidth (in convenient units where a compute node has 10 units per second of each). So a single node is bottlenecked on compute and can only run at real time.
To achieve 2x speed you can run on two nodes to get 20 units per second of compute capability, but your data bandwidth requirement is now 4 units/second: both nodes need full access to the data, and they need to get it in half the time. After 3x speed-up (three nodes, each needing the full data at three times real time, for 9 units/second), there is no more benefit to adding nodes. They all hit their I/O capacity, and adding more will just slow them all down, since they all need to access every node’s data every tick.
This is even making the generous assumption that links between nodes have the same capacity and no more latency or coordination issues than a single node accessing its own local data.
I’ve obviously just made up numbers to demonstrate scaling problems in an easy way here. The real numbers will depend upon things we still don’t know about brain architecture, and on future technology. The principle remains the same, though: different resource requirements scale in different ways, which yields a “most efficient” speed for given resource constraints, and it likely won’t be at all cost-effective to vary from that by an order of magnitude in either direction.
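To make the toy model concrete, here is a minimal sketch using the made-up numbers above. One added assumption: the cross-node bandwidth budget is treated as a single shared pool equal to one node’s local data access (the “generous assumption” mentioned above); under that reading, 3x is the last speed that fits.

```python
# Toy scaling model from the comment above. All figures are the made-up ones
# from the example; the shared-pool bandwidth budget is an added assumption.
COMPUTE_PER_BRAIN_SECOND = 10   # compute units to emulate 1 second of brain time
DATA_PER_BRAIN_SECOND = 1       # full brain state each node must see per brain-second
NODE_COMPUTE = 10               # compute units/second available per node
SHARED_BANDWIDTH = 10           # units/second of cross-node data access, total

def feasible(speedup, nodes):
    compute_demand = COMPUTE_PER_BRAIN_SECOND * speedup            # per wall-clock second
    bandwidth_demand = nodes * DATA_PER_BRAIN_SECOND * speedup     # every node pulls the full state
    return compute_demand <= nodes * NODE_COMPUTE and bandwidth_demand <= SHARED_BANDWIDTH

for k in range(1, 6):
    print(f"{k}x real time on {k} nodes:", "fits" if feasible(k, k) else "bandwidth-bound")
```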
Yeah, maybe my intuition was pointing a different way: the brain is a physical object, physics is local, and the particular physics governing the brain seems to be very local (signals travel at tens of meters per second). And signals from one part of the brain to another have to cross the intervening space. So if we divide the brain into thousands of little cubes, then each one only needs to be connected to its six neighbors, while having plenty of interesting stuff going on inside—rewiring and so on.
Edit: maybe another aspect of my intuition is that “tick” isn’t really a thing. Each little cube gets a constant stream of incoming activations, at a time resolution much higher than the typical firing time of one neuron, and generates a corresponding outgoing stream. Generating the outgoing stream requires simulating everything in the cube (at a similarly high time resolution), and doesn’t need any information from the rest of the brain except the incoming stream.
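A minimal sketch of that picture, in 1-D for brevity (the 3-D version would give each cube six incoming streams instead of two). Everything here, the grid size, the delay, the toy update rule, is invented for illustration; the only point is that each cube consumes neighbour activity that is several steps old, so nodes can run that many steps ahead of each other before anyone has to wait.

```python
from collections import deque

# Each "cube" keeps one FIFO per incoming link, pre-filled with DELAY steps of
# silence: that models the conduction delay between adjacent cubes, and is
# exactly the slack that lets neighbouring nodes run without a lockstep tick.
DELAY = 5        # conduction delay between adjacent cubes, in simulation steps
N_CUBES = 8
STEPS = 100

states = [float(i) for i in range(N_CUBES)]
links = {(i, (i + d) % N_CUBES): deque([0.0] * DELAY)    # directed link i -> neighbour
         for i in range(N_CUBES) for d in (-1, +1)}

for _ in range(STEPS):
    new_states = []
    for i in range(N_CUBES):
        left = links[((i - 1) % N_CUBES, i)].popleft()   # neighbour activity, DELAY steps old
        right = links[((i + 1) % N_CUBES, i)].popleft()
        new_states.append(0.8 * states[i] + 0.1 * (left + right))  # stand-in for simulating the interior
    states = new_states
    for i, out in enumerate(states):
        links[(i, (i - 1) % N_CUBES)].append(out)        # publish this step's boundary activity
        links[(i, (i + 1) % N_CUBES)].append(out)

print([round(s, 3) for s in states])
```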
Thanks, making use of the relatively low propagation speed hadn’t occurred to me.
That would indeed reduce the scaling of data bandwidth significantly. The constraint would still exist, just not quite as severely. Area versus volume scaling still means that bandwidth comes to dominate compute as speeds increase (with the volume emulated per node decreasing), just not quite as rapidly.
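A toy illustration of that area-versus-volume effect. The total of 1e9 “neuron units” and the six-face boundary cost are arbitrary stand-ins; the point is just that boundary traffic per node scales with L² while work per node scales with L³, so the communication-to-compute ratio grows as the per-node cube shrinks.

```python
# Arbitrary numbers: a node simulating a cube of side L does ~L^3 work per step
# but exchanges ~6*L^2 of boundary traffic with its neighbours, so splitting the
# same volume across more nodes raises the communication-to-compute ratio.
TOTAL_UNITS = 1_000_000_000  # made-up total "neuron units" in the whole brain

for nodes in (1, 1_000, 1_000_000):
    side = (TOTAL_UNITS / nodes) ** (1 / 3)   # units along one edge of a node's cube
    ratio = (6 * side**2) / side**3           # boundary traffic / interior work = 6 / L
    print(f"{nodes:>9} nodes: comm/compute ~ {ratio:.4f}")
```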
I didn’t mean “tick” as a literal physical thing that happens in brains, just a term for whatever time scale governs the emulation updates.