OK… although I notice that everybody in the initial post is just assuming you could run the uploads without providing any arguments.
Human brains have probably more than 1000 times as many synapses as current LLMs have weights. All the values describing the synapse behavior have to be resident in some kind of memory with a whole lot of bandwidth to the processing elements. LLMs already don’t fit on single GPUs.
Unlike transformers, brains don’t pass nice compact contexts from layer to layer, so splitting them across multiple GPU-like devices is going to slow you way down because you have to shuttle such big vectors between them… assuming you can even vectorize most of it at all given the timing and whatnot, and don’t have to resort to much discrete message passing.
It’s not even clear that you can reduce a biological synapse to a single weight; in fact you probably can’t. For one thing, brains don’t run “inference” in the way that artificial neural networks do. They run forward “inference-like” things, and at the same time do continuous learning based on feedback systems that I don’t think are well understood… but definitely are not back propagation. It’s not plausible that a lot of relatively short term tasks aren’t dependent on that, so you’re probably going to have to do something more like continuously running training than like continuously running inference.
There are definitely also things going on in there that depend on the relative timings of cascades of firing through different paths. There are also chemicals sloshing around that affect the ensemble behavior of whole regions on the scale of seconds to minutes. I don’t know about brains specifically, but I do know that there exist biological synapses that aren’t just on or off, either.
You can try to do dedicated hardware, and colocate the “weights” with the computation, but then you run into the problem that biological synapses aren’t uniform. Brains actually do have built-in hardware architectures, and I don’t believe those can be replicated efficiently with arrays of uniform elements of any kind… at least not unless you make the elements big enough and programmable enough that your on-die density goes to pot. If you use any hardwired heterogeneity and you get it wrong, you have to spin the hardware design, which is Not Cheap (TM). You also lose density because you have to replicate relatively large computing elements instead of only replicating relatively dense memory elements. You do get a very nice speed boost on-die, but at a wild guess I’d say that’s probably a wash with the increased need for off-die communication because of the low density.
If you want to keep your upload sane, or be able to communicate with it, you’re also going to have to give it some kind of illusion of a body and some kind of illusion of a comprehensible and stimulating environment. That means simulating an unknown but probably large amount of non-brain biology (which isn’t necessarily identical between individuals), plus a not-inconsiderable amount of outside-world physics.
So take a GPT-4 level LLM as a baseline. Assume you want to speed up your upload to be able to fast-talk about as fast as the LLM can now, so that’s a wash. Now multiply by 1000 for the raw synapse count, by say 2 for the synapse diversity, by 5? for the continuous learning, by 2 for the extra synapse complexity, and by conservatively 10 for the hardware bandwidth bottlenecks. Add another 50 percent for the body, environment, etc.
So running your upload needs 300,000 times the processing power you need to run GPT-4, which I suspect is usually run on quad A100s (at maybe $100,000 per “inference machine”).
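For concreteness, here’s the multiplier stack above as a tiny sketch. Every factor is just the rough guess from the preceding paragraph, not a measured value:

```python
# Back-of-envelope multipliers from the paragraph above; all are rough guesses.
factors = {
    "raw synapse count vs. LLM weights": 1000,
    "synapse diversity": 2,
    "continuous learning": 5,
    "extra per-synapse complexity": 2,
    "hardware bandwidth bottlenecks": 10,
}

multiplier = 1
for factor in factors.values():
    multiplier *= factor

multiplier *= 1.5  # +50% for the body, environment, and outside-world physics

print(f"~{multiplier:,.0f}x a GPT-4-class workload")  # ~300,000x
```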
You can’t just spend 30 billion dollars and shove 1,200,000 A100s into a chassis; the power, cooling, and interconnect won’t scale (nor is there fab capacity to make them). If you packed them into a sphere at say 500 per cubic meter (which allows essentially zero space for cooling or interconnects, both of which get big fast), the sphere would be about 16 meters across and dissipate 300MW (with a speed of light delay from one side to the other of 50ns).
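And here’s the packing claim as the same sort of sketch. The per-GPU cost (~$25k) and power (~250 W) are assumptions chosen to be consistent with the $30B and 300MW figures above, not exact specs:

```python
import math

n_gpus        = 1_200_000  # 300,000 x the assumed quad-A100 "inference machine"
cost_per_gpu  = 25_000     # assumed; gives the ~$30B total above
watts_per_gpu = 250        # assumed; gives the ~300 MW total above
gpus_per_m3   = 500        # packing density assumed above (no room for cooling/interconnect)

cost_billion_usd = n_gpus * cost_per_gpu / 1e9   # ~30
power_mw         = n_gpus * watts_per_gpu / 1e6  # ~300
volume_m3        = n_gpus / gpus_per_m3          # 2,400 m^3

# Diameter of a sphere with that volume: V = (4/3) * pi * r^3
radius_m   = (3 * volume_m3 / (4 * math.pi)) ** (1 / 3)
diameter_m = 2 * radius_m                        # ~16.6 m

# One-way speed-of-light delay across the sphere
delay_ns = diameter_m / 3e8 * 1e9                # ~55 ns, i.e. the ~50 ns ballpark

print(f"${cost_billion_usd:.0f}B, {power_mw:.0f} MW, "
      f"{diameter_m:.1f} m across, ~{delay_ns:.0f} ns light delay")
```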
Improved chips help, but won’t save you. Moore’s law in “area form” is dead and continues to get deader. If you somehow restarted Moore’s law in its original, long-since-diverged-from form, and shrank at 1.5x in area per year for the next 17 years, you’d have transistors ten times smaller than atoms (and your power density would be, I don’t know, 100 times as high, leading to melted chips). And once you go off-die, you’re still using macroscopic wires or fibers for interconnect. Those aren’t shrinking… and I’m not sure the dies can get a lot bigger.
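The compounding in that hypothetical is easy to check; the 1.5x-per-year area scaling comes from the paragraph above, and the ~5 nm starting feature size is just an assumed round number:

```python
# Hypothetical "original form" Moore's law: transistor area shrinks 1.5x per year.
years = 17
area_shrink   = 1.5 ** years        # ~985x smaller in area
linear_shrink = area_shrink ** 0.5  # ~31x smaller in linear dimensions

feature_nm = 5.0                    # assumed present-day critical dimension (very rough)
print(f"area {area_shrink:.0f}x smaller, linear {linear_shrink:.0f}x smaller, "
      f"features ~{feature_nm / linear_shrink:.2f} nm")
# ~0.16 nm -- at or below atomic dimensions, depending on what you count as the feature.
```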
Switching to a completely different architecture the way I mentioned above might get back 10X or so, but doesn’t help with anything else as long as you’re building your system out of a fundamentally planar array of transistors. So you still have a 240 cubic meter, 30MW, order-of-3-billion-dollar machine, and if you get the topology wrong on the first try you get to throw it away and replace it. For one upload. That’s not very competitive with just putting 10 or even 100 people in an office.
Basically, to be able to use a bunch of uploads, you need to throw away all current computing technology and replace it with some kind of much more element-dense, much more interconnect-dense, and much less power-dense computing substrate. Something more brain-like, with a 3D structure. People have been trying to do that for decades and haven’t gotten anywhere; I don’t think it’s going to be manufactured in bulk by 2040.
… or you can try to trim the uploads themselves down by factors that end with multiple zeroes, without damaging them into uselessness. That strikes me as harder than doing the scanning… and it also strikes me as something you can’t make much progress on until you have mostly finished solving the scanning problem.
It’s not that you can’t get some kind of intelligence in realistic hardware. You might even be able to get something much smarter than a human. But you’re specifically asking to run a human upload, and that doesn’t look feasible.
Little nitpicks:
“Human brains have probably more than 1000 times as many synapses as current LLMs have weights.” → Can you elaborate? I thought the ratio was more like 100-200. (180-320T ÷ 1.7T)
“If you want to keep your upload sane, or be able to communicate with it, you’re also going to have to give it some kind of illusion of a body and some kind of illusion of a comprehensible and stimulating environment.” Seems like an overstatement. Humans can get injuries where they can’t move around or feel almost any of their body, and they sure aren’t happy about it, but they are neither insane nor unable to communicate.
My much bigger disagreement is a kind of philosophical one. In my mind, I’m thinking of the reverse-engineering route, so I think:
A = [the particular kinds of mathematical operations that support the learning algorithms (and other algorithms) that power human intelligence]
B = [the particular kinds of affordances enabled by biological neurons, glial cells, etc.]
C = [the particular kinds of affordances enabled by CPUs and GPUs]
You’re thinking of an A→B→C path, whereas I’m thinking that evolution did the A→B path and separately we would do the A→C path.
I think there’s a massive efficiency hit from the fact that GPUs and CPUs are a poor fit to many useful mathematical operations. But I also think there’s a massive efficiency hit from the fact that biological neurons are a poor fit to many useful mathematical operations.
So whereas you’re imagining brain neurons doing the basic useful operations, instead I’m imagining that the brain has lots of little machines involving collections of maybe dozens of neurons and hundreds of synapses assembled into a jerry-rigged contraption that does a single basic useful mathematical operation in an incredibly inefficient way, just because that particular operation happens to be the kind of thing that an individual biological neuron can’t do.
On the main point, I don’t think you can make those optimizations safely unless you really understand a huge amount of detail about what’s going on. Just being able to scan brains doesn’t give you any understanding, but at the same time it’s probably a prerequisite to getting a complete understanding. So you have to do the two relatively serially.
You might need help from superhuman AGI to even figure it out, and you might even have to be superhuman AGI to understand the result. Even if you don’t, it’s going to take you a long time, and the tests you’ll need to do if you want to “optimize stuff out” aren’t exactly risk free.
Basically, the more you deviate from just emulating the synapses you’ve found[1], and the more simplifications you let yourself make, the less it’s like an upload and the more it’s like a biology-inspired nonhuman AGI.
Also, I’m not so sure I see a reason to believe that those multicellular gadgets actually exist, except in the same way that you can find little motifs and subsystems that emerge, and even repeat, in plain old neural networks. If there are a vast number of them and they’re hard-coded, then you have to ask where. Your whole genome is only what, 4GB? Most of it used for other stuff. And it seems as though, from a developmental point of view, it’s a lot easier to code for minor variations on “build these 1000 gross functional areas, and within them more or less just have every cell send out dendrites all over the place and learn which connections work” than for “put this machine here and that machine there within this functional area”.
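As a rough sanity check on the “where would it be encoded” point, here’s the ballpark arithmetic; the genome and synapse figures are the usual round numbers (the ~4GB above corresponds to roughly a byte per base; at 2 bits per base the raw information content is under 1GB, and either way the point stands):

```python
# Ballpark: how much genomic information is available per synapse?
base_pairs = 3.1e9            # human genome, roughly
bits_total = base_pairs * 2   # 2 bits per base, ignoring compressibility and "other stuff"
gigabytes  = bits_total / 8 / 1e9
synapses   = 2e14             # a couple hundred trillion synapses, give or take

print(f"~{gigabytes:.1f} GB of raw sequence, ~{bits_total / synapses:.0e} bits per synapse")
# ~0.8 GB and ~3e-5 bits/synapse: far too little to place individual
# multi-neuron "machines"; most wiring has to come from generic rules plus learning.
```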
“Can you elaborate? I thought the ratio was more like 100-200. (180-320T ÷ 1.7T)”

I’m sorry; I was just plain off by a factor of 10 because apparently I can’t do even approximate division right.
“Humans can get injuries where they can’t move around or feel almost any of their body, and they sure aren’t happy about it, but they are neither insane nor unable to communicate.”

A fair point, with a few limitations. Not a lot of people are completely locked in with no high-bandwidth sensory experience, and I don’t think anybody’s quite sure what’s going on with the people who are. Vision and/or hearing are already going to be pretty hard to provide. But maybe not as hard as I’m making them out to be, if you’re willing to trace the connections all the way back to the sensory cells. Maybe you do just have to do the head. I am not gonna volunteer, though.
In the end, I’m still not buying that uploads have enough of a chance of being practical to run in a pre-FOOM timeframe to be worth spending time on, and I’m also pretty pessimistic about anything produced by any number of uploaded-or-not “alignment researchers” actually having much of a real impact on outcomes anyway. And I’m still very worried about a bunch of issues around the ethics and values of all concerned.
… and all of that’s assuming you could get the enormous resources to even try any of it.
By the way, I would have responded to these sooner, but apparently my algorithm for detecting them has bugs...
[1] … which may already be really hard to do correctly…
I have an important appointment this weekend that will take up most of my time, and I hope to come back to this after that, but I wanted to quickly note:
“…but definitely are not back propagation.”

Why?
Last time I looked into this, 6 years ago, it seemed like an open question, and it could plausibly be backprop, or at least close enough: https://www.lesswrong.com/posts/QWyYcjrXASQuRHqC5/brains-and-backprop-a-key-timeline-crux
3 years ago Daniel Kokotajlo shared some further updates in that direction: https://www.lesswrong.com/posts/QWyYcjrXASQuRHqC5/brains-and-backprop-a-key-timeline-crux?commentId=RvZAPmy6KStmzidPF
It’s possible that you (jacobjacob) and jbash are actually in agreement that (part of) the brain does something that is not literally backprop but “relevantly similar” to backprop—but you’re emphasizing the “relevantly similar” part and jbash is emphasizing the “not literally” part?
I think that’s likely correct. What I mean is that it’s not running all the way to the end of a network, computing a loss function at the end of a well defined inference cycle, computing a bunch of derivatives, etc… and also not doing anything like any of that mid-cycle. If you’re willing to accept a large class of feedback systems as “essentially back propagation”, then it depends on what’s in your class. And I surely don’t know what it’s actually doing.