That transition only happens once—there are not ever more and more levels of universality or linguistic programmability.
Why do you think this? (I’m unconvinced by “universal learning machine” type things that I’ve seen, not because I disagree, but because they don’t seem to address transitions within the shape of what stuff is learned and how it interacts.)
There are NNs that train for a lifetime then die, and there are NNs that train for a lifetime but then network together to share all their knowledge before dying. There are not ever more levels to that.
There are Turing-universal computational systems (which are all equivalent in their universal ability to simulate other systems), and there are non-universal computational systems. There are no more levels to that.
But crucially, humans do not share all their knowledge. Every time a great scientist or engineer or manager or artist dies, a ton of intuition and skills and illegible knowledge dies with them. What is passed on is only what can be easily compressed into the extremely lossy channels of language.
As the saying goes, “humans are as stupid as they can be while still undergoing intelligence-driven takeoff at all”; otherwise humans would have taken over the world sooner. That applies to knowledge sharing in particular—our language channels are just barely good enough to take off.
Even just the ability to copy a mind would push AIs far further along the same direction. Ability to merge minds would go far further still.
Edit: Of course humans do not share all their knowledge, and the cultural transition is obviously graded, in the sense that the evolutionary stages of early language, writing, the printing press, computers, the internet, etc. gradually improve the externalized network connectivity and storage of our cybernetic civilization. But by the time of AGI that transition is already very well along, such that all we are really losing (as you point out, and I agree) is the ton of intuitions/skills/knowledge that dies with the decay of human brains; we already externalize much of our most important knowledge. Nonetheless, ending that tragedy is our great common cause.
I agree that substrate independence is one of the great advantages of digital minds, along with speed.
But there are some fundamental tradeoffs:
You can use GPUs (von Neumann designs), which separate memory and logic. They are much, much slower in the sense that they take many cycles to simulate one cycle of a large ANN, and they waste much energy shuffling the weights around the chip from memory to logic.

Or you can use neuromorphic computers, which combine memory and logic. They are potentially enormously faster, as they can simulate one cycle of a large ANN per clock cycle, but they are constrained to more brain-like designs and thus optimized for low circuit depth but larger circuits (cheap circuitry). For the greatest cheap circuit density, energy efficiency, and speed you need analog synapses, but in doing so you basically give up the ability to easily transfer the knowledge out of the system; it becomes more ‘mortal’, as Hinton recently argues.
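A rough back-of-envelope sketch of the energy tradeoff just described (every constant below is an assumed, illustrative order-of-magnitude figure, not a measurement of any real GPU or neuromorphic chip):

```python
# Illustrative only: compare energy per forward pass when weights must be shuffled
# from memory to logic (von Neumann) vs. computed in place (neuromorphic).
PARAMS = 1e11                # weights in a hypothetical large ANN (assumed)
BYTES_PER_WEIGHT = 2         # fp16 storage on the von Neumann side (assumed)

E_DRAM_PER_BYTE = 10e-12     # ~10 pJ to move one byte from memory to logic (assumed)
E_DIGITAL_MAC = 1e-12        # ~1 pJ per digital multiply-accumulate (assumed)
E_ANALOG_MAC = 0.05e-12      # ~0.05 pJ per in-memory analog MAC (assumed)

# von Neumann: every forward pass moves every weight from memory to the logic units.
von_neumann_joules = PARAMS * (BYTES_PER_WEIGHT * E_DRAM_PER_BYTE + E_DIGITAL_MAC)

# Compute-in-memory: the multiply-accumulate happens where the weight is stored,
# so the data-movement term largely disappears.
neuromorphic_joules = PARAMS * E_ANALOG_MAC

print(f"von Neumann:  ~{von_neumann_joules:.2f} J per forward pass")
print(f"neuromorphic: ~{neuromorphic_joules:.4f} J per forward pass")
print(f"ratio:        ~{von_neumann_joules / neuromorphic_joules:.0f}x")
```

Whatever the exact constants, the weight-shuffling term is paid on every pass in the von Neumann design and is simply absent from the in-memory one; that is the whole tradeoff.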
This seems like a small tradeoff, and not a big enough deal to restore these systems to anything like human mortality, with all its enormous global effects. It may be much harder to copy weights off an idiosyncratic mess of analogue circuits modified in-place by their training to maximize energy efficiency than it is to run cp foo.pkl bar.pkl, absolutely, but the increase in difficulty here seems more on par with ‘a small sub-field with a few hundred grad students/engineers for a few years’ than ‘the creation of AGI’, and so one can assume it’d be solved almost immediately should it ever actually become a problem.

For example, even if the system is ultra-miniaturized, you can tap connections to optionally read off activations between many pairs of layers, which affects only a small part of it and does not eliminate the miniaturization or energy savings; and with the layer embeddings summarizing a group of layers, you can now do knowledge distillation to another such neuromorphic computer (or a smaller one). Knowledge distillation, or rather self-distillation, costs little and works well. Or, since you can presumably set the analogue values even if you can’t read them, once you have a model worth copying you can pay the one-time cost to distill it out to a more conventional von Neumann computer, one where you can easily read the weights out, and thence copy it onto all the other neuromorphic chips henceforth. Or you can reverse-engineer the weights themselves: probe the original and the copy with synthetic data, flipping a bit at a time to run finite differences on outputs like activations/embeddings, starting at the lowest available tap, to eventually reconstruct the equivalent weights group by group. (This may require lots of probes, but these systems by definition run extremely fast, and since you’re only probing a small part at a time, they run even faster than that.) That’s just off the cuff, and I’m sure you could think of several better approaches if you tried. So I don’t expect ‘mortal’ NNs to be all that different from our current ‘immortal’ NNs, or from things like FPGAs.
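A minimal sketch of the tap-and-distill idea above, with a toy stand-in: the teacher network below plays the role of the analog system whose weights cannot be read directly, only its tapped activations (nothing here is a real neuromorphic interface):

```python
import torch
import torch.nn as nn

class TappedNet(nn.Module):
    """Toy network exposing activations at tap points between blocks."""
    def __init__(self, dims=(32, 64, 64, 10)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(a, b), nn.ReLU()) for a, b in zip(dims, dims[1:])]
        )

    def forward(self, x):
        taps = []
        for block in self.blocks:
            x = block(x)
            taps.append(x)        # what a hardware tap would read off
        return taps               # the last entry doubles as the final output

teacher = TappedNet()             # stand-in for the un-copyable analog system
for p in teacher.parameters():
    p.requires_grad_(False)       # we may query it, but never read its weights

student = TappedNet()             # fresh network on copyable hardware
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(256, 32)      # synthetic probe inputs; real data also works
    with torch.no_grad():
        teacher_taps = teacher(x)
    student_taps = student(x)
    # Match every tapped activation group by group, not just the final output.
    loss = sum(nn.functional.mse_loss(s, t)
               for s, t in zip(student_taps, teacher_taps))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The same loop covers the other directions mentioned: distilling onto another neuromorphic device (if its analog values can be set) or onto a smaller student.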
Largely agreed, which is partly why I said only more ‘mortal’, with ‘mortal’ in scare quotes. Or put another way: even the full neuromorphic analog route isn’t as problematic to copy weights out of as an actual brain, and I expect actual uploading to be possible eventually, so it’s mostly a matter of copy speeds and expenses, as you point out, and for the most hardcore analog neuromorphic designs, like brains, you can still exploit sophisticated distillation techniques as you discuss. But it does look like there are tradeoffs that increase the copy-out cost as you move to the most advanced neuromorphic designs.
This whole thing is just a thought experiment, correct? “What we would have to do to mimic the brain’s energy efficiency.” Because analog synapses, where we left off (a network of analog gates to connect any given synapse to an ADC, something that current prototype analog inference accelerators use, and analog FPGAs do exist), are kinda awful.
The reason is this paper: https://openai.com/research/emergent-tool-use . What they found is that you want to make the Bayesian updates to your agent’s policy in large batches. That means you need to be able to copy the policy many times across a fleet of hardware running separate agents, and learn the expected value and errors of the given policy across a large batch of episodes. The copying requires precise reading of the values, so they need to be binary, and there is no benefit from modifying the policy rapidly in real time.

The reason we have brains that learn rapidly in real time, overfitting to a small number of strong examples, is that this was all that was possible with the hardware nature could evolve. It is suboptimal.
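For concreteness, a toy sketch of that loop: copy the current policy out to a fleet of workers, gather a large batch of episodes, then make one pooled update. This is a generic REINFORCE-style illustration on a dummy environment, assumed purely for illustration; it is not the setup from the linked paper, which ran at far larger scale.

```python
import copy
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def rollout(frozen_policy, steps=50):
    """One worker running one episode with a frozen copy of the policy weights."""
    obs_buf, act_buf, total_reward = [], [], 0.0
    obs = torch.randn(4)                         # dummy observation
    with torch.no_grad():
        for _ in range(steps):
            dist = torch.distributions.Categorical(logits=frozen_policy(obs))
            act = dist.sample()
            obs_buf.append(obs)
            act_buf.append(act)
            total_reward += float(act.item() == 0)   # dummy reward signal
            obs = torch.randn(4)
    return torch.stack(obs_buf), torch.stack(act_buf), total_reward

for update in range(100):
    # "Copy the policy across the fleet": identical frozen copies for each worker
    # (deepcopies in one process here; separate machines in practice).
    fleet = [copy.deepcopy(policy) for _ in range(64)]
    episodes = [rollout(worker) for worker in fleet]

    # One large-batch update against the pooled experience of the whole fleet.
    returns = torch.tensor([ep[2] for ep in episodes])
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = torch.tensor(0.0)
    for (obs, acts, _), adv in zip(episodes, advantages):
        dist = torch.distributions.Categorical(logits=policy(obs))
        loss = loss - dist.log_prob(acts).sum() * adv
    loss = loss / len(episodes)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Copying the frozen weights exactly is what lets all 64 rollouts be treated as samples of the same policy; that is the step an analog, hard-to-read substrate would make awkward.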
I think Jake is right that we shouldn’t imagine an unlimited set of levels of learning. I do, however, think that there are one or two more levels beyond self-learning and cultural transmission. The next level (which could maybe be described as two levels) is not something that evolution has managed in any mammalian species:
1. Take an existing brain which has filled most of its learning capacity and is beginning to plateau in skill-gain-from-experience, and add significantly more capacity.

2. Make significant architectural changes involving substantial change to long-distance wiring. For example, rewiring half of my visual cortex to instead be part of my mathematical reasoning module.

Both of these are sort of examples of plasticity/editability. I expect that if we had the ability to do either one of these to a human (e.g. via brain-computer interface), then you could turn a below-average-IQ human into an impressively skilled mathematician. And you could turn an impressively skilled mathematician into the greatest math genius in the history of the human race. If I am correct about this, then I think it is fair to consider this a fundamentally different level than cultural knowledge transmission.
(Copied from another comment) Nathan points out increasing size; and large scale / connective plasticity. Another one would be full reflectivity: introspection and self-reprogramming. Another one would be the ability to copy chunks of code and A/B test them as they function in the whole agent. I don’t get why Jacob is so confident that these sorts of things aren’t major and/or that there aren’t more of them than we’ve thought of.
But why do you think that? It seems like the methods of science, and mathematical insights, both enhance intelligence qualitatively.
I think this is partially a matter of ontological taste. I mean, you are obviously correct that many innovations coming after the transition the author is interested in seem to produce qualitative shifts in the collective intelligence of humanity. On the other hand, if you take the view that all of these are fundamentally enabled by that first transition, then it seems reasonable to treat that as special in a way that the other innovations are not.
I suppose where the rubber meets the road, if one grants both the special status of the transition to universal cultural learning and that other kinds of innovation can lead to qualitative shifts in collective intelligence, is whether or not further innovations of the second kind can still play the role that foom is supposed to play in EY’s argument (I take Nathan Helm-Burger’s comment to be one argument that such innovations can play this role).
I don’t necessarily care too much about which ones are “special” or “qualitative”, though I did say qualitative. The practical question at hand is how much more intelligence you can pack into given compute, and how quickly you can get there. If a mathematical insight allows you to write code that’s shorter, runs significantly faster with lower memory requirements, and gives outputs that are more effective, then we’ve answered most of the practical question. History seems chock full of such things.
But yeah I also agree that there’s other more “writ large” sorts of transitions.
Nathan points out large scale / connective plasticity. Another one would be full reflectivity: introspection and self-reprogramming. Another one would be the ability to copy chunks of code and A/B test them as they function in the whole agent. I don’t get why Jacob is so confident that these sorts of things aren’t major and/or that there aren’t more of them than we’ve thought of.
At the risk of going round in circles, you begin your post by saying you don’t care which ones are special or qualitative, and end it by wondering why the author is confident certain kinds of transition are not “major”. Is this term, like the others, just standing in for ‘significant enough to play a certain kind of role in an “AI leads to doom” argument’? Or does it mean something else?
I get the impression that you want to avoid too much wrangling over which labels should be applied to which kinds of thing, but then, you brought up the worry about the original post, so I don’t quite know what your point is.
It just means specific innovations that produce especially big increases in intelligence. But I think that lots of innovations, such as mathematical ideas, produce big increases in intelligence.
Okay, sure. If my impression of the original post is right, the author would not disagree with you, but would rather claim that there is an important distinction to be made among these innovations. Namely, one of them is the 0-1 transition to universality, and the others are not. So, do you disagree that such a distinction may be important at all, or merely that it is not a distinction that supports the argument made in the original post?
It would be a large, broad increase in intelligence. There may be other large broad increases in intelligence. I think there are also other large narrow increases, and small broad increases. Jacob seems to be claiming that there aren’t further large increases to be had. I think the transition to universality is pretty vague. Wouldn’t increasing memory capacity also be a sort of increase in universality?
I have to say I agree that there is vagueness in the transition to universality. That is hardly surprising seeing as it is a confusing and contentious subject that involves integrating perspectives on a number of other confusing and contentious subjects (language, biological evolution, cultural evolution, collective intelligence etc...). However, despite the vagueness, I personally still see this transition, from being unable to accrete cultural innovations to being able to do so, as a special one, different in kind from particular technologies that have been invented since.
Perhaps another way to put it is that the transition seems to bestow on us, as a collective, a meta-ability to obtain new abilities (or increased intelligence, as you put it), that we previously lacked. It is true that there are particular new abilities that are particularly valuable, but there may not be any further meta-abilities to obtain.
Just so we aren’t speaking past each other. Do you get what I am saying here? Even if you disagree that this is relevant, which may be reasonable, does the distinction I am driving at even make sense to you, or still not?
No, I don’t see a real distinction here. If you increase skull size, you increase the rate at which new abilities are invented and combined. If you come up with a mathematical idea, you advance a whole swath of ability-seeking searches. I listed some other things that increase meta-ability. What’s the distinction between various things that hit back to the meta-level?
There is an enormous difference between “increase skull size”, when we are already well into diminishing returns for brain size given only ~1e9 seconds of training data, and an improvement that allows compressing knowledge, externalizing it, and sharing it permanently to train new minds.
After that cultural transition, each new mind can train on the compressed summary experiences of all previous minds of the tribe/nation/civilization. You go from having only ~1e9 seconds of training data that is thrown away when each individual dies, to having an effective training dataset that scales with the total integrated population over time. It is a radical shift to a fundamentally new scaling equation, and that is why it is a metasystems transition, whereas increasing skull size is not.
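A toy rendering of that shift in the scaling equation; every constant below is an illustrative assumption, not a claim about actual human data budgets:

```python
SECONDS_PER_LIFETIME = 1e9        # the ~1e9 s of individual experience cited above

def solo_regime(generation):
    """Pre-cultural regime: each new mind trains only on its own lifetime."""
    return SECONDS_PER_LIFETIME   # constant, no matter how many minds came before

def cultural_regime(generation, population=1e6, compression=1e-3):
    """Post-transition regime: each new mind also trains on a lossily
    compressed summary of every prior mind's experience."""
    inherited = generation * population * SECONDS_PER_LIFETIME * compression
    return SECONDS_PER_LIFETIME + inherited

for g in (1, 10, 100, 1000):
    print(f"generation {g:>4}: solo {solo_regime(g):.1e} s-equivalent, "
          f"cultural {cultural_regime(g):.1e} s-equivalent")
```

The point is only the functional form: one curve is flat per individual, while the other grows with integrated population over time.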
Increasing skull size would also let you have much larger working memory, have multiple trains of thought but still with high interconnect, etc., which would let you work on problems that are too hard to fit in one normal human’s working memory.
I simply don’t buy the training data limit. You have infinite free training data from internal events, aka math.
More zoomed out, I still haven’t seen you argue why there aren’t more shifts that change the scaling equation. (I’ve listed some that I think would do so.)
The distinction is that without the initial 0-1 phase transition, none of the other stuff is possible. They are all instances of cumulative cultural accretion, whereas the transition constitutes entering the regime of cumulative cultural accretion (other biological organisms and extant AI systems are not in this regime). If I understand the author correctly, the creation of AGI will increase the pace of cumulative cultural accretion, but will not lead us (or them) to exit that regime (since, according to the point about universality, there is no further regime).
I think this answer also applies to the other comment you made, for what it’s worth. It would take me more time than I am willing to spend to make a cogent case for this here, so I will leave the discussion for now.
Ok. I think you’re confused though; other things we’ve discussed are pretty much as 0 to 1 as cultural accumulation.
Innovations that unlock a broad swath of further abilities could be called “qualitatively more intelligent”. But 1. things that seem “narrow”, such as many math ideas, are qualitative increases in intelligence in this sense; and 2. there’s a lot of innovations that sure seem to obviously be qualitative increases.