I have some comments on things Eliezer has said. I don’t expect these disagreements to be very important to the main questions, since I tend to agree with him overall despite them.
my worldview also permits but does not mandate that you get up to the chimp level [...]
A naïve model of intelligence is a linear axis that puts everything on a simple trendline with one-dimensional distances. I assume most people here understand that this is a gross oversimplification. Intelligence is made of multiple pieces, each of which can have unique strengths. There is still such a thing as generality of intelligence, in the sense that at some point you have enough tools to dynamically apply your reasoning to a great many more things than those tools were originally adapted for. This ability does seem to come in degrees, in that a human is more general than a chimp, which is more general than a mouse, though it also seems to be fairly sharp-edged, in that the difference in generality between a human and a chimp seems much greater than between a chimp and a mouse.
Because of the great differences between computer systems and biological ones, the individual components of computer intelligence (whether necessary for generality or not), when measured relative to a human, tend to jump quickly between zero, when the program doesn’t have that ability, and effectively infinite, when it does. There is also a large web of relations between capabilities whereby one ability can substitute for another, typically at the cost of some large factor reduction in performance.
A traditional chess engine has several component skills, like searching moves and evaluating millions of positions per second, that it does vastly better than a human. This ability feeds down into some metrics of positional understanding. Positional understanding is not a particularly strong fundamental ability of a traditional chess engine; rather, it is something inefficiently paid for with the other skills the engine has in excess. The same idea holds for human intelligence, where we use our more fundamental evolved skills, like object recognition, to build more complex skills. Because we have a broad array of baseline skills, and enough tools to combine them to fit novel tasks, we can solve a much wider array of tasks, and can transfer between domains at generally less cost than computers can. Nonetheless, there exist cognitive tasks we know can be done well that are outside of human mental capability.
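To make the substitution point concrete, here is a minimal sketch of the kind of trade I mean, written with the third-party python-chess package (my own toy illustration, not anything from the discussion, and nothing like a real engine): a crude material count plus brute-force lookahead stands in for positional understanding the program does not natively have.

```python
import chess  # third-party package: pip install python-chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board: chess.Board) -> int:
    """Material balance from the side to move's point of view; no positional terms at all."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == board.turn else -value
    return score

def negamax(board: chess.Board, depth: int) -> int:
    """Raw lookahead: positional judgment is bought with the search the engine has in excess."""
    if board.is_checkmate():
        return -10_000 - depth            # side to move is mated; prefer faster mates
    if board.is_game_over():
        return 0                          # stalemate or other draw
    if depth == 0:
        return material(board)
    best = -1_000_000
    for move in board.legal_moves:
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def pick_move(board: chess.Board, depth: int = 3) -> chess.Move:
    """Pick the move with the best negamax score; depth is the number of plies searched."""
    best_move, best_score = None, -1_000_000
    for move in board.legal_moves:
        board.push(move)
        score = -negamax(board, depth - 1)
        board.pop()
        if score > best_score:
            best_move, best_score = move, score
    return best_move

# From the start position every move scores 0 at this depth, so it just returns
# the first legal move: no positional preference exists until deeper search
# converts raw lookahead into something that looks like positional play.
print(pick_move(chess.Board()))
```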
When I envision AI progress leading up to AGI, I don’t think of a single figure of merit that increases uniformly. I think of a set of capabilities, of which some are ~0, some are ~∞, and others are derived quantities not explicitly coded in. Scale advances in NNs push the effective infinities to greater effective infinities, and by extension push up the derived quantities across the board. Fundamental algorithmic advances expand the set of capabilities at ~∞. At some point I expect the combination of fundamental and derived quantities to capture enough facets of cognition to push generality past a tipping point. In the run-up to that point, lesser levels of generality will likely make AI systems applicable to more and more extensions of their primary domains.
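As a deliberately crude toy of that picture (entirely my own framing, with invented numbers), the sketch below shows how scaling up lifts everything derived from a nonzero fundamental while leaving the zeros alone, and how an algorithmic advance is what takes a zero to something nonzero:

```python
# Toy model only: all capability names, numbers, and cost factors are made up.
fundamentals = {
    "sequence_prediction": 1e6,    # effectively infinite relative to need
    "exact_search": 1e9,
    "long_horizon_planning": 0.0,  # absent: scale alone derives nothing from it
}

# Derived capabilities are bought from a fundamental at a large cost factor.
derived_costs = {
    "code_completion": ("sequence_prediction", 1e3),
    "theorem_search": ("exact_search", 1e5),
    "running_a_company": ("long_horizon_planning", 1e2),
}

def derived(name: str) -> float:
    base, cost = derived_costs[name]
    return fundamentals[base] / cost

print({k: derived(k) for k in derived_costs})

# "Scale advance": every nonzero fundamental grows, and everything derived from
# it grows too -- but the zeros stay zero.
for k in fundamentals:
    fundamentals[k] *= 100
print({k: derived(k) for k in derived_costs})

# "Algorithmic advance": a new fundamental appears, and a capability that was
# stuck at zero becomes available in one jump.
fundamentals["long_horizon_planning"] = 1e4
print({k: derived(k) for k in derived_costs})
```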
This seems to me like it’s mostly, if not totally, a literal description of the world. Yet, to finally get to the point, nowhere in my map do I have a clear interpretation of what “get up to the chimp level” means. The degree to which chimps are generally intelligent seems very specific to the base skill set that chimps have, and it seems much more likely than not that AI will approach generality from a completely different angle, because its base skill set is completely different and will generalize in a completely different way. The comment that “chimps are not very valuable” does not seem to map onto any relevant comment about pre-explosion AI. I do not know what it would mean to have a chimp-level AI, or even chimp-level generality.
I would not be terribly surprised to find that results on benchmarks continue according to graph, and yet, GPT-4 somehow does not seem very much smarter than GPT-3 in conversation.
I would be quite surprised for a similar improvement in perplexity not to correspond to at least a similar improvement in apparent smartness to the one GPT-3 showed over GPT-2.
I would not be surprised for the perplexity improvement to level off, maybe not immediately but within some small number of generations, as it seems entirely reasonable that there are some aspects of cognition that GPT-style models can’t purchase. But for perplexity to improve on cue without an apparent improvement in intelligence, while logically coherent, would imply some very weird things about either the entropy of language or the scaling of model capacity.
That is, either language has a bunch of easy-to-access regularity between what GPT-3 reached and what an agent with a more advanced understanding of the world could reach, distributed coincidentally in line with previous capability increases; or GPT-3 roughly caps out the semantic capabilities of the architecture, but extra parameters added on top are still practically just as effective, extracting a huge number of more marginal, non-semantic regularities at a rate fast enough to compete with prior model scale increases that bought both.
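To put rough numbers on this, the sketch below uses the parameters-only power law from Kaplan et al. (2020) as a stand-in for how loss (and hence perplexity) might move with scale; the constants are the paper’s published fits, and the model sizes are ballpark assumptions rather than measured GPT results.

```python
import math

# Parameters-only scaling law from Kaplan et al. (2020): L(N) ~ (N_c / N)^alpha,
# with loss in nats per token. Treat the outputs as rough illustrations only.
ALPHA = 0.076
N_C = 8.8e13  # fitted constant for non-embedding parameters

def loss(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA

def perplexity(n_params: float) -> float:
    return math.exp(loss(n_params))

for name, n in [("GPT-2 (1.5B)", 1.5e9),
                ("GPT-3 (175B)", 175e9),
                ("~100x GPT-3 (17.5T)", 17.5e12)]:
    print(f"{name:>20}: loss ≈ {loss(n):.2f} nats/token, perplexity ≈ {perplexity(n):.1f}")

# The point: each comparable drop in loss has to be spent on *some* regularity
# in text; the weird scenario is one where the next drop buys only shallow,
# non-semantic regularities when the previous one visibly bought semantic ones.
```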
Stuff coming uncorrelated that way, sounds like some of the history I lived through, where people managed to make the graphs of Moore’s Law seem to look steady by rejiggering the axes, and yet, between 1990 and 2000 home computers got a whole lot faster, and between 2010 and 2020 they did not.
There is truth to this comment, in that Dennard scaling broke down in the mid-2000s, but the 2010s were deceptive: home computers stagnated in performance despite Moore’s Law improvements, because Intel, which sold you your CPUs, was stuck on an old node; being stuck on an old node prevented them from pushing out new architectures, and their near-monopoly meant they never really needed to compete on price or core count either.
But Moore’s Law did continue, just through TSMC, and the corresponding performance improvements were felt primarily in GPUs and mobile SoCs, both of which have improved at a great pace. In the last three years competition has returned to the desktop CPU market, and Intel has just managed to get out of their node crisis, so CPU performance really is picking up steam again. This is true for per-core performance, driven by architectures making use of the great many transistors now available, and even more so for aggregate performance, what with average core counts in the Steam Survey increasing from 3.0 in early 2017 to 5.0 in April this year.
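As a back-of-the-envelope check on that last figure (my own arithmetic, and I am assuming “April this year” lands roughly four years after early 2017):

```python
# Compound annual growth implied by the Steam Survey averages cited above.
years = 4.0                        # early 2017 -> April of "this year" (assumed span)
start_cores, end_cores = 3.0, 5.0
core_growth = (end_cores / start_cores) ** (1 / years) - 1
print(f"average core count: ~{core_growth:.0%} per year")   # ~14%/year

# For comparison, a 2.5-year transistor-count doubling cadence (a common rough
# reading of Moore's Law in this period; my assumption, not a quoted figure):
transistor_growth = 2 ** (1 / 2.5) - 1
print(f"transistors at a 2.5-year doubling: ~{transistor_growth:.0%} per year")
```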
You are correct that the scaling regimes are different now, and Dennard scaling really is dead for good, but if you look back at the original Moore’s Law graphs from 1965, they never mentioned frequency, so I don’t buy the claim that the graphs have been rejigged.