I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we’ve seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like “code a simple video game” and “summarize movies with emojis”, they also include things like “break out of confinement and kill everyone”. It’s the latter capability, and not PTB performance, that you’d need to predict if you wanted to reliably stay out of the x-risk regime — and the fact that we can’t currently do so is, I imagine, what brought to mind the analogy between scaling and Russian roulette.
I.e., a straight line in domain X is indeed not surprising; what’s surprising is the way in which that straight line maps to the things we care about more than X.
(Usual caveats apply here that I may be misinterpreting folks, but that is my best read of the argument.)
I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we’ve seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance.
This is a reasonable thesis, and if indeed it’s the one Gwern intended, then I apologize for missing it!
That said, I have a few objections:
Isn’t it a bit suspicious that the thing-that’s-discontinuous is hard to measure, but the-thing-that’s-continuous isn’t? I mean, this isn’t totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.
“No one predicted X in advance” is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul’s worldview. But—and maybe I missed something—I didn’t see that. Did you?
There seems to be an implicit claim that Paul Christiano’s theory was falsified via failure to retrodict the data. But that’s weird, because much of the evidence being presented is that the previous trends were upheld (for example, with Gwern saying, “The impact of GPT-3 was in establishing that trendlines did continue...”). But if Paul’s worldview is that “we should extrapolate trends, generally” then that piece of evidence seems like a remarkable confirmation of his theory, not a disconfirmation.
Do we actually have strong evidence that the qualitative things being mentioned were discontinuous with respect to time? I can certainly see some things being discontinuous with past progress (like the ability for GPT-3 to do arithmetic). But overall I feel like I’m being asked to believe something quite strong about GPT-3 breaking trends without actual references to what progress really looked like in the past.
I don’t deny that you can find quite a few discontinuities on a variety of metrics, especially if you search for them post-hoc. I think it would be fairly strawmanish to say that people in Paul Christiano’s camp don’t expect those at all. My impression is that they just don’t expect those to be overwhelming in a way that makes reliable reference class forecasting qualitatively useless; it seems like extrapolating from the past still gives you a much better model than most available alternatives.
it seems like extrapolating from the past still gives you a much better model than most available alternatives.
My impression is that some people are impressed by GPT-3’s capabilities, whereas your response is “ok, but it’s part of the straight-line trend on Penn Treebank; maybe it’s a little ahead of schedule, but nothing to write home about.” But clearly you and they are focused on different metrics!
That is, suppose it’s the case that GPT-3 is the first successfully commercialized language model. (I think in order to make this literally true you have to throw on additional qualifiers that I’m not going to look up; pretend I did that.) So on a graph of “language model of type X revenue over time”, total revenue is static at 0 for a long time and then shortly after GPT-3’s creation departs from 0.
It seems like the fact that GPT-3 could be commercialized in this way when GPT-2 couldn’t is a result of something that Penn Treebank perplexity is sort of pointing at. (That is, it’d be hard to get a model with GPT-3’s commercializability but GPT-2’s Penn Treebank score.) But what we need in order for the straight line on PTB to be useful as a model for predicting revenue is to know ahead of time what PTB threshold you need for commercialization.
And so this is where the charge of irrelevancy is coming from: yes, you can draw straight lines, but they’re straight lines in the wrong variables. In the interesting variables (from the “what’s the broader situation?” worldview), we do see discontinuities, even if there are continuities in different variables.
[As an example of the sort of story that I’d want, imagine we drew the straight line of Elo ratings for Go-bots, had a horizontal line of “human professionals” on that graph, and were able to forecast the discontinuity in “number of AI wins against human grandmasters” by looking at straight-line forecasts in Elo.]
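The bracketed story can be made concrete with a toy calculation, since the standard Elo model converts a rating gap into an expected win probability: a straight-line rating forecast therefore implies a forecast of win rates against a fixed human level. All numbers below are invented for illustration, not real Go-bot ratings:

```python
import numpy as np

def elo_win_prob(d):
    """Expected score for a player rated d points above the opponent,
    under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

# Hypothetical data: year vs. best Go-bot Elo rating (invented numbers).
years = np.array([2008, 2010, 2012, 2014, 2016])
bot_elo = np.array([2100, 2400, 2700, 3000, 3300])
human_pro_elo = 3500  # the assumed horizontal "human professional" line

# Fit the straight line and find where it crosses the human level.
slope, intercept = np.polyfit(years, bot_elo, 1)
crossing_year = (human_pro_elo - intercept) / slope

# An Elo gap at a given year translates into a win probability, so the
# smooth rating trend predicts a sharp rise in "AI wins vs. grandmasters".
for year in (2014, 2016, 2018):
    gap = slope * year + intercept - human_pro_elo
    print(year, round(elo_win_prob(gap), 3))
```

With these made-up numbers the win probability climbs from about 5% in 2014 to about 64% in 2018, even though the rating trend itself is a perfectly straight line; that is the shape of forecast the bracketed example asks for.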
That is, suppose it’s the case that GPT-3 is the first successfully commercialized language model. (I think in order to make this literally true you have to throw on additional qualifiers that I’m not going to look up; pretend I did that.) So on a graph of “language model of type X revenue over time”, total revenue is static at 0 for a long time and then shortly after GPT-3’s creation departs from 0.
I think it’s the nature of every product that comes on the market that it will experience a discontinuity from having zero revenue to having some revenue at some point. It’s an interesting question of when that will happen, and maybe your point is simply that it’s hard to predict when that will happen if you just look at the Penn Treebank trend.
However, I suspect that the revenue curve will look pretty continuous, now that it’s gone from zero to one. Do you disagree?
In a world with continuous, gradual progress across a ton of metrics, you’re going to get discontinuities from zero to one. I don’t think anyone from the Paul camp disagrees with that (in fact, Katja Grace talked about this in her article). From the continuous takeoff perspective, these discontinuities don’t seem very relevant unless going from zero to one is very important in a qualitative sense. But I would contend that going from “no revenue” to “some revenue” is not actually that meaningful in the sense of distinguishing AI from the large class of other economic products that have gradual development curves.
your point is simply that it’s hard to predict when that will happen if you just look at the Penn Treebank trend.
This is a big part of my point; a smaller elaboration is that it can be easy to trick yourself into thinking that, because you understand what will happen with PTB, you’ll understand what will happen with economics/security/etc., when in fact you don’t have much understanding of the connection between those, and there might be significant discontinuities. [To be clear, I don’t have much understanding of this either; I wish I did!]
For example, I imagine that, by thirty years from now, we’ll have language/code models that can do significant security analysis of the code that was available in 2020, and that this would have been highly relevant/valuable to people in 2020 interested in computer security. But when will this happen in the 2020-2050 range that seems likely to me? I’m pretty uncertain, and I expect this to look a lot like ‘flicking a switch’ in retrospect, even though the leadup to flicking that switch will probably look like smoothly increasing capabilities on ‘toy’ problems.
[My current guess is that Paul / people in “Paul’s camp” would mostly agree with the previous paragraph, except for thinking that it’s sort of weird to focus on specifically AI computer security productivity, rather than the overall productivity of the computer security ecosystem, and this misplaced focus will generate the ‘flipping the switch’ impression. I think most of the disagreements are about ‘where to place the focus’, and this is one of the reasons it’s hard to find bets; it seems to me like Eliezer doesn’t care much about the lines Paul is drawing, and Paul doesn’t care much about the lines Eliezer is drawing.]
However, I suspect that the revenue curve will look pretty continuous, now that it’s gone from zero to one. Do you disagree?
I think I agree in a narrow sense and disagree in a broad sense. For this particular example, I expect OpenAI’s revenues from GPT-3 to look roughly continuous now that they’re selling/licensing it at all (until another major change happens; like, the introduction of a competitor would likely cause the revenue trend to change).
More generally, suppose we looked at something like “the total economic value of horses over the course of human history”. I think we would see mostly smooth trends plus some implied starting and stopping points for those trends. (Like, “first domestication of a horse” probably starts a positive trend, “invention of stirrups” probably starts another positive trend, “introduction of horses to America” starts another positive trend, “invention of the automobile” probably starts a negative trend that ends with “last horse that gets replaced by a tractor/car”.)
In my view, ‘understanding the world’ looks like having a causal model that you can imagine variations on (and have those imaginations be meaningfully grounded in reality), and I think the bits that are most useful for building that causal model are the starts and stops of the trends, rather than the smooth adoption curves or mostly steady equilibria in between. So it seems sort of backwards to me to say that for most of the time, most of the changes in the graph are smooth, because what I want out of the graph is to figure out the underlying generator, where the non-smooth bits are the most informative. The graph itself only seems useful as a means to that end, rather than an end in itself.
Isn’t it a bit suspicious that the thing-that’s-discontinuous is hard to measure, but the-thing-that’s-continuous isn’t? I mean, this isn’t totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.
I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I’m not sure I entirely agree that discontinuous capabilities are necessarily hard to measure: for example, there are benchmarks available for things like arithmetic, which one can train on and make quantitative statements about.
I think the key to the discontinuity question is rather that 1) model scaling happens in discrete jumps; and 2) everything is S-curves, and a discontinuity always has a linear regime if you zoom in enough. Those two things together mean that, while a capability like arithmetic might have a continuous performance regime on some domain, in reality you can find yourself halfway up the performance curve in a single scaling jump (and this is in fact what happened with arithmetic and GPT-3). So the risk, as I understand it, is that you end up surprisingly far up the scale of “world-ending” capability from one generation to the next, with no detectable warning shot beforehand.
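The interaction between a continuous S-curve and discrete scaling jumps can be sketched numerically. The logistic curve below is a toy model: its midpoint and steepness are made up for illustration, not fit to any real benchmark, and the parameter counts are only the rough GPT-2 and GPT-3 sizes:

```python
import math

def capability(log_params, midpoint=11.0, steepness=4.0):
    """Toy logistic (S-curve) for task performance as a function of
    log10(parameter count). Midpoint and steepness are invented."""
    return 1.0 / (1.0 + math.exp(-steepness * (log_params - midpoint)))

# Scaling happens in discrete jumps: roughly the GPT-2 -> GPT-3 step is
# ~1.5B parameters to ~175B parameters, about two orders of magnitude.
before = capability(math.log10(1.5e9))
after = capability(math.log10(1.75e11))

print(f"before jump: {before:.4f}, after jump: {after:.4f}")
# The curve itself is perfectly continuous, yet a single scaling step
# lands you most of the way up it: the "no warning shot" concern.
```

Under these assumed curve parameters, one jump takes the toy capability from well under 1% to roughly 73%, which is the sense in which a smooth curve sampled at discrete scales can still look like a discontinuity.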
“No one predicted X in advance” is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul’s worldview. But—and maybe I missed something—I didn’t see that. Did you?
No, you’re right as far as I know; at least I’m not aware of any such attempted predictions. And in fact, the very absence of such prediction attempts is interesting in itself. One would imagine that correctly predicting the capabilities of an AI from its scale ought to be a phenomenally valuable skill — not just from a safety standpoint, but from an economic one too. So why, indeed, didn’t we see people make such predictions, or at least try to?
There could be several reasons. For example, perhaps Paul (and other folks who subscribe to the “continuum” world-model) could have done it, but they were unaware of the enormous value of their predictive abilities. That seems implausible, so let’s assume they knew the value of such predictions would be huge. But if you know the value of doing something is huge, why aren’t you doing it? Well, if you’re rational, there’s only one reason: you aren’t doing it because it’s too hard, or otherwise too expensive compared to your alternatives. So we are forced to conclude that this world-model — by its own implied self-assessment — has, so far, proved inadequate to generate predictions about the kinds of capabilities we really care about.
(Note: you could make the argument that OpenAI did make such a prediction, in the approximate yet very strong sense that they bet big on a meaningful increase in aggregate capabilities from scale, and won. You could also make the argument that Paul, having been at OpenAI during the critical period, deserves some credit for that decision. I’m not aware of Paul ever making this argument, but if made, it would be a point in favor of such a view and against my argument above.)