perhaps because GPT-3 was maybe a discontinuity for language models.
??? Wasn’t GPT-3 the result of people at OpenAI saying “huh, looks like language models scale according to such-and-such law, let’s see if that continues to hold”, and that law did continue to hold? Seems like an almost central example of continuous progress if you’re evaluating by typical language model metrics like perplexity.
Seems like an almost central example of continuous progress if you’re evaluating by typical language model metrics like perplexity.
I think we should determine whether GPT-3 is an example of continuous progress in perplexity based on the extent to which it lowered the SOTA perplexity (on huge internet-text corpora) and on its wall-clock training time. I don’t see why the correctness of a certain scaling law or the researchers’ beliefs/motivation should affect this determination.
I agree I haven’t filled in all the details to argue for continuous progress (mostly because I don’t know the exact numbers), but when you get better results by investing more resources to push forward on a predicted scaling law, then any discontinuity comes from a discontinuity in resource investment, which feels quite different from a technological discontinuity (e.g. we can model resource investment directly and see that a discontinuity there is unlikely). This was the case with AlphaGo, for example.
Separately, I also predict GPT-3 was not an example of discontinuity on perplexity, because it did not constitute a discontinuity in resource investment. (There may have been a discontinuity from resource investment in language models earlier in 2018-19, though I would guess even that wasn’t the case.)
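To make “pushing forward on a predicted scaling law” concrete, here is a minimal sketch of the kind of extrapolation involved; the compute and loss numbers are invented purely for illustration and are not taken from the GPT-3 paper or any published scaling-law fit.

```python
# Minimal sketch: fit a power law loss ~ a * compute^b on smaller runs,
# then extrapolate to a larger compute budget. All numbers are hypothetical.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])  # hypothetical training FLOPs
loss = np.array([3.9, 3.4, 3.0, 2.6])         # hypothetical validation losses

# A power law is a straight line in log-log space; b comes out negative
# because loss falls as compute grows.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)

target = 1e23  # a 100x larger (hypothetical) compute budget
print(f"exponent b = {b:.3f}; predicted loss at {target:.0e} FLOPs = {a * target**b:.2f}")
```

The point is that, if the fit holds, the bigger model’s loss is roughly known before it is trained, so the improvement is the continuation of a trend on that metric rather than a jump.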
Several of the discontinuities in the AI Impacts investigation were the result of discontinuities in resource investment, IIRC.
I think Ajeya’s report mostly assumes, rather than argues, that there won’t be a discontinuity of resource investment. Maybe I’m forgetting something but I don’t remember her analyzing the different major actors to see if any of them has shown signs of secretly running a Manhattan project or being open to doing so in the future.
Also, discontinuous progress is systematically easier than both of you in this conversation make it sound: The process is not “Choose a particular advancement (GPT-3), identify the unique task or dimension which it is making progress on, and then see whether or not it was a discontinuity on the historical trend for that task/dimension.” There is no one task or dimension that matters; rather, any “strategically significant” dimension matters. Maybe GPT-3 isn’t a discontinuity in perplexity, but is still a discontinuity in reasoning ability or common-sense understanding or wordsmithing or code-writing.
(To be clear, I agree with you that GPT-3 probably isn’t a discontinuity in any strategically significant dimension, for exactly the reasons you give: GPT-3 seems to be just continuing a trend set by the earlier GPTs, including the resource-investment trend.)
Maybe GPT-3 isn’t a discontinuity in perplexity, but is still a discontinuity in reasoning ability or common-sense understanding or wordsmithing or code-writing.
I was disagreeing with this statement in the OP:
GPT-3 was maybe a discontinuity for language models.
I agree that it “could have been” a discontinuity on those other metrics, and my argument doesn’t apply there. I wasn’t claiming it would.
I think Ajeya’s report mostly assumes, rather than argues, that there won’t be a discontinuity of resource investment. Maybe I’m forgetting something but I don’t remember her analyzing the different major actors to see if any of them has shown signs of secretly running a Manhattan project or being open to doing so in the future.
It doesn’t argue for it explicitly, but if you look at the section and the corresponding appendix, it just seems pretty infeasible for there to be a large discontinuity: a Manhattan project in the US that had been going on for the last 5 years and finished tomorrow would cost ~$1T, while current projects cost ~$100M, and 4 orders of magnitude at the pace of the “AI and Compute” trend would be a discontinuity of slightly under 4 years. That wouldn’t be a large/robust discontinuity according to the AI Impacts methodology, and I think it wouldn’t even be picked up as a “small” discontinuity?
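As a quick sanity check of that arithmetic, treating the dollar figures as a proxy for compute and assuming the roughly 3.4-month doubling time from OpenAI’s “AI and Compute” trend:

```python
# Back-of-the-envelope: how many years of the "AI and Compute" trend does a
# jump from ~$100M projects to a ~$1T project correspond to?
import math

orders_of_magnitude = math.log10(1e12 / 1e8)      # ~$1T vs ~$100M -> 4 OOM
doublings = orders_of_magnitude * math.log2(10)   # ~13.3 doublings
doubling_time_months = 3.4                        # assumed trend doubling time
years = doublings * doubling_time_months / 12
print(f"{orders_of_magnitude:.0f} OOM = {doublings:.1f} doublings = {years:.1f} years")
# Prints roughly: 4 OOM = 13.3 doublings = 3.8 years, i.e. slightly under 4 years.
```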
Several of the discontinuities in the AI Impacts investigation were the result of discontinuities in resource investment, IIRC.
I didn’t claim otherwise? I’m just claiming you should distinguish between them.
If anything, this would make me update toward thinking that discontinuities in AI are less likely, given that I can be relatively confident there won’t be discontinuities in AI investment (at least in the near-ish future).
OK, sure. I think I misread you.
Yes, but they spent more money and created a much larger model than other groups, sooner than I’d otherwise have expected. It also reaches some threshold of “scarily good” for me, which makes me surprised.
My impression was that it followed existing trends pretty well, but I haven’t looked into it deeply.
Charts from the paper suggest that it wasn’t a discontinuity in terms of validation loss, which tracks perplexity directly (perplexity is just the exponential of the per-token loss).
Also, from the Wikipedia page:
GPT-3’s full version has a capacity of 175 billion [parameters] [...] Prior to the release of GPT-3, the largest language model was Microsoft’s Turing NLG, introduced in February 2020, with a capacity of 17 billion parameters or less than 10 percent compared to GPT-3.
The year before, GPT-2 had 1.5 billion parameters and XLNet had 340M; the year before that, in 2018, BERT had 340M. Two charts of model sizes from around that time leave it unclear whether there was a discontinuity roughly at the time of Nvidia’s Megatron, particularly on a logarithmic scale. GPT-3 was 10x the size of Microsoft’s last model, but came only 4 months afterwards, which seems like it might break that exponential.
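As a rough check on that trend using only the figures quoted above (release-date placements are approximate and the fit uses just three points, so this is a sketch rather than a serious trend analysis):

```python
# Fit an exponential to the pre-GPT-3 parameter counts mentioned above and see
# where a May-2020 model "should" land on that trend. Dates are approximate.
import numpy as np

# (years since the start of 2018, parameter count)
pre_gpt3 = {
    "BERT-large (late 2018)": (0.8, 340e6),
    "GPT-2 (early 2019)": (1.1, 1.5e9),
    "Turing-NLG (Feb 2020)": (2.1, 17e9),
}
t = np.array([v[0] for v in pre_gpt3.values()])
p = np.array([v[1] for v in pre_gpt3.values()])

# Exponential growth is a straight line in log10 space.
slope, intercept = np.polyfit(t, np.log10(p), 1)
predicted_may_2020 = 10 ** (slope * 2.4 + intercept)  # t = 2.4 is roughly May 2020

print(f"fitted growth: ~{10**slope:.0f}x per year")
print(f"trend prediction for May 2020: ~{predicted_may_2020/1e9:.0f}B params vs GPT-3's 175B")
```

On this crude fit, GPT-3 lands a few times above the pre-existing exponential, which is consistent with the “might break that exponential” reading above, but the result is quite sensitive to which models and dates you include.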