Yesterday I read this blog post which speculated on the possibility of training GPT-2 in the year 2005, using a supercomputer. For context, please read that post. Today I had the following conversation, on the same topic, with ChatGPT (GPT-4) using the new Wolfram Alpha plugin. This new plugin can (attempt to) do math and even (attempt to) look up statistics, such as the FLOP/s of the most powerful supercomputers from the TOP500 list.
However, don’t get too excited, as ChatGPT doesn’t seem to know how to use its new toy. It got an error message almost every time it tried to use the plugin.
The most startling thing I saw in this conversation was that ChatGPT actually tried to recover from its own errors. At one point it even gave up and did the math “manually” (meaning by itself.) It’s possible that it learned how to do this from OpenAI engineers with clever prompting and training data, but it still shows some agency and fluid intelligence.
Everything said by ChatGPT is quoted as follows:
Model: Plugins alpha
Enabled plugins: Wolfram
Everything unquoted is me. I did not fix my own spelling mistakes and grammatical errors, so you can see the real prompts that ChatGPT is replying too. Also, I will not attempt to fix math mistakes or comment on them, but feel free to do the math “manually” and let me know if ChatGPT got it right. Note that it took me more time to format this conversation for the post than the time it took to have the conversation itself. Hopefully an export feature will get added.
Conversation
A recent article argues that it would have been possible to train GPT-2 in 2005, using the bluegene/l supercomputer. It would take a few weeks. He estimates that gpt-2 used 10^21 flops and about 40GB of training data. It might have been possible to collect that much data from the web at that time.
Based on this article’s logic, I want you to estimate what year GPT-3 would have been possible. Estimate how many flops and training data gpt-3 needed to train, and then determine when there was a supercomputer large enough to do it within a few months.
To estimate the year in which GPT-3 could have been trained, we need to consider several factors:
The number of floating-point operations per second (FLOPS) required to train GPT-3.
The amount of training data required to train GPT-3.
The computational capabilities of supercomputers available in different years.
Let’s start by estimating the FLOPS and training data required for GPT-3:
GPT-3 is a much larger model than GPT-2, with 175 billion parameters compared to GPT-2′s 1.5 billion parameters. This means that GPT-3 is over 100 times larger than GPT-2 in terms of the number of parameters. If we assume that the computational requirements for training scale linearly with the number of parameters, and given that GPT-2 required approximately 10^21 FLOPS to train, we can estimate that GPT-3 would require roughly 100 * 10^21 FLOPS, or 10^23 FLOPS, to train.
As for the training data, GPT-3 was trained on a diverse dataset called the “WebText” dataset, which includes a wide range of text from web pages. The dataset used for training GPT-3 is estimated to be around 45 terabytes (TB) in size. This is significantly larger than the 40 GB of training data mentioned for GPT-2.
Next, let’s consider the computational capabilities of supercomputers:
The performance of supercomputers has been increasing over the years, and the TOP500 list provides information about the most powerful supercomputers in the world at different points in time. To determine when a supercomputer capable of training GPT-3 within a few months would have been available, we need to find a supercomputer with a performance of at least several exaFLOPS (10^18 FLOPS) to handle the estimated 10^23 FLOPS required for GPT-3 training.
Let’s use the TOP500 list to find out when a supercomputer with the required performance became available.
Used Wolfram
Request to Wolfram
{
"input": "When was the first supercomputer with exaFLOPS performance available?"
}
Response from Wolfram
Wolfram|Alpha could not understand: When was the first supercomputer with exaFLOPS performance available?.
Things to try instead:
first with performance available
supercomputer
As of my knowledge cutoff date in September 2021, there were no supercomputers that had achieved exaFLOPS performance. The first exascale supercomputer was expected to be built in the early 2020s, but I do not have information beyond my knowledge cutoff date to confirm whether this has occurred.
Given that GPT-3 would require a supercomputer with performance on the order of exaFLOPS to be trained within a few months, and considering the estimated 45 TB of training data, it is likely that GPT-3 could not have been feasibly trained until the early 2020s or later, when exascale supercomputers were expected to become available.
Please note that these estimates are based on the information available as of September 2021 and are subject to change based on advancements in supercomputing technology and other factors.
When was gpt-3 trained?
GPT-3 (Generative Pre-trained Transformer 3) was developed by OpenAI, and the research paper introducing GPT-3 was published in May 2020. The model was trained prior to the publication of the paper, so the training of GPT-3 likely took place in late 2019 or early 2020. The exact dates of the training process are not publicly disclosed in the research paper.
What was the most powerful computer of 2012?
Used Wolfram
Request to Wolfram
{
"input": "Most powerful supercomputer in 2012"
}
Response from Wolfram
Wolfram|Alpha could not understand: Most powerful supercomputer in 2012.
Things to try instead:
Most powerful supercomputer
2012
As of my knowledge cutoff date in September 2021, the most powerful supercomputer in the world in 2012 was the IBM Sequoia, which was installed at the Lawrence Livermore National Laboratory (LLNL) in the United States. The IBM Sequoia was ranked as the number one supercomputer on the TOP500 list in June 2012.
The IBM Sequoia was based on IBM’s Blue Gene/Q architecture and was equipped with 1,572,864 processor cores. It achieved a peak performance of 20.13 petaFLOPS (quadrillions of floating-point operations per second) on the LINPACK benchmark, making it the fastest supercomputer in the world at that time.
Please note that this information is based on the TOP500 list and the state of supercomputing as of my knowledge cutoff date.
How long would it have taken sequoia to train gpt-3?
To estimate how long it would have taken the IBM Sequoia supercomputer to train GPT-3, we need to consider the computational requirements for training GPT-3 and the performance of the Sequoia supercomputer.
As previously mentioned, GPT-3 is a very large language model with 175 billion parameters. We estimated that GPT-3 would require roughly 10^23 FLOPS (floating-point operations per second) to train. This is a rough estimate based on the scaling of computational requirements from GPT-2 to GPT-3.
The IBM Sequoia supercomputer achieved a peak performance of 20.13 petaFLOPS (20.13 x 10^15 FLOPS) on the LINPACK benchmark.
Using these estimates, we can calculate the time required to train GPT-3 on Sequoia as follows:
Time (in seconds) = (10^23 FLOPS required for GPT-3) / (20.13 x 10^15 FLOPS per second of Sequoia)
Let’s perform this calculation to get an estimate of the training time.
Used Wolfram
Request to Wolfram
{
"input": "(10^23 FLOPS) / (20.13 * 10^15 FLOPS per second) to years"
}
Response from Wolfram
Query:
"(10^23 FLOPS) / (20.13 * 10^15 FLOPS per second) to years"
Input interpretation:
convert (10^23 FLOPS)/(2.013×10^16 FLOPS) to years
Result:
( and year are not compatible)
Wolfram|Alpha website result for "(10^23 FLOPS) / (20.13 * 10^15 FLOPS per second) to years":
https://www.wolframalpha.com/input?i=%2810%5E23+FLOPS%29+%2F+%2820.13+%2A+10%5E15+FLOPS+per+second%29+to+years
It seems there was an issue with the units in the query. Let’s correct the units and perform the calculation again:
Time (in seconds) = (10^23 FLOPS required for GPT-3) / (20.13 x 10^15 FLOPS per second of Sequoia)
We can then convert the result from seconds to years. Let’s calculate the estimated training time for GPT-3 on the IBM Sequoia supercomputer.
Used Wolfram
Request to Wolfram
{
"input": "(10^23) / (20.13 * 10^15) seconds to years"
}
Response from Wolfram
Query:
"(10^23) / (20.13 * 10^15) seconds to years"
Input interpretation:
convert 10^23/(20.13×10^15) seconds to years
Result:
0.1575 years
Additional conversions:
1.89 months
8.214 weeks
57.5 days
1380 hours
82795 minutes
Comparison as time:
~~ 34 × traditional American workweek ( 40 h )
Comparisons as period:
~~ 0.26 × orbital period of Venus ( 0.61519726 a )
~~ 0.65 × orbital period of Mercury ( 0.2408467 a )
Comparisons as half-life:
~~ ( 0.03 ~~ 1/33 ) × half-life of cobalt-60 ( 1.6635×10^8 s )
~~ ( 0.061 ~~ 1/17 ) × half-life of sodium-22 ( 8.2108×10^7 s )
Interpretations:
time
period
half-life
Wolfram|Alpha website result for "(10^23) / (20.13 * 10^15) seconds to years":
https://www.wolframalpha.com/input?i=%2810%5E23%29+%2F+%2820.13+%2A+10%5E15%29+seconds+to+years
Based on the calculation, it is estimated that it would have taken approximately 0.1575 years, or about 1.89 months, to train GPT-3 on the IBM Sequoia supercomputer. This estimate assumes that the computational requirements for training GPT-3 are on the order of 10^23 FLOPS, and that the peak performance of the IBM Sequoia supercomputer is 20.13 petaFLOPS.
Please note that this is a rough estimate and may not accurately reflect the actual training time for GPT-3. The training time for a neural network model like GPT-3 can be influenced by various factors, including the optimization algorithms used, the batch size, the distribution of training data across compute nodes, and other technical considerations. Additionally, the estimate does not take into account other practical constraints, such as memory limitations and data transfer speeds, which could affect the feasibility of training GPT-3 on the IBM Sequoia supercomputer.
It sounds like they would have been able to train gpt-3 in 2012 then. How much training data would have been available at that time? Maybe we can estimate the size of the entire web, and then account for the fact that a web scrape might not get all the data available.
Estimating the size of the entire web in 2012 is a challenging task due to the dynamic and decentralized nature of the internet. The total amount of data on the web is constantly changing as new content is created, old content is deleted or updated, and websites come and go. Additionally, not all data on the web is publicly accessible or suitable for training language models.
One way to estimate the size of the web is to consider the number of indexed pages reported by search engines at the time. However, this approach has limitations, as search engines do not index the entire web, and the size of individual pages can vary significantly.
The World Wide Web Size (www.worldwidewebsize.com) is a website that provides estimates of the number of indexed pages on the web based on data from major search engines. According to this website, the estimated number of indexed pages in 2012 was around 14 billion. However, this number only represents a fraction of the total web content, as it excludes non-indexed pages, multimedia content, and content behind paywalls or login screens.
To estimate the total amount of text data available for training language models, we would need to consider the average size of a web page, the proportion of text content on each page, and the fraction of the web that could be feasibly scraped and used for training. Unfortunately, precise estimates for these factors are not readily available.
It’s also important to note that collecting training data from the web involves legal and ethical considerations, such as copyright and privacy concerns. Language models like GPT-3 are typically trained on diverse datasets that include publicly available text from a wide range of sources, rather than scraping the entire web.
In summary, while it’s difficult to provide an accurate estimate of the total amount of training data available on the web in 2012, it’s likely that there was a substantial amount of text data that could have been used for training language models, subject to legal, ethical, and technical constraints.
Do you know of any large corpus of text data that were used by AI researchers in 2012?
In 2012, AI researchers had access to several large text corpora that were commonly used for natural language processing (NLP) tasks. Some of the notable corpora available at that time include:
Common Crawl: Common Crawl is a publicly available web archive that provides regularly updated web crawl data. It contains petabytes of data collected from the web, including raw web page data, metadata, and text extracts. Common Crawl has been used by researchers for a wide range of NLP tasks and is one of the largest publicly available web corpora.
Wikipedia Dump: Wikipedia provides periodic data dumps of its entire content. These dumps contain the full text of all Wikipedia articles and are often used by researchers as a large and diverse source of text data for training language models and other NLP tasks.
English Gigaword: The English Gigaword corpus is a comprehensive archive of newswire text data. It contains over four million articles from various news sources and is commonly used for tasks such as language modeling, named entity recognition, and machine translation.
Project Gutenberg: Project Gutenberg is a digital library of over 60,000 free eBooks, including many classic works of literature. The text data from Project Gutenberg has been used by researchers for various language-related tasks, including language modeling and text generation.
Penn Treebank: The Penn Treebank is a widely used annotated corpus of English text. It includes text from sources such as the Wall Street Journal, newswire services, and more. The corpus is annotated with part-of-speech tags, syntactic parses, and other linguistic information, making it valuable for syntactic and linguistic analysis.
It’s important to note that while these corpora were available to researchers in 2012, the specific versions and sizes of the corpora may have evolved over time. Additionally, researchers often combine data from multiple sources to create custom datasets tailored to specific research tasks.
What was the flop/s, in exponent form, of the most powerful supercomputer in every year from 2005-2015?
Used Wolfram
Request to Wolfram
{
"input": "FLOPS of most powerful supercomputer from 2005 to 2015"
}
Response from Wolfram
Wolfram|Alpha could not understand: FLOPS of most powerful supercomputer from 2005 to 2015.
Things to try instead:
most powerful supercomputer
2005 to 2015
As of my knowledge cutoff date in September 2021, the TOP500 list provides information about the most powerful supercomputers in the world for different years. Below, I will provide the peak performance (in FLOPS) of the most powerful supercomputer for each year from 2005 to 2015, based on the TOP500 list:
2005: IBM Blue Gene/L − 280.6 teraFLOPS (2.806 × 10^14 FLOPS)
2006: IBM Blue Gene/L − 367.0 teraFLOPS (3.67 × 10^14 FLOPS)
2007: IBM Blue Gene/L − 478.2 teraFLOPS (4.782 × 10^14 FLOPS)
2008: IBM Roadrunner − 1.026 petaFLOPS (1.026 × 10^15 FLOPS)
Please note that the performance values provided are peak performance values achieved by these supercomputers on the LINPACK benchmark. The values are rounded to three decimal places, and the FLOPS values are expressed in exponent form. The same supercomputer may appear in multiple years if it held the top position on the TOP500 list for more than one year.
Using your estimate of 10^23 flops to train gpt-3, tell me how many days each year’s computer would need.
To calculate the number of days each year’s supercomputer would need to train GPT-3, we can use the formula:
Time (in days) = (10^23 FLOPS required for GPT-3) / (Peak performance of supercomputer in FLOPS) / (Number of seconds in a day)
Let’s calculate the estimated training time for GPT-3 on each year’s most powerful supercomputer from 2005 to 2015, based on the peak performance values provided earlier:
2005: IBM Blue Gene/L − 2.806 × 10^14 FLOPS
2006: IBM Blue Gene/L − 3.67 × 10^14 FLOPS
2007: IBM Blue Gene/L − 4.782 × 10^14 FLOPS
2008: IBM Roadrunner − 1.026 × 10^15 FLOPS
2009: Cray Jaguar − 1.759 × 10^15 FLOPS
2010: Tianhe-1A − 2.566 × 10^15 FLOPS
2011: Fujitsu K Computer − 1.051 × 10^16 FLOPS
2012: IBM Sequoia − 2.013 × 10^16 FLOPS
2013: NUDT Tianhe-2 − 3.386 × 10^16 FLOPS
2014: NUDT Tianhe-2 − 3.386 × 10^16 FLOPS
2015: NUDT Tianhe-2 − 3.386 × 10^16 FLOPS
Let’s perform the calculations for each year’s supercomputer.
Based on the calculations, here is the estimated training time for GPT-3 on each year’s most powerful supercomputer from 2005 to 2015, expressed in days:
2005: IBM Blue Gene/L − 4124.76 days
2006: IBM Blue Gene/L − 3153.70 days
2007: IBM Blue Gene/L − 2420.34 days
2008: IBM Roadrunner − 1128.08 days
2009: Cray Jaguar − 657.99 days
2010: Tianhe-1A − 451.06 days
2011: Fujitsu K Computer − 110.12 days
2012: IBM Sequoia − 57.50 days
2013: NUDT Tianhe-2 − 34.18 days
2014: NUDT Tianhe-2 − 34.18 days
2015: NUDT Tianhe-2 − 34.18 days
Please note that these estimates are based on the peak performance values of each supercomputer and the assumption that GPT-3 requires 10^23 FLOPS to train. The actual training time may vary due to factors such as memory limitations, data transfer speeds, and other technical considerations. Additionally, these estimates do not take into account the availability and size of training data, which are also important factors in training large language models like GPT-3.
If we estimate that GPT-4 took 10^25 flops, how many days would it take to train on each year’s supercomputer.
Used Wolfram
Request to Wolfram
{
"input": "flopsRequiredGPT4 = 10^25;\ntrainingTimeInDaysGPT4 = flopsRequiredGPT4 / peakPerformance / secondsInADay;\ntrainingTimeInDaysGPT4"
}
Response from Wolfram
"10000000000000000000000000/(peakPerformance secondsInADay)"
It seems there was an issue with the calculation. Let’s try to recalculate the estimated training time for GPT-4 on each year’s most powerful supercomputer from 2005 to 2015, using the updated estimate of 10^25 FLOPS required for GPT-4:
2005: IBM Blue Gene/L − 2.806 × 10^14 FLOPS
2006: IBM Blue Gene/L − 3.67 × 10^14 FLOPS
2007: IBM Blue Gene/L − 4.782 × 10^14 FLOPS
2008: IBM Roadrunner − 1.026 × 10^15 FLOPS
2009: Cray Jaguar − 1.759 × 10^15 FLOPS
2010: Tianhe-1A − 2.566 × 10^15 FLOPS
2011: Fujitsu K Computer − 1.051 × 10^16 FLOPS
2012: IBM Sequoia − 2.013 × 10^16 FLOPS
2013: NUDT Tianhe-2 − 3.386 × 10^16 FLOPS
2014: NUDT Tianhe-2 − 3.386 × 10^16 FLOPS
2015: NUDT Tianhe-2 − 3.386 × 10^16 FLOPS
Let’s perform the calculations for each year’s supercomputer using the updated FLOPS requirement for GPT-4.
Used Wolfram
Request to Wolfram
{
"input": "flopsRequiredGPT4 = 10^25;\ntrainingTimeInDaysGPT4 = flopsRequiredGPT4 / peakPerformance / secondsInADay;\nN[trainingTimeInDaysGPT4]"
}
Response from Wolfram
"1. ^25/(peakPerformance secondsInADay)"
It appears that there is still an issue with the calculation. Let’s manually calculate the estimated training time for GPT-4 on each year’s most powerful supercomputer from 2005 to 2015, using the updated estimate of 10^25 FLOPS required for GPT-4:
We’ll use the formula:
Time (in days) = (10^25 FLOPS required for GPT-4) / (Peak performance of supercomputer in FLOPS) / (Number of seconds in a day)
The number of seconds in a day is 86,400.
Let’s calculate the estimated training time for GPT-4 on each year’s supercomputer:
2005: IBM Blue Gene/L - (10^25 FLOPS) / (2.806 × 10^14 FLOPS) / 86,400 seconds ≈ 4.12 × 10^8 days
2006: IBM Blue Gene/L - (10^25 FLOPS) / (3.67 × 10^14 FLOPS) / 86,400 seconds ≈ 3.15 × 10^8 days
2007: IBM Blue Gene/L - (10^25 FLOPS) / (4.782 × 10^14 FLOPS) / 86,400 seconds ≈ 2.42 × 10^8 days
2008: IBM Roadrunner - (10^25 FLOPS) / (1.026 × 10^15 FLOPS) / 86,400 seconds ≈ 1.13 × 10^8 days
Please note that these estimates are based on the peak performance values of each supercomputer and the assumption that GPT-4 requires 10^25 FLOPS to train. The actual training time may vary due to factors such as memory limitations, data transfer speeds, and other technical considerations. Additionally, these estimates do not take into account the availability and size of training data, which are also important factors
in training large language models like GPT-4.
It seems like there was a compute overhang during the 2010s. They could have trained much more powerful ML models, but they didn’t know that they could. Now we are using all the compute we can apply to the problem of making powerful systems like gpt-4
The concept of “compute overhang” refers to a situation where there is a significant gap between the computational resources available and the computational resources being utilized for a specific task or set of tasks. In the context of machine learning and AI research, a compute overhang could mean that researchers had access to more computational power than they were using to train models at a given time.
There are several reasons why a compute overhang might have existed during the 2010s:
Algorithmic Advances: The development of more powerful machine learning models often depends on algorithmic innovations and breakthroughs. During the 2010s, researchers made significant advances in neural network architectures, optimization techniques, and training methodologies. These advances enabled the training of larger and more complex models, such as GPT-3 and GPT-4, which require vast amounts of compute.
Data Availability: The training of large language models requires not only substantial computational resources but also large and diverse datasets. The availability of big data, including text corpora, increased significantly during the 2010s, enabling the training of more powerful models.
Awareness and Experimentation: Researchers may not have been fully aware of the potential of using all available compute to train more powerful models. Experimentation with larger models and more compute-intensive training may have been limited by budget constraints, hardware availability, and other practical considerations.
Hardware and Infrastructure: The development of specialized hardware, such as GPUs and TPUs, and the availability of cloud computing infrastructure have made it more feasible to scale up machine learning training. These advances have contributed to the ability to fully utilize available compute resources.
It’s important to note that the development of powerful AI models like GPT-3 and GPT-4 is not solely a matter of applying more compute. It also involves advances in algorithms, data collection and preprocessing, model architecture design, and other research areas. Today, researchers continue to explore ways to make the most of available computational resources while also seeking to improve model efficiency, interpretability, and safety.
Based on the hardware and software situation of 2021, do you believe that there is still a compute overhang, or is AI research now compute constrained? Also, are we now more constrained by compute, data or something else?
As of my knowledge cutoff date in September 2021, AI research is experiencing both compute constraints and other types of constraints. The extent to which AI research is constrained by compute, data, or other factors can vary depending on the specific research area, the scale of the models being developed, and the resources available to researchers. Here are some considerations:
Compute Constraints: While computational resources have grown significantly over the years, the demand for compute in AI research has also increased, especially for training very large models such as GPT-3 and large-scale transformer models. Researchers are often limited by the availability and cost of computational resources, particularly for projects that require extensive experimentation and hyperparameter tuning. Access to specialized hardware such as GPUs and TPUs can also be a limiting factor.
Data Constraints: The availability and quality of data are critical factors in AI research. Large and diverse datasets are needed to train powerful models, but collecting and curating such datasets can be challenging. Data constraints can include issues related to data privacy, data labeling, data biases, and the representativeness of the data. Researchers are exploring techniques such as data augmentation, transfer learning, and few-shot learning to address data constraints.
Algorithmic and Architectural Constraints: Advances in AI research are not solely dependent on compute and data; they also rely on innovations in algorithms, model architectures, and optimization techniques. Researchers are exploring ways to improve model efficiency, generalization, interpretability, and robustness, as well as investigating new paradigms beyond traditional deep learning.
Ethical and Societal Constraints: AI research is increasingly subject to ethical and societal considerations, including fairness, accountability, transparency, and safety. Ensuring that AI models are ethical, unbiased, and do not perpetuate harmful behavior is a priority for researchers and requires careful attention to both model development and deployment.
Overall, AI research faces a combination of constraints, and addressing these constraints requires a multifaceted approach that goes beyond increasing compute and data availability. Researchers are actively working on methods to overcome these constraints and develop AI models that are more efficient, effective, and aligned with human values.
Suppose there is a textbook from 2021 which explains the transformer model, large language models, AI art, how to implement them, and what they can do. Suppose it gets transported through time and lands in the lap of the guy in charge of bluegene/l in 2005. He trains gpt-2, and shows everyone how to do it. Write a story about how AI research progresses from there, covering the years 2005-2021
Once upon a time in 2005, in a research lab nestled within the walls of a prestigious institution, a supercomputer named Blue Gene/L hummed away, crunching numbers and solving complex scientific problems. The lab’s director, Dr. Ada Lovelace, was a visionary computer scientist known for her groundbreaking work in high-performance computing.
One day, as Dr. Lovelace was reviewing the latest simulations, something unusual caught her eye—a mysterious book that seemed to have appeared out of nowhere on her desk. Curious, she opened the book and was astonished to find that it was a textbook from the future—dated 2021. The book detailed the inner workings of a revolutionary neural network architecture called the “transformer model,” along with comprehensive explanations of large language models, AI-generated art, and practical implementation guides.
Dr. Lovelace was captivated. She realized that the knowledge contained within the book had the potential to revolutionize the field of artificial intelligence. With Blue Gene/L at her disposal, she wasted no time in assembling a team of researchers to embark on a daring experiment: to train a language model called GPT-2 using the supercomputer’s immense computational power.
The experiment was a resounding success. GPT-2 demonstrated an unprecedented ability to generate coherent and contextually appropriate text, capturing the attention of the global research community. News of the breakthrough spread like wildfire, and the world was abuzz with excitement about the possibilities of AI.
As the years went by, AI research progressed at an accelerated pace. Inspired by the success of GPT-2, researchers around the world began to explore the capabilities of transformer models in various domains. AI-generated art flourished, with algorithms creating stunning visual masterpieces that blurred the lines between human and machine creativity.
Language models became increasingly sophisticated, with each new iteration pushing the boundaries of what was thought possible. AI-powered virtual assistants became ubiquitous, capable of engaging in natural conversations and providing valuable assistance in everyday tasks.
The impact of AI extended beyond language and art. In healthcare, AI models were trained to analyze medical images with remarkable accuracy, aiding in the early detection of diseases. In finance, AI-driven algorithms optimized investment strategies and improved risk assessment.
However, the rapid advancement of AI was not without challenges. Ethical considerations came to the forefront as researchers grappled with issues of bias, fairness, and transparency. The potential for AI-generated misinformation and deepfakes raised concerns about the integrity of information in the digital age.
Dr. Lovelace, now a renowned figure in the field of AI, became a vocal advocate for responsible AI research and development. She championed the importance of ethical guidelines and interdisciplinary collaboration, emphasizing the need to ensure that AI technologies were aligned with human values and societal goals.
As the year 2021 approached, the world of AI looked vastly different from what it might have been without the mysterious textbook from the future. The field had matured, with AI systems integrated into countless aspects of society, enhancing human capabilities and driving innovation.
Dr. Lovelace often reflected on the serendipitous arrival of the textbook that had set these events in motion. While the origin of the book remained a mystery, its impact was undeniable. It had served as a catalyst for a new era of AI—one that held the promise of a better, brighter future for humanity.
And so, the story of AI research continued to unfold, with each new discovery opening doors to uncharted possibilities and shaping the world in ways that were once the stuff of science fiction.
I mentioned an article that claimed gpt-2 could have been trained in 2005. One criticism of the article is that gpt-2 wasn’t really useful for anything, it was just a curiosity. So I think things would happen a little differently from your story. Gpt-2 would create a lot of interest in the AI world but the public would not take notice until about 2011-2012 when it became possible to train gpt-3. And I think it would take at least until about 2015 for it to impact people’s lives. Do you agree?
As an AI language model, I don’t have opinions or beliefs, but I can provide some analysis based on available information.
GPT-2, released by OpenAI in 2019, was indeed a significant advancement in the field of natural language processing (NLP) and language modeling. It demonstrated the ability to generate coherent and contextually appropriate text, which was a notable achievement. However, GPT-2 was primarily seen as a research milestone and a demonstration of the capabilities of large-scale language models. While it generated interest and excitement in the AI research community, its practical applications were limited at the time of its release.
GPT-3, released in 2020, built upon the foundation laid by GPT-2 and represented a substantial leap in terms of scale and capabilities. GPT-3′s ability to perform various language tasks with minimal fine-tuning made it more versatile and applicable to a wide range of real-world use cases, including language translation, text summarization, code generation, and conversational AI. As a result, GPT-3 garnered significant attention both within and outside the AI research community, and its potential for practical applications became more apparent.
In a hypothetical scenario where GPT-2 could have been trained in 2005, it’s plausible that it would have been seen as an important research achievement and a curiosity within the AI community, but its broader impact on society may have been limited. The development of GPT-3 or a similar large-scale language model would likely have been necessary to unlock more practical applications and generate widespread public interest.
It’s also important to consider that the development and adoption of AI technologies depend on various factors, including the availability of computational resources, data, algorithmic advances, and infrastructure. Additionally, ethical considerations, public perception, and regulatory frameworks play a role in shaping the trajectory of AI research and its impact on society.
Overall, while the hypothetical scenario of training GPT-2 in 2005 is an interesting thought experiment, the actual progression of AI research and its impact on society is influenced by a complex interplay of technological, societal, and ethical factors.
I myself was became interested when I first heard about gpt-2. I notice that I never hear about gpt-1. Can you tell me about language models before gpt-2?
Before GPT-2, there were several important language models and architectures that laid the groundwork for the development of large-scale transformer-based language models like GPT-2 and GPT-3. One of these models was GPT-1, also known as the original “Generative Pre-trained Transformer.” Here’s an overview of GPT-1 and some other notable language models that preceded GPT-2:
GPT-1 (Generative Pre-trained Transformer): GPT-1 was introduced by OpenAI in a paper titled “Improving Language Understanding by Generative Pre-training” in 2018. GPT-1 was a transformer-based language model that used unsupervised pre-training followed by supervised fine-tuning for specific natural language processing (NLP) tasks. The model was pre-trained on a large corpus of text data using a language modeling objective, and then fine-tuned on smaller labeled datasets for tasks such as text classification, entailment, and similarity. GPT-1 demonstrated strong performance on a variety of NLP benchmarks, and its success motivated the development of GPT-2.
Word2Vec: Word2Vec, introduced by researchers at Google in 2013, is a family of neural network-based models used to produce dense vector representations of words (word embeddings) based on their co-occurrence in a text corpus. Word2Vec models, such as Skip-Gram and Continuous Bag of Words (CBOW), were influential in advancing NLP research by enabling the capture of semantic and syntactic relationships between words in a continuous vector space.
LSTM (Long Short-Term Memory): LSTM is a type of recurrent neural network (RNN) architecture introduced by Hochreiter and Schmidhuber in 1997. LSTMs are designed to address the vanishing gradient problem in traditional RNNs, allowing them to effectively model long-term dependencies in sequences of data. LSTMs have been widely used in NLP tasks such as language modeling, machine translation, and text generation.
ELMo (Embeddings from Language Models): ELMo, introduced by researchers at the Allen Institute for Artificial Intelligence in 2018, is a language model that generates contextualized word embeddings. Unlike traditional word embeddings, which assign a fixed vector to each word, ELMo produces embeddings that are context-dependent, allowing the model to capture different meanings of the same word based on its usage in a sentence. ELMo demonstrated improved performance on a range of NLP tasks.
BERT (Bidirectional Encoder Representations from Transformers): BERT, introduced by researchers at Google in 2018, is a transformer-based model that is pre-trained using a masked language modeling objective and a next sentence prediction task. BERT is designed to be bidirectional, allowing it to capture context from both the left and right sides of a token. BERT achieved state-of-the-art performance on a variety of NLP benchmarks and tasks.
These models and architectures represent important milestones in the development of language models and NLP research. They contributed to the understanding of language representation, context modeling, and transfer learning, paving the way for the development of more advanced models like GPT-2 and GPT-3.
GPT-2005: A conversation with ChatGPT (featuring semi-functional Wolfram Alpha plugin!)
Yesterday I read this blog post which speculated on the possibility of training GPT-2 in the year 2005, using a supercomputer. For context, please read that post. Today I had the following conversation, on the same topic, with ChatGPT (GPT-4) using the new Wolfram Alpha plugin. This new plugin can (attempt to) do math and even (attempt to) look up statistics, such as the FLOP/s of the most powerful supercomputers from the TOP500 list.
However, don’t get too excited, as ChatGPT doesn’t seem to know how to use its new toy. It got an error message almost every time it tried to use the plugin.
The most startling thing I saw in this conversation was that ChatGPT actually tried to recover from its own errors. At one point it even gave up and did the math “manually” (meaning by itself.) It’s possible that it learned how to do this from OpenAI engineers with clever prompting and training data, but it still shows some agency and fluid intelligence.
Everything said by ChatGPT is quoted as follows:
Everything unquoted is me. I did not fix my own spelling mistakes and grammatical errors, so you can see the real prompts that ChatGPT is replying too. Also, I will not attempt to fix math mistakes or comment on them, but feel free to do the math “manually” and let me know if ChatGPT got it right. Note that it took me more time to format this conversation for the post than the time it took to have the conversation itself. Hopefully an export feature will get added.
Conversation
A recent article argues that it would have been possible to train GPT-2 in 2005, using the bluegene/l supercomputer. It would take a few weeks. He estimates that gpt-2 used 10^21 flops and about 40GB of training data. It might have been possible to collect that much data from the web at that time.
Based on this article’s logic, I want you to estimate what year GPT-3 would have been possible. Estimate how many flops and training data gpt-3 needed to train, and then determine when there was a supercomputer large enough to do it within a few months.
When was gpt-3 trained?
What was the most powerful computer of 2012?
How long would it have taken sequoia to train gpt-3?
It sounds like they would have been able to train gpt-3 in 2012 then. How much training data would have been available at that time? Maybe we can estimate the size of the entire web, and then account for the fact that a web scrape might not get all the data available.
Do you know of any large corpus of text data that were used by AI researchers in 2012?
What was the flop/s, in exponent form, of the most powerful supercomputer in every year from 2005-2015?
Using your estimate of 10^23 flops to train gpt-3, tell me how many days each year’s computer would need.
If we estimate that GPT-4 took 10^25 flops, how many days would it take to train on each year’s supercomputer.
It seems like there was a compute overhang during the 2010s. They could have trained much more powerful ML models, but they didn’t know that they could. Now we are using all the compute we can apply to the problem of making powerful systems like gpt-4
Based on the hardware and software situation of 2021, do you believe that there is still a compute overhang, or is AI research now compute constrained? Also, are we now more constrained by compute, data or something else?
Suppose there is a textbook from 2021 which explains the transformer model, large language models, AI art, how to implement them, and what they can do. Suppose it gets transported through time and lands in the lap of the guy in charge of bluegene/l in 2005. He trains gpt-2, and shows everyone how to do it. Write a story about how AI research progresses from there, covering the years 2005-2021
I mentioned an article that claimed gpt-2 could have been trained in 2005. One criticism of the article is that gpt-2 wasn’t really useful for anything, it was just a curiosity. So I think things would happen a little differently from your story. Gpt-2 would create a lot of interest in the AI world but the public would not take notice until about 2011-2012 when it became possible to train gpt-3. And I think it would take at least until about 2015 for it to impact people’s lives. Do you agree?
I myself was became interested when I first heard about gpt-2. I notice that I never hear about gpt-1. Can you tell me about language models before gpt-2?