The Power of High Speed Stupidity
Link post
The launch of GPT-4 got me thinking about how humans create valuable things that are too complex to understand. GPT-4 is an example of a broader class of valuable-but-incomprehensible things that includes the world economy, the source code for large products like Google Search, and life itself.
OpenAI created GPT-4 by having a vast number of “stupid” GPU cores iterate very fast, making a series of tiny improvements to an incredibly complex model of human knowledge, until eventually the model turned into something valuable. Matt Ridley has argued that a similar iterative evolutionary process is responsible for the creation of pretty much everything that matters.
But this kind of productive evolution doesn’t just happen by itself. You can’t just pour out a pile of DNA, or people, or companies, or training data, and expect something good to magically evolve. For productive evolution to happen, you need the right structures to direct it. Those structures ensure that iteration is fast, that each iterative step has low risk of causing harm, and that iteration progresses in a good direction.
In the rest of this post I’ll look at what this means in practice, drawing on examples from machine learning, metric-driven development, online experimentation, programming tools, large scale social behavior, and online speech.
Given the amazing things GPT can do, you might think that it was created through “Intelligent Design” — some omniscient god-like programmers at OpenAI thought slowly and deeply about the way all human knowledge works and came up with an intricate “grand theory” that allowed everything to be understood. Back in the days of “Expert Systems”, that is how most people thought Artificial Intelligence would be achieved, and indeed people like Noam Chomsky still feel that this is the real way to create AI.
In fact, the part of GPT written by humans is simple enough that you can implement it in only 60 lines of code. That code describes a “representation” (the Transformer) that can model human knowledge in terms of billions of parameters whose values we don’t know. The code also provides a “cost function” that measures the accuracy of a particular model by seeing how well it predicts the next word in a passage of human-written text.
While this code is very simple, it is enough to guide a rapid iterative process that creates something amazing. We simply initialize the billions of parameters to random numbers and direct hundreds of thousands of GPU cores to repeatedly test how well the model predicts the next word for a passage of text and to make tiny changes to the model parameters so that it predicts the next word better. Given enough processor time, this simple iterative process eventually leads to a model that is remarkably accurate.
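To make that loop concrete, here is a minimal sketch of the same structure — a parameterized representation, a next-token cost function, and many tiny updates — using a toy bigram table rather than a real Transformer, and assuming PyTorch is installed. It illustrates the shape of the process, not OpenAI’s actual code.

```python
import torch
import torch.nn.functional as F

# Toy training data: a short string, with each character standing in for a token.
text = "the quick brown fox jumps over the lazy dog "
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

# The "representation": a table of logits saying which token tends to follow which.
# GPT has billions of parameters here; this toy has a few hundred. Start random.
logits = torch.randn(len(vocab), len(vocab), requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(500):
    xs, ys = data[:-1], data[1:]             # predict each next token from the current one
    loss = F.cross_entropy(logits[xs], ys)   # the "cost function": how badly did we predict?
    opt.zero_grad()
    loss.backward()                          # work out which tiny changes would help
    opt.step()                               # make those tiny changes

print(f"final loss: {loss.item():.3f}")
```

Swap the bigram table for a Transformer, the toy string for a large fraction of the internet, and one CPU for hundreds of thousands of GPU cores, and the loop is essentially unchanged.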
We can see a similar pattern in the structure companies like Meta/Facebook (where I worked as a Product Manager) use to direct the action of their employees.
Like GPT’s parameterized model, Facebook has a “representation” of its product in the form of its source code. Like GPT, that code is too huge and messy for any engineer to truly understand how it works. Facebook also has a “cost function” in terms of a simple, easily measured engagement metric that is used to guide iteration.
Facebook the company created its product by starting with something very simple and then having tens of thousands of employees make rapid iterative changes to the code. Changes that improve the engagement metric are kept, and so the product improves, as judged by that metric.
A key part of Facebook culture is that it’s more important to iterate rapidly than to always do the right thing. Facebook thus places great importance on having very simple engagement-based metrics that can be used to judge whether a launch is good based on data from a short experiment, rather than requiring wise people to think deeply about whether a change is actually good.
The problem with requiring wise people to think deeply about whether a change is good is that thinking deeply is slow. When I was at Facebook, it was common for engineers to have suspicions that the changes they were shipping were actually making the product worse, but the cultural norm was to ship such changes anyway. The assumption is that it’s worth the cost of shipping some changes that make the product worse if it allows the company to iterate faster.
That doesn’t mean that concerns about bad launches were completely ignored. However, the expected thing to do if you have concerns about a bad launch isn’t to hold up your launch (except in extreme circumstances), but to pass the concern on to the data science team, who use those concerns to craft updated metrics that allow teams to continue to iterate fast, but with fewer launches making the product worse. Indeed, it was a process like this that caused the transition from “time spent” as the core metric to “meaningful social interactions”.
There are of course disadvantages to requiring that your metric be something simple and easily measured. “Not everything that counts can be counted”, and it’s hard to measure things like “is this morally right, or improving people’s lives” using this kind of mechanism. However there are clear benefits in finding ways to judge potential changes as quickly as possible.
Being stupid really fast often beats being thoughtful really slowly.
A/B tests are a core component of the iterative process for most online products. The team will make a small product change that might make the product better, expose that modified product to a small fraction (e.g. 1%) of users, and see whether the product improved, as judged by a simple metric (e.g. time spent).
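As a rough sketch of the mechanics (the real systems are far more elaborate, and the names here are made up): users are deterministically hashed into a small treatment group, the change is enabled only for them, and afterwards the metric is compared between groups.

```python
import hashlib
import statistics

def in_treatment(user_id: str, experiment: str, fraction: float = 0.01) -> bool:
    """Deterministically put roughly `fraction` of users into the treatment group."""
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(h, 16) % 10_000 < int(fraction * 10_000)

def metric_lift(control_minutes: list[float], treatment_minutes: list[float]) -> float:
    """Difference in the simple metric (e.g. time spent) between the two groups."""
    return statistics.mean(treatment_minutes) - statistics.mean(control_minutes)

# Usage: show the modified product only when in_treatment(user_id, "new_ranker")
# is True, log the metric for everyone, and read out metric_lift at the end.
```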
Such experiments can be controversial, because people don’t always like the idea that they are being “experimented on”. For example, there was a lot of media anger when Facebook revealed that they had run an experiment to test whether boosting the ranking of positive or negative posts would make people happier.
What’s often not appreciated is that the vast majority of experiments make the product worse. As a rough ball-park figure, probably 10% of experiments totally break the product in a way that makes it unusable, 40% of them make the product significantly worse in some easily measurable way, 40% of them make the product worse in some minor or inconclusive way, and 10% of them make the product better.
Those numbers sound terrible until you realize that the 10% of experiments that totally broke the product got taken down within minutes by automated systems, the 40% that made the product significantly worse got shut down within a day when analysts saw the numbers trending badly, the 40% that made the product slightly worse only affected 1% of users for a week, and the 10% that made the product better were launched and benefited all users forever.
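A back-of-the-envelope calculation, using the ballpark percentages above plus some assumed exposure times and per-change quality effects, shows why this works out: the bad outcomes are confined to 1% of users for minutes to days, while the one good outcome in ten reaches everyone indefinitely.

```python
HOURS_PER_WEEK = 24 * 7

# (share of experiments, fraction of users exposed, exposure hours, assumed quality effect)
outcomes = [
    (0.10, 0.01, 0.1,                 -1.0),  # breaks the product, killed within minutes
    (0.40, 0.01, 24,                  -0.5),  # clearly worse, killed within a day
    (0.40, 0.01, HOURS_PER_WEEK,      -0.1),  # slightly worse, runs for the full week
    (0.10, 1.00, HOURS_PER_WEEK * 52, +0.1),  # better, launched to everyone for (say) a year
]

impact = sum(share * users * hours * effect for share, users, hours, effect in outcomes)
print(f"net user-hour-weighted impact per experiment: {impact:+.1f}")  # comes out clearly positive
```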
We could have improved the quality of our experiments by thinking more deeply, but thinking deeply is slow, which means that we would have launched fewer experiments, and the benefits of rapid iteration usually exceed the benefits of making fewer mistakes.
An important property of online experiments is that each experiment is limited in how much harm it can cause if it makes the product worse. An experiment that makes Google Search results worse has harm limited to the small number of users the experiment was enabled for, and the small number of their queries it was applied to. If a bad experiment had the potential to cause permanent harm to its users, rapid experimentation would be much more challenging.
We can see a similar pattern in the “blameless postmortems” that are used in many tech companies.
If someone does something that causes something bad to happen (e.g. takes the site down or launches a very bad product change), a “postmortem” doc is written that investigates how the bad thing happened. The core outcome of a postmortem is usually a change to the company’s process that makes it less likely that such bad outcomes happen again—such as a new automated test.
Critically, the postmortem does not name the person who made the mistake, and the person who made the mistake will usually suffer no negative career consequences for having made that mistake. I experienced this first hand when I made a bad change to Google Search’s indexing system that caused indexing to fail.
The most important reason to have blameless postmortems is that they remove the need for employees to be careful. Being careful is slow, so careful employees will iterate more slowly and your product will improve less. Anything a company can do to remove the need for employees to be careful will increase productivity.
A lot of software development tools are about removing the need for software engineers to be careful, and allowing them to iterate as quickly as possible.
When I use my editor to make a change to my product’s source code, the syntax highlighter will immediately highlight anything with obvious errors in red. (latency—milliseconds)
If I use a typed language (e.g. Typescript or Swift) I’ll have the type-checker running in the background every time I modify a file, and my editor will let me know about problems with my code that are simple enough for the type checker to catch. (latency—seconds)
Developers are encouraged to write “unit tests” for their code that test whether it is working well. A good unit test should complete within a few seconds, and should detect the majority of ways in which someone might accidentally break the product. Developers typically have their system configured to automatically watch what files they modify and run the unit tests for all modules that depend on the changed files. (latency—tens of seconds)
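For example, a unit test in this spirit might look like the sketch below (the function under test is a toy stand-in): it runs in well under a second with pytest, and a file-watcher such as pytest-watch can re-run it automatically on every save.

```python
def parse_query(raw: str) -> list[str]:
    """Toy stand-in for real product code: split a search query into terms."""
    return raw.split()

# Fast, focused tests that catch the most likely ways someone could break this.
def test_strips_surrounding_whitespace():
    assert parse_query("  cats  ") == ["cats"]

def test_empty_query_has_no_terms():
    assert parse_query("") == []
```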
Where possible, engineers will use interpreted languages (e.g. Javascript or Python) to avoid the latency of waiting for their code to compile. If I’m implementing a web app in Javascript, I can make a change to my source code, see the web site immediately refresh, and judge whether my change worked. Even if a compiled language only takes a few seconds to compile, that extra latency is enough to significantly slow developer iteration.
Engineers working on products like Google Search or Facebook Feed Ranking use offline evaluation systems that predict how well their code will work with real users by testing it against past data—avoiding the need to wait days for the results of a live experiment. Does your new search ranking model accurately predict which search results human raters thought were good? Does your new Feed Ranking model accurately predict which things users clicked on? (latency—minutes)
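A minimal sketch of the offline-evaluation idea (with hypothetical field names): replay logged sessions and check whether the new ranker would have placed the item the user actually clicked near the top, without waiting for a live experiment.

```python
from typing import Callable, Sequence

def offline_eval(
    sessions: Sequence[dict],           # each: {"candidates": [...], "clicked": item}
    rank: Callable[[Sequence], list],   # the new ranking function under test
    k: int = 3,
) -> float:
    """Fraction of logged sessions where the clicked item lands in the top k."""
    hits = 0
    for s in sessions:
        ranked = rank(s["candidates"])
        hits += s["clicked"] in ranked[:k]
    return hits / len(sessions)

# Usage: compare offline_eval(logged_sessions, new_ranker) against the same call
# with the old ranker to see in minutes whether the change looks promising.
```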
Finally, if your code passes all those low latency tests, it can be tested using a live experiment, which typically runs for one to two weeks. While the experiment is running, the engineers can occupy their time by rapidly iterating on a different project that isn’t blocked on experiment results.
Hardware companies like Intel (where I worked) can’t afford to wait for their designs to be turned into physical chips, so instead they rely heavily on simulations. This allows them to iterate rapidly, making small changes to the design of the processor and checking whether it executes without bugs and performs well on simulated workloads.
Large machine learning models like GPT can take months to train. If developers had to wait months to know whether a change improved their model, rapid progress would be impossible. Instead, developers spend most of their time working on tiny models that take seconds or minutes to train, and then come up with scaling rules that let them predict the behavior of larger models from the behavior of those smaller ones.
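The scaling-rule trick can be sketched as follows (the numbers here are made up): train a handful of small models, fit a power law relating parameter count to loss, and use it to predict how a far larger model would behave before committing months of compute to it.

```python
import numpy as np

# Measured results from small, fast-to-train models (made-up numbers).
params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])   # parameter counts
loss   = np.array([4.1, 3.7, 3.3, 3.0, 2.7])   # validation losses

# Fit loss ~ a * params**b (with b negative) as a straight line in log-log space.
b, log_a = np.polyfit(np.log(params), np.log(loss), 1)
predict = lambda n: np.exp(log_a) * n ** b

print(f"predicted loss at 100B parameters: {predict(1e11):.2f}")
```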
The list goes on but you get the point.
We can see similar iterative processes, and structures to direct them, all throughout society.
Capitalism allows thousands of companies to rapidly try out many different products and business models to see what works, with “profit” as their guiding metric. Iterative progress stalls when power becomes consolidated in a few companies, and thus it is a primary responsibility of the government to prevent consolidation of power. Well-crafted regulations ensure that activities that make companies money are activities that benefit society, and those regulations should ideally be simple enough that they don’t slow down iteration.
Federalism and local government allow different countries, states, towns, and sub-cultures to try out different rules in different places, allowing society to rapidly learn what ways of organizing a society work best, while limiting the negative impacts when a bad local law is introduced.
The Hippie Communes of the 1970s were arguably one of the most American things ever. Other countries like Russia and China relied on “Intelligent Design” where wise leaders used deep thought to come up with the right way to run society, imposed it top down on the whole country, and caused the deaths of millions of people. By contrast, the Hippie Communes of the US created thousands of wildly different social experiments, tested them to see what worked, and caused very little harm when they failed.
Academic publishing has been famously dysfunctional for a long time. If progress depends on rapid iteration, the last thing you want is a publication process that takes months to share results with other people, a peer-review process that screens out the most novel ideas in case they are wrong, and a grant-making process that punishes failure. One reason why Machine Learning is such a successful research discipline is that, since most researchers would rather get an industry job than an academic job, the field has largely abandoned the traditional academic publication system and shares results rapidly.
I used to be part of a comedy group called “Footlights” at Cambridge. When I was part of the group I’d spend lots of time hanging out with other comedians and we would write new comedy material by making iterative changes to each other’s comedy routines and testing them on each other. Once I left the group I mostly lost the ability to write comedy. It seems likely that social environments for rapid iteration on ideas are key to many fields of human creativity.
Selective breeding has created plants and animals (e.g. dog breeds) that have been hugely valuable to humanity. The DNA of these life-forms is far too complex for people to understand, but we were able to create them using a process of rapid iteration, where we bred lots of random variants, and then used structures like dog shows to identify and share the incremental modifications that worked best.
What does this mean for online discussion products like Facebook and Twitter?
Let’s assume that the purpose of online discussions is to come up with ‘effective’ beliefs about the world that cause people who hold those beliefs to do positive productive things.
If that is the case, then, following the framework established in the rest of this post, we’d want to encourage people to come up with new incremental improvements to existing beliefs as quickly as possible, and we’d want to reward the beliefs that were most beneficial to the groups that believed them.
Current online platforms clearly fail at both of these requirements. The beliefs that spread rapidly are more likely to be beliefs that make people angry rather than beliefs that have been shown to be beneficial. Similarly, the very real potential for harmful beliefs to spread and cause damage has led to the expectation that people who say the wrong things might be subject to “cancellation”, which slows down the speed at which people can iterate to find better beliefs.
So what are we to conclude from all this?
One thing I think we can conclude is that speed often matters more than “deep thought” or avoiding making mistakes. Amazing things can happen if you can empower people to “be stupid really fast” and remove the need to be careful. This requires making it safe to make mistakes, shortening the time required for an iterative step as much as possible, and having a fast way to judge whether a change is good enough to be worth keeping.
However this does not mean that such approaches are without cost. There are often tensions between “moving fast” and “breaking things”. Simple metrics like “time spent” or “company profit” or “language model perplexity” make it easier to unlock rapid iterative progress, but can mask costs like negative social consequences which are hard to measure on short time scales.
Are there simple principles that we should follow to manage such trade-offs? I’m not sure, but it seems like an important question.
Original post on substack here: https://messyprogress.substack.com/p/the-power-of-high-speed-stupidity