How I’m thinking about GPT-N

There has been a lot of hand-wringing about accelerating AI progress within the AI safety community since OpenAI’s publication of their GPT-3 and Scaling Laws papers. OpenAI’s clear explication of scaling gives researchers a justification to invest more in compute and a clear path forward for improving AI capabilities. Many in the AI safety community have rightly worried that this will lead to an arms race dynamic and faster timelines to AGI.
At the same time, there’s also an argument that the resources being directed towards scaling transformers might counterfactually have gone to other approaches (like reverse engineering the neocortex) that are more likely to lead to existentially dangerous AI. My own credence that scaling transformers is actually slowing the arrival of AGI is low, maybe 20%, but I think it’s important to weigh in.
There is also a growing concern within the AI safety community that simply scaling up GPT-3 by adding more data, weights, and training compute could lead to something existentially dangerous once a few other relatively simple components are added.
I have not seen the idea that scaling transformers will lead to existentially dangerous AI (after combining with a few other simple bits) defended in detail anywhere, but it seems very much an idea “in the water” based on the few discussions with AI safety researchers I have been privy to. It has been alluded to in various places online as well:
Connor Leahy has said that a sufficiently large transformer model could serve as a powerful world model for an otherwise dumb and simple reinforcement learning agent, allowing it to rapidly learn how to do dangerous things in the world. For the record, I think this general argument is a super important point and something we should worry about, even though in this post I’ll mainly be presenting reasons for skepticism.
Gwern is perhaps the most well-known promoter of scaling being something we should worry about. He says “The scaling hypothesis regards the blessings of scale as the secret of AGI: intelligence is ‘just’ simple neural units & learning algorithms applied to diverse experiences at a (currently) unreachable scale.”
Observe the title of Alignment Newsletter #156: “The scaling hypothesis: a plan for building AGI”. Note: I’m not sure what Rohin Shah’s views are exactly, but from what I read they are pretty nuanced.
Zac Hatfield-Dodds (who later went on to do AI Safety work at Anthropic) commented on LessWrong 16 July 2021: “Now it looks like prosaic alignment might be the only kind we get, and the deadline might be very early indeed.”
lennart: “The strong scaling hypothesis is stating that we only need to scale a specific architecture, to achieve transformative or superhuman capabilities — this architecture might already be available.”
MIRI is famously secretive about what they are doing, but they’ve been pretty public that they’ve made a shift towards transformer alignment as a result of OpenAI’s work. Eliezer Yudkowsky told me he thinks GPT-N plus “a few other things” could lead to existentially dangerous AI (personal communication that I believe is consistent with his public views as they were expressed recently in the published MIRI conversations).
I do think a GPT-N model or a close cousin could be a component of an existentially dangerous AI. A vision transformer could serve a role analogous to the visual cortex in humans. A GPT-type model trained on language might even make a good “System 1” for language, although I’m a little less certain about that. So it definitely makes sense to devote a substantial amount of resources to transformer alignment when thinking about how to reduce AI x-risk.
While I’ve seen a lot of posts making the bullish case on LessWrong and the EA Forum, I’ve seen fewer posts making a bearish case. The only ones I have seen are a series of insightful and interesting posts from nostalgebraist. [Interestingly, the bearish points I argue are very much distinct from the lines of attack nostalgebraist takes, so it’s worth looking at his posts too, especially his last one.] Another reason for writing this stems from my suspicion that too many AI safety resources are being put towards transformer alignment. Transformers are taking over AI right now, but I suspect they will be overtaken by a completely different architecture and approach soon (some strong candidates to take over in the near term are the Perceiver architecture, Hopfield networks, energy-based models, genetically/evolutionarily designed architectures, gated multi-layer perceptrons, and probably others I’m missing). The fact is we don’t really have any understanding of what makes a good architecture, and there is no good reason to think transformers are the final story. Some of the transformer alignment work (like dataset sanitization) may transfer to whatever architecture replaces transformers, but I don’t think we can predict with any certainty how much of it will transfer to future architectures and methods.
Given the number of AI safety orgs and academics already working on transformer alignment, I question whether it is a good investment for EAs on the current margin. A full discussion of neglectedness is beyond the scope of this post; however, you can look at this EA Forum post that touches on the academic contribution to transformer alignment, and I’ll note there is also much work on aligning transformers going on in industry.
Summary of main points
Transformers, like other deep learning models that came before, appear to work primarily via interpolation and have trouble finding theories that extrapolate. Having the capability to find theories that can extrapolate is at the very least a key to scientific progress and probably a prerequisite for existentially dangerous AI.
A recent paper shows CNNs have trouble grokking Conway’s Game of Life. Discussing the Rashomon effect, I make the case that grokking will be a pretty circumscribed / rare phenomenon.
The degree to which GPT-3 can do common sense reasoning seems extremely murky to me. I generally agree with people who have said GPT-3 mostly does System 1 type stuff, and not System 2 stuff.
There are numerous other problems with transformers which appear solvable in the near term, some of which are already well on their way to being solved.
The economic utility of very large transformer models is overhyped at the moment.
Hypothesis: transformers work by interpolation only
(Figure 1: Some double descent curves, from [1].)
Double descent is a phenomenon which is critical to understanding how deep learning models work. Figure 1 shows double descent curves for two language models from OpenAI’s “Deep Double Descent” paper,[1:1] which Evan Hubinger has summarized on LessWrong. Notice how the test loss first decreases, bottoms out, and then increases. The error bottoms out and starts to increase because of overfitting. This is the bias-variance trade-off, which can be derived from the classical theory of statistical modeling. Notice, however, that as model size continues to increase, the test loss curve bends back down. This is the double descent phenomenon. At large enough model size the test loss eventually becomes lower than it was in the regime where the bias-variance trade-off applied, although you can’t see it in this particular figure.
Notice that the double descent test loss curve peaks when the training loss bottoms out near zero. This is the interpolation threshold: the model has memorized the training data precisely, or nearly so (in CNNs it is typical for the training loss to reach exactly zero).
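To make the shape of the curve concrete, here is a minimal sketch (my own toy illustration, not from the paper) of parameter-wise double descent using random ReLU features with minimum-norm least squares; the test error typically spikes near the interpolation threshold (number of features ≈ number of training points) and then comes back down as the model keeps growing.

```python
# Minimal double-descent sketch: random ReLU features + minimum-norm least squares.
# Illustrative only; the peak near width ~ n_train can be noisy for tiny problems.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 5

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ np.ones(d) + 0.5 * np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def random_relu_features(X, W):
    return np.maximum(X @ W, 0.0)          # fixed random first layer, ReLU nonlinearity

for width in [5, 10, 20, 40, 80, 160, 640]:
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    Phi_tr, Phi_te = random_relu_features(X_tr, W), random_relu_features(X_te, W)
    # lstsq returns the minimum-norm solution once width > n_train (overparameterized regime)
    beta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"width={width:4d}  test MSE={test_mse:8.3f}")
```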
An important point about interpolation is that it works locally. Algorithms that work via interpolation are incapable of discovering global trends. My favorite illustration of this is the following:[2]
(Figure: from Hasson et al.[2:1])
No matter how many parameters or how much data you put into a neural network, it will never figure out that the underlying trend is y = x^2.
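This is easy to check for yourself. The sketch below (a toy example of my own, with arbitrary architecture and hyperparameters) trains a small MLP on y = x^2 over a limited range and then queries it well outside that range, where the piecewise-linear fit typically stops tracking the quadratic.

```python
# Toy demonstration: an MLP fits y = x^2 inside the training range but
# fails to extrapolate the quadratic trend outside it. Choices here are arbitrary.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, size=(2000, 1))
y_train = (x_train ** 2).ravel()

model = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                     max_iter=2000, random_state=0)
model.fit(x_train, y_train)

for x in [0.5, 1.5, 3.0, 5.0, 10.0]:        # the last three are outside [-2, 2]
    pred = model.predict(np.array([[x]]))[0]
    print(f"x={x:5.1f}   true x^2={x**2:7.1f}   MLP prediction={pred:7.1f}")
```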
What deep learning models appear to do, in effect, is dimensionality reduction onto a lower-dimensional manifold followed by piece-wise linear interpolation, which is very similar to k-nearest neighbors. If I understand things correctly, Trenton Bricken has shown something similar for transformers by drawing out a mathematical correspondence between the attention mechanism in transformers and sparse distributed memory, a high-level model of how memory works in the brain (the main difference is that transformer representations aren’t actually sparse).[3]
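A stripped-down way to see the connection: single-query softmax attention is just a similarity-weighted average over stored (key, value) pairs, i.e., a soft nearest-neighbor lookup. The snippet below is my own bare-bones illustration of that point, not the construction in Bricken and Pehlevan’s paper.

```python
# Single-query softmax attention viewed as a soft nearest-neighbor lookup.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_memories = 16, 100
K = rng.normal(size=(n_memories, d))     # stored keys ("memories")
V = rng.normal(size=(n_memories, d))     # stored values
q = rng.normal(size=d)                   # query

weights = softmax(K @ q / np.sqrt(d))    # similarity of the query to every key
attended = weights @ V                   # weighted average of values -> interpolation

# Compare with a hard 5-nearest-neighbor average under the same similarity score.
top5 = np.argsort(K @ q)[-5:]
knn_average = V[top5].mean(axis=0)
print("cosine(attention output, 5-NN average) =",
      attended @ knn_average / (np.linalg.norm(attended) * np.linalg.norm(knn_average)))
```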
At least three forms of double descent have been discovered. The first occurs as you increase the number of parameters. The second occurs during training—oddly enough, over the course of training a model’s test error can get better, then worse, and then better again! (It seems historically this was hidden by the widespread practice of early stopping.) The last occurs as more training data is added.
Why do I bring up these other forms of double descent? Mainly to point out that they are evidence these systems are very different from biological brains. Imagine working through some flashcards and then getting worse after a certain point. Or imagine a situation where adding more flashcards to the deck actually makes you worse at a language. These odd properties of transformers (which are shared with most if not all deep learning models) are clearly sub-optimal, which leads me to assign higher credence to the view that eventually transformers (and a lot of other deep learning stuff) will be replaced by something significantly different.
CNNs trained past the interpolation threshold memorize their training data (input-label relationships, assuming one-to-one correspondence). Memorization is a big part of how GPT-3 works, too. When unprompted, about 1% of the text produced by large language models is copied verbatim from the training corpus.[4] (As a reminder on some of the relevant numbers: GPT-3 has 175 billion parameters, and the raw training data was roughly 45 TB of text before filtering.) Using adversarial techniques it may be possible to extract specific data about people and other details that are in the training data.[5] It appears that as models get larger they memorize more—the extraction of people’s names and personal information from much smaller models like BERT was found to be difficult. OpenAI’s Codex seems to utilize a lot of memorization, often returning verbatim small code samples that were in the training data from GitHub. Some memorization is of course necessary and important (for instance, models need to memorize how to spell words). However, when a lot of a model’s capabilities come from memorization, I tend to be less impressed. On the other hand, perception and System 1 in the brain also seem to rely on a lot of brute-force memorization and interpolation.[2:2]
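As an aside, checking for this kind of verbatim copying is mechanically simple. The sketch below is a naive illustration of the idea (nothing like the actual deduplication or extraction pipelines in the cited papers): flag any n-gram in a model’s output that also appears word-for-word in the training corpus.

```python
# Naive verbatim-memorization check: flag n-grams in generated text that
# appear word-for-word in the training corpus. Purely illustrative.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copied_spans(generated: str, corpus: str, n: int = 8):
    gen_tokens = generated.split()
    corpus_ngrams = ngrams(corpus.split(), n)
    return [" ".join(gen_tokens[i:i + n])
            for i in range(len(gen_tokens) - n + 1)
            if tuple(gen_tokens[i:i + n]) in corpus_ngrams]

corpus = "the quick brown fox jumps over the lazy dog while the cat sleeps on the windowsill"
generated = "yesterday the quick brown fox jumps over the lazy dog while the cat napped"
print(copied_spans(generated, corpus, n=6))   # prints the verbatim 6-word overlaps
```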
Sometimes GPT-3’s interpolation abilities can be quite impressive. For instance, Alyssa Vance gave the prompt “Early this morning, in a shocking surprise attack, the international credit card and finance company Visa launched a full-scale invasion of the island nation of Taiwan” and GPT-3’s output is quite impressive. GPT-2’s “extrapolation” of Ginsberg’s Moloch is also quite impressive. However, this is only extrapolation in a loose sense; a different way of looking at it may be “interpolation within the space of Moloch and Moloch-like sentences”. In general, though, it appears that transformers struggle to find models/theories/explanations that extrapolate, that reach outside the context they were discovered in to give a truly new prediction. The best examples of such theories are in science. The generation of such theories seems to often require a creative leap and can’t be done just by brute-force fitting or induction (more on this below). A different way of saying this is that by optimizing for an objective (like next-word prediction) you don’t explore the landscape of possible models/theories enough (cf. Kenneth Stanley’s well-known arguments about this). [My favorite example of a creative leap, by the way, is when the Greek astronomer Aristarchus hypothesized that the stars are glowing orbs like the sun, just very far away.]
To give a simple example of the distinction I’m trying to flesh out here: Galileo observed that the period of a pendulum doesn’t depend on the amount of mass attached to it but does depend on the length (longer length = longer period), which are two high-level rules / principles that extrapolate. Could a GPT-N model, reading about the properties of pendulums, come up with similar rules and apply them consistently? I have a hard time believing that it would, and in the event that it could, it would probably require a ton of training data. On the other hand, humans had trouble discovering this simple law too (pendulums, I think, were around long before Galileo). A better thing to look at here is how GPT-3 particularly struggles with multiple-choice conceptual physics questions at the ~high school / early college level, achieving only 35% accuracy (random guessing = 25%). For college-level physics questions it does just barely better than random chance.[6] Learning how to think abstractly and apply a small number of powerful rules and principles to an infinite number of diverse situations is the key to doing physics. My guess is the equations and principles of physics were in GPT-3’s training data, along with some physics problems; they were just a tiny part, so it didn’t prioritize them much.
To try to summarize these various points, I think there’s a fairly strong argument that GPT-3, like other deep learning models, works mainly via some form of interpolation between stuff in its training data, and this constitutes a significant limitation which makes me less concerned about a scaled up GPT-like model being existentially dangerous.
There is an important exception to all this, however, where GPT-N does discover rules that extrapolate, called “grokking”:
Why I’m not so worried about grokking and emergent behavior during scaling
For most tasks in the GPT-3 paper, performance scales smoothly with model size. For a few, however, there are sudden jumps in performance. The tasks exhibiting significant jumps were addition, subtraction, and symbol substitution. Labeling these jumps “phase changes” is a terrible abuse of terminology—on closer inspection they are not at all discontinuous jumps, and the term misleadingly suggests the emergence of a new internal order (phase changes should occur uniformly throughout a medium/space—the interpolation threshold in double descent may be a sort of phase change, but not grokking).
Note added shortly after publication: I forgot to mention unpublished results on BIG-Bench which showed a rapid jump for IPA translation—technically not grokking, but an unexpected jump nonetheless (see the LessWrong discussion on this here). Also, there are more examples of unexpected jumps in performance here from Jacob Steinhardt.
More recently it has been shown that with enough training data and parameters simple transformer models can learn how to reproduce certain mathematical transformations exactly.[7] During training, the models exhibit jumps upwards to 100% accuracy, with varying degrees of sharpness in the jump. The authors call this “grokking”. The set of transformations they studied involved addition, subtraction, multiplication, and the modulo operator.
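For concreteness, the tasks in the grokking paper look roughly like the sketch below: build the full table of a binary operation mod a prime p, show the network a random fraction of the table, and hold out the rest. (The modulus and split fraction here follow the paper’s setup only loosely; the interesting results are in the architecture, optimizer, and training-length details, which I omit.)

```python
# Sketch of a grokking-style dataset: the equation table for (a + b) mod p,
# randomly split into a training fraction and a held-out fraction.
import random

p = 97                      # a prime modulus of the size used in the paper's experiments
train_fraction = 0.5        # the paper sweeps this fraction

equations = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
random.Random(0).shuffle(equations)

n_train = int(train_fraction * len(equations))
train_set, test_set = equations[:n_train], equations[n_train:]
print(len(train_set), "training equations;", len(test_set), "held out")
print("example equation (a, b, (a+b) mod p):", train_set[0])
```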
As a result of these findings, AI safety researchers are worried about unexpected emergent behavior appearing in large models as they are scaled up.[8]
Here’s the thing about grokking though—for the network to grok (get perfect accuracy), the architecture has to be able to literally do the algorithm and SGD has to find it. In the case of transformers, that means the algorithm must be easily decomposable into a series of matrix multiplies (it appears repeated matrix multiplication may be Turing complete, so that’s why I stress easily). Notice that all the examples of grokking with transformers involve simple operations that can be decomposed into things like swapping values or arithmetic, which can be easily expressed as a series of matrix multiplications. Division is notably absent from both the grokking and GPT-3 papers; I wonder why...
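To illustrate what “easily expressed as a series of matrix multiplications” means, here is a trivial toy example of my own: swapping two entries of a vector is one multiplication by a permutation matrix, and adding a constant mod n to a one-hot-encoded digit is likewise a single linear map. (This says nothing about what trained transformers actually do internally.)

```python
# Swapping two entries of a vector is a single multiply by a permutation matrix.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
P = np.eye(4)
P[[0, 3]] = P[[3, 0]]        # permutation matrix that swaps positions 0 and 3
print(P @ x)                 # [40. 20. 30. 10.]

# Adding a fixed constant c mod n to a one-hot-encoded digit is also a matrix multiply.
n, c = 10, 3
shift = np.roll(np.eye(n), c, axis=0)    # circulant "add c mod n" matrix
digit = np.zeros(n)
digit[8] = 1                             # one-hot encoding of the digit 8
print(np.argmax(shift @ digit))          # 1, since (8 + 3) % 10 == 1
```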
But grokking doesn’t always work, even when we know that the network can do the thing easily in principle. This was shown in a recent paper by Jacob M. Springer and Garrett T. Kenyon.[9] (I did a summer internship with Dr. Kenyon in 2010 and can vouch for his credibility.) The authors set up a simple CNN architecture that in principle can learn the rules for Conway’s Game of Life, so that given an input board state the CNN can reproduce the Game of Life exactly, given the right parameters. The network was trained on over one million randomly generated examples, but despite all this data it could not learn the exact solution. In fact, the minimal architecture couldn’t even learn how to predict just two steps out! They then tested what happens when they duplicate the filter maps in several layers, creating m times as many weights as are necessary. They found that the degree of overcompleteness m required scaled very quickly with the number of steps the network had to predict.
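The “in principle” part is easy to verify by hand: one Game of Life step is a neighbor-count convolution followed by a simple pointwise rule. The sketch below is a hand-coded check that such a representation exists, not the learned CNN from the paper.

```python
# One Game of Life step written as a convolution plus a pointwise rule, showing
# that a small CNN can represent the update exactly if given the right weights.
import numpy as np
from scipy.signal import convolve2d

def life_step(board: np.ndarray) -> np.ndarray:
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])                      # counts live neighbors
    neighbors = convolve2d(board, kernel, mode="same", boundary="fill")
    # A cell lives if it has 3 live neighbors, or is alive and has exactly 2.
    return ((neighbors == 3) | ((board == 1) & (neighbors == 2))).astype(int)

glider = np.zeros((8, 8), dtype=int)
glider[1, 2] = glider[2, 3] = glider[3, 1] = glider[3, 2] = glider[3, 3] = 1
print(life_step(glider))
```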
The authors argue that their findings are consistent with the Lottery Ticket Hypothesis (LTH) that deep neural nets must get lucky by having a subset of initial parameters that are close enough to the desired solution. In other words, SGD alone can’t always find the right solution—some luck is involved in the initial parameter settings—which explains why bigger models with a larger pool of parameters to work with do better. (I feel compelled to mention that attempts to validate the LTH have produced a mixed bag of murky results and it remains only a hypothesis, not a well-established theory or principle.)
There is another important fact about data modeling that implies grokking, or even semi-grokking, will be exceptionally rare in deep learning models—the Rashomon effect, first described by Leo Breiman.[10] The effect is simply the observation that for any dataset there is a multitude of functions which fit it about equally well (indeed, infinitely many that fit it exactly) but which are mechanistically very different from each other. In his original paper, Breiman demonstrates this effect empirically by training a bunch of decision trees which all get equivalent accuracy on a test set but work very differently internally. Any model that works by fitting a ton of parameters to large data is subject to the Rashomon effect. The Rashomon effect implies that in the general case SGD is very unlikely to converge to the true model—i.e., very unlikely to grok. In fact, I doubt SGD would even find a good approximation to the true model. (By “true model” I mean whatever algorithm or set of equations is generating the underlying data.)
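Breiman’s demonstration is easy to reproduce in miniature. The sketch below (toy data and hyperparameters are my own choices) trains several decision trees that reach roughly the same test accuracy yet disagree noticeably on individual predictions.

```python
# Rashomon effect in miniature: equally accurate trees that disagree case by case.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

preds = []
for seed in range(5):
    # Bootstrap resampling plus a different seed yields a mechanistically different tree.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_depth=8, random_state=seed).fit(X_tr[idx], y_tr[idx])
    preds.append(tree.predict(X_te))
    print(f"tree {seed}: test accuracy = {np.mean(preds[-1] == y_te):.3f}")

disagreement = np.mean(preds[0] != preds[1])
print(f"fraction of test points where tree 0 and tree 1 disagree: {disagreement:.3f}")
```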
Solomonoff induction tries to avoid “Rashomon hell” by biasing the Bayesian updating towards models with shorter algorithmic descriptions, with the assumption that shorter descriptions are always closer to the truth. [Side note: I’m skeptical of Occam’s razor, and how well this strategy works in any real-world setup is, to my knowledge, rather poorly understood, which is just one of many reasons Solomonoff induction is a bad model for intelligence in my view. (Note: sorry to be so vague. A review of problems with Solomonoff induction will be the subject of a future post/article at some point.)]
Even if biasing towards simpler models is a good idea, we don’t have a good way of doing this in deep learning yet, apart from restricting the number of parameters, which usually hurts test set performance to some degree. [Clarification: regularization methods bias towards simpler models that are easier to approximate, but they don’t really reduce the amount of compute needed to run a model in terms of FLOPs.] It used to be thought that SGD sought out “flat minima” in the loss (minima with low curvature), which result in simpler models in terms of how compressible they are, but further studies have shown this isn’t really true.[11] So we have reasons to believe transformers will be subject to the Rashomon effect and grokking will be very hard. (Sorry this section was rather sloppy—there are a lot of papers showing SGD leads to flatter minima, which are associated with better generalization ability. I still think there’s a potential argument here, though, since empirically it seems deep learning is very subject to the Rashomon effect—it’s not uncommon for the same model trained with different random initializations to achieve similar training/test loss but work differently internally and have different failure modes, etc.)
The big debate—to what extent does GPT-3 have common sense?
I don’t have a strong interest in wading through the reams of GPT-3 outputs people have posted online, much of which I suspect has been hand-picked to fit whatever narrative the author was trying to push. Reading and analyzing GPT-3 prose is not my cup of tea, and Gwern has already done it far more thoroughly than I ever could.
I think the failures are much more illuminating than the successes, because many of the failures are ones a human would never make (for instance answering “four” to “how many eyes does a horse have”). Just as humans are easy to mislead with the Cognitive Reflection Test, especially when sleep deprived or tired, GPT-3 is very easy to mislead too, sometimes embarrassingly so. My favorite examples of this come from Alyssa Vance, yet more can be found in Marcus and Davis’ MIT Tech Review article.
It seems GPT-3, like its predecessor GPT-2, has some common sense, but mainly only the System 1, gut-reaction type—it still struggles with common sense reasoning. Many have made this observation already, including both Sarah Constantin and Scott Alexander in the context of GPT-2. (As a side note, I highly recommend people read Sarah’s brilliant disquisition on System 1 vs System 2 entitled “Distinctions in Types of Thought”.)
Issues that seem solvable
There are some issues with transformers that appear very solvable to me and are in the process of being solved:
The first is lack of truthfulness. GPT-3 is great at question answering; the issue is that it’s often plausible but wrong (see Alyssa Vance’s post “When GPT-3 Is Confident, Plausible, And Wrong”). Part of this is due to a garbage-in, garbage-out problem with transformers right now, where they mimic human falsehoods that are in their training data.[12] Another issue is just not having enough memory to memorize all the relevant facts people may want to ask about. DeepMind seems to have solved the latter issue with their Retrieval-Enhanced Transformer (RETRO), which utilizes a 2-trillion-token database.[13]
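The retrieval idea itself is simple, even if RETRO’s integration of it into the transformer is not. The sketch below is a bare-bones stand-in (TF-IDF instead of learned embeddings, a three-sentence “database”): embed the query, pull the most similar stored chunk, and condition the model on it.

```python
# Bare-bones retrieval augmentation: fetch the most similar stored chunk and
# prepend it to the prompt. A stand-in for the idea behind RETRO, not its method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "The capital of France is Paris.",
    "The pendulum period depends on its length, not its mass.",
    "GPT-3 has 175 billion parameters.",
]
query = "How many parameters does GPT-3 have?"

vectorizer = TfidfVectorizer().fit(chunks + [query])
chunk_vecs, query_vec = vectorizer.transform(chunks), vectorizer.transform([query])
scores = cosine_similarity(query_vec, chunk_vecs)[0]
top_chunk = chunks[scores.argmax()]

prompt = f"Context: {top_chunk}\nQuestion: {query}\nAnswer:"
print(prompt)   # this augmented prompt would then be fed to the language model
```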
A related issue is lack of coherence / lack of calibration. An optimal Bayesian agent considers all possibilities all the time, but any agent with finite resources can’t afford to do that—real-world agents have finite memory, so they have to figure out when to forget disproven theories/facts/explanations. In the context of resource-bounded systems, it may be best to stick with a single best explanation rather than trying to hold multiple explanations. [As an example, it seems reasonable to disregard old scientific theories once they have been robustly falsified, even though from a Bayesian perspective they still have a tiny amount of non-zero probability attached to them.] Indeed, the human brain seems to have an in-built bias against holding multiple contradictory theories at once (cognitive dissonance). Transformers, on the other hand, often give conflicting answers to similar questions, or even to the same question when prompted multiple times. In other situations it makes sense for resource-bounded agents to keep track of multiple theories and weight them in a Bayesian manner. Just as CNNs are not well-calibrated for mysterious reasons, I suspect transformers are not well calibrated either. However, just as there are methods for fixing calibration in CNNs, I suspect there are methods to fix calibration in transformers too.
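One standard fix for miscalibration in CNNs is temperature scaling: fit a single scalar T on held-out logits to minimize negative log-likelihood, then divide all logits by T at test time. The sketch below applies it to synthetic, deliberately overconfident logits; whether it transfers cleanly to transformer question answering is exactly the open question.

```python
# Temperature scaling sketch: fit one scalar T on validation logits to minimize NLL.
import numpy as np
from scipy.optimize import minimize_scalar

def nll(logits, labels, T):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)
# Synthetic "overconfident" validation logits: correct class favored, then sharpened.
logits = rng.normal(size=(1000, 5))
logits[np.arange(1000), labels] += 2.0
logits *= 3.0                                    # exaggerate confidence

result = minimize_scalar(lambda T: nll(logits, labels, T),
                         bounds=(0.05, 10.0), method="bounded")
print("fitted temperature:", round(result.x, 2))  # typically > 1 here: overconfident
```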
Another issue is lack of metacognition, or alerting the user about confidence. This is a big problem right now, since humans want a question-answering system to give correct answers and to know when it doesn’t know something or isn’t sure. Interestingly, Nick Cammarata figured out that with careful prompting GPT-3 can identify nonsense questions (whether this counts as metacognition isn’t very clear). I think this is solvable by tweaking RETRO so it alerts the user when something isn’t in its database (maybe it already does this?). As with models like CNNs, where uncertainty can be added via dropout during inference or by adopting Bayesian training, there are probably ways to add uncertainty quantification to transformers. MIRI’s “visible thoughts” approach is another way of attacking this problem.
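For reference, the dropout-at-inference idea (often called MC dropout) looks like the sketch below: leave dropout on at test time, run many stochastic forward passes, and report the spread as an uncertainty estimate. The weights here are random stand-ins for a trained network.

```python
# MC-dropout sketch: keep dropout active at inference and treat the spread of
# repeated stochastic forward passes as an uncertainty estimate. Toy weights only.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(1, 32)), rng.normal(size=(32, 1))   # tiny fixed "trained" MLP

def forward(x, drop_rate=0.5):
    h = np.maximum(x @ W1, 0.0)
    mask = rng.random(h.shape) > drop_rate           # dropout stays on at inference
    return (h * mask / (1 - drop_rate)) @ W2

x = np.array([[0.7]])
samples = np.array([forward(x)[0, 0] for _ in range(200)])
print(f"prediction = {samples.mean():.3f} +/- {samples.std():.3f}")
```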
Another issue is very weak compositionality. Like the RNNs which came before,[14] transformers are really not good at composition, i.e., chaining together a sequence of discrete tasks in a way they haven’t seen before. Look, for instance, at how bad OpenAI’s Codex model is at chaining together components:[15]
(Figure: from the OpenAI Codex paper.[15:1])
This is very different from human behavior, where the ability to accurately chain together two things implies the ability to accurately chain together a long sequence of things. Intuitively, this seems solvable, at least for many applications of interest, by writing ad-hoc hard-coded methods to detect when chaining is needed and then perform it.
The final issue is bias/toxicity. This problem is addressable both through dataset sanitization and via de-biasing word embeddings.[16] There have recently been a number of papers discussing and making progress on this.[17][18][19]
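The word-embedding debiasing of Bolukbasi et al. boils down to a projection: estimate a bias direction and remove each word vector’s component along it. Below is a minimal sketch with made-up vectors and a single definitional pair (the paper uses several pairs plus PCA and additional steps such as equalization).

```python
# Hard-debiasing sketch (after Bolukbasi et al.): remove the component of each
# word vector along an estimated bias direction. Vectors here are made up.
import numpy as np

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=8) for w in ["doctor", "nurse", "he", "she"]}

# Estimate the bias direction from one definitional pair (the paper uses several + PCA).
g = embeddings["he"] - embeddings["she"]
g = g / np.linalg.norm(g)

def debias(v, direction):
    return v - (v @ direction) * direction       # project out the bias component

for word in ["doctor", "nurse"]:
    before = embeddings[word] @ g
    after = debias(embeddings[word], g) @ g
    print(f"{word}: component along bias direction {before:+.3f} -> {after:+.3f}")
```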
Aside: prediction vs explanation
“For even in purely practical applications, the explanatory power of a theory is paramount, and its predictive power only supplementary. If this seems surprising, imagine that an extraterrestrial scientist has visited the Earth and given us an ultra-high-technology “oracle” which can predict the outcome of any possible experiment but provides no explanations. According to the instrumentalists, once we had that oracle we should have no further use for scientific theories, except as a means of entertaining ourselves. But is that true? How would the oracle be used in practice? In some sense it would contain the knowledge necessary to build, say, an interstellar spaceship. But how exactly would that help us to build one? Or to build another oracle of the same kind? Or even a better mousetrap? The oracle only predicts the outcomes of experiments. Therefore, in order to use it at all, we must first know what experiments to ask it about. If we gave it the design of a spaceship, and the details of a proposed test flight, it could tell us how the spaceship would perform on such a flight. But it could not design the spaceship for us in the first place. And if it predicted that the spaceship we had designed would explode on takeoff, it could not tell us how to prevent such an explosion. That would still be for us to work out. And before we could work it out, before we could even begin to improve the design in any way, we should have to understand, among other things, how the spaceship was supposed to work. Only then could we have any chance of discovering what might cause an explosion on takeoff. Prediction – even perfect, universal prediction – is simply no substitute for explanation.”—David Deutsch, The Fabric of Reality
Of course, one could also ask a truly God-like oracle to predict how a human would write an instruction manual for building a spaceship, and then just follow that. The point of quoting this passage is to distinguish prediction from understanding. I don’t want to wade into the deep philosophical waters about what ‘explanation’ is, the Chinese Room, and all the rest. Rather, I just want to convince the reader that for the purpose of thinking about what GPT-N models can and can’t do, the distinction is real and important. Next word prediction is not everything. When we relentlessly optimize deep learning models only on predictive accuracy, they take shortcuts. They learn non-robust features, making them prone to adversarial examples. They memorize individual cases rather than trying to extract high-level abstract rules. And they then suffer when applied out of distribution.
Final thoughts—transformers are overhyped, at least right now
“We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run”—Roy Amara (“Amara’s Law”)
The debut of GPT-3 in May 2020 was accompanied by a lot of hype about how it would lead to a boom in startups and various economic activity. As far as I can tell, no company is actually making a profit with GPT-3 yet (I have Googled extensively and asked on Twitter about this multiple times; if you know an example, please comment below). It wasn’t until June 2021 that Microsoft themselves released their first commercial product that uses GPT-3, when they integrated a GPT-3-like model into Power Apps. The system allows users to put in a natural language input and get back a string of code in a bespoke language developed at Microsoft called “Power Fx”. The resulting code can do things like manipulate Excel spreadsheets. This is cool, but also a bit underwhelming relative to the hype. In December 2021, a South Korean company called Naver said they were starting to use a larger language model (trained on 6,500 times more tokens than GPT-3) to help with product recommendations. This is also neat but underwhelming.
There is a pattern in AI where there is huge buzz around cool demos and lab demonstrations, which then hits a brick wall during deployment. I see this all the time in my own field of AI for medical imaging. People drastically underestimate the difficulty of deploying things into the real world (AI systems that can easily be plugged into existing systems online, like for targeting ads, are a somewhat different matter). This is one of the skeptical arguments from Rodney Brooks I agree with (for his argument, see section 7 here). The compute costs of training and running inference with GPT-like models also present significant headwinds to translation into real-world use. Thompson et al. have argued that barring significant algorithmic improvements, hardware and compute costs will soon be fatal to the entire enterprise of scaling.[20][21] However, I am skeptical about the conclusions of their work, since it appears to me they didn’t factor in Moore’s law well enough or the possibility of special-purpose hardware. See also Gwern’s comments in the comments section here.
As far as I can tell, in the next year we will see the following applications move from the lab to commercialization and real-world use:
incrementally better NPCs in videogames
incrementally better text summarization for things like product reviews or press releases
incrementally better translation
better code completion
Acknowledgements
Thank you to Stephen “Cas” Casper for proofreading an earlier draft of this post and providing useful comments.
References
Nakkiran et al. “Deep Double Descent: Where Bigger Models and More Data Hurt”. 2019.
Hasson et al. “Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks”. Neuron. 105(3). pages 416-434. 2020.
Bricken, Trenton and Pehlevan, Cengiz. “Attention Approximates Sparse Distributed Memory”. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 34. 2021.
Lee, et al. “Deduplicating Training Data Makes Language Models Better”. arXiv e-prints. 2021.
Carlini et al. “Extracting Training Data from Large Language Models”. In Proceedings of the 30th USENIX Security Symposium. 2021.
Hendrycks et al. “Measuring Massive Multitask Language Understanding”. In Proceedings of the International Conference on Learning Representations (ICLR). 2021.
Power et al. “Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”. In Proceedings of the 1st Mathematical Reasoning in General Artificial Intelligence Workshop, ICLR. 2021.
Steinhardt, Jacob. “On The Risks of Emergent Behavior in Foundation Models”. 2021.
Springer, J. M., & Kenyon, G. T. “It’s Hard for Neural Networks to Learn the Game of Life”. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN). 2021. (arXiv version here)
Breiman, Leo. “Statistical Modeling: The Two Cultures”. Statistical Science. 16(3), pp. 199-231. 2001.
Dinh et al. “Sharp Minima Can Generalize For Deep Nets”. 2017.
Lin et al. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”. arXiv e-prints. 2021.
Borgeaud et al. “Improving language models by retrieving from trillions of tokens”. arXiv e-prints. 2021.
Lake et al. “Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks”. In Proceedings of the 35th International Conference on Machine Learning (ICML). 2018.
Chen et al. “Evaluating Large Language Models Trained on Code”. arXiv e-prints. 2021.
Bolukbasi et al. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS). 2016.
Askell, et al. “A General Language Assistant as a Laboratory for Alignment”. arXiv e-prints. 2021.
Welbl et al. “Challenges in Detoxifying Language Models”. In Findings of EMNLP. 2021.
Weidinger, et al. “Ethical and social risks of harm from Language Models”. arXiv e-prints. 2021.
Thompson et al. “Deep Learning’s Diminishing Returns”. IEEE Spectrum. 2021.
Thompson et al. “The Computational Limits of Deep Learning”. arXiv e-prints. 2020.