Abstract
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
Dumping more of the paper’s contents in the hope that it encourages people to look at the paper in more detail:
1 Introduction
1.1 Our approach to studying GPT-4’s intelligence
1.2 Organization of our demonstration
2 Multimodal and interdisciplinary composition
2.1 Integrative ability
2.2 Vision
2.2.1 Image generation beyond memorization
2.2.2 Image generation following detailed instructions (à la Dall-E)
2.2.3 Possible application in sketch generation
2.3 Music
3 Coding
3.1 From instructions to code
3.1.1 Coding challenges
3.1.2 Real world scenarios
3.2 Understanding existing code
4 Mathematical abilities
4.1 A mathematical conversation with GPT-4
4.1.1 A first generalization of the original question
4.1.2 A second variant of the original question
4.1.3 Analysis of the limitations highlighted by conversation
4.2 Performance on mathematical problem datasets
4.3 Mathematical modeling in various domains
4.4 Higher level mathematics
5 Interaction with the world
5.1 Tool use
5.1.1 Using multiple tools to solve more complex tasks
5.1.2 Discussion
5.2 Embodied Interaction
5.2.1 Warmup: navigating a map
5.2.2 Text-based games
5.2.3 Real world problems
5.2.4 Discussion
6 Interaction with humans
6.1 Understanding Humans: Theory of Mind
6.1.1 Testing specific aspects of theory of mind
6.1.2 Testing theory of mind in realistic scenarios
6.1.3 Discussion
6.2 Talking to Humans: Explainability
7 Discriminative Capabilities
7.1 PII Detection
7.2 Misconceptions and Fact-Checking
7.2.1 Why Are Current Metrics Insufficient?
7.2.2 GPT-4 as a Judge
8 Limitations of autoregressive architecture highlighted by GPT-4
8.1 Warm-up with two basic examples
8.2 Lack of planning in arithmetic/reasoning problems
8.3 Lack of planning in text generation
9 Societal influences
9.1 Challenges of erroneous generations
9.2 Misinformation and manipulation
9.3 Bias
9.4 Human expertise, jobs, and economics
9.5 Constellation of influences and considerations
10 Directions and Conclusions
10.1 Definitions of intelligence, AI, and AGI
10.2 On the path to more general artificial intelligence
10.3 What is actually happening?
A GPT-4 has common sense grounding
B Appendix for multimodal and interdisciplinary composition
B.1 Further details on integrative ability results
B.2 Further details on vision results
B.3 Graphic novel design example
C Appendix for the Coding section
C.1 Measuring human performance on LeetCode
C.2 Example of GPT-4 visualizing IMDb data
C.3 More examples on visualization
C.4 Example for 2D HTML game development
C.5 Example for graphical user interface programming
C.6 Example for reverse engineering
C.7 Testing GPT-4’s ability to execute (pseudo) code
D Additional examples for mathematical reasoning
D.1 Limitations
D.2 Further examples
D.3 Generating math problems with GPT-4
D.4 Mitigating calculation errors via external code execution
E Additional Interpretability Examples
E.1 Explanation Agent Mismatches
F Additional examples for interaction with the world
F.1 Interact with tools
F.2 Examples for interaction with environments
I’m honestly stunned by this. If it was indeed trained solely on text, how does it end up with such a good idea of how Euclidean space works? That’s either stupidly impressive, or a possible hint that the set of natural abstractions is even smaller and a bigger attractor in algorithm space than I thought. The labyrinth seems explicable, but the graphics?
Could a born blind human do this?
With enough training, sure. There are such things as born blind human painters.
Thanks, I did not know this. A quick search for images by such painters seems to show that they get colour and perspective right at least as well as this does, provided the work is fully real and there’s nobody else in the process choosing colours and such. Tentatively marking this down as a win for natural abstraction.
There’s a fuckton of descriptions of images in text I guess.
And it’s consumed trillions of tokens.
It’s not just blind. It essentially has no senses whatsoever. It seems to have extrapolated “sense” from text data.
Perhaps of interest to this community: GPT-4 using a Linux terminal to iteratively locate and infiltrate a poorly secured machine on a local network:
Saying the quiet part out loud, I see!
It is followed by this sentence, though, which is the only place in the 154-page paper that even remotely hints at critical risks:
Very scarce references to any safety work, except the GPT-4 report and a passing mention of some interpretability papers.
Overall, I feel like the paper is a shameful exercise in not mentioning the elephant in the room. My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle. It’s still not a good excuse.
I think an equally if not more likely explanation is that these particular researchers simply don’t happen to be that interested in alignment questions, and thought “oh yeah we should probably put in a token mention of alignment and some random citations to it” when writing the paper.
Which is somehow worse than doing it for corporate reasons.
Not allowing cycles of learning sounds like a bound on capability, but it might be a bound on capability of the part of the system that’s aligned, without a corresponding bound on the part that might be misaligned.
GPT-4 can do a lot of impressive things without thinking out loud with tokens in the context window, so where does this thinking take place? Probably in the layers updating the residual stream. There are enough layers now that a sequence of their applications might be taking on the role of a context window for performing chain-of-thought reasoning, which is non-interpretable and doesn’t imitate human speech. This capability is being trained during pre-training, as the model is forced to predict the dataset.
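To make the residual-stream picture concrete, here is a minimal numpy sketch, with toy dimensions and random weights standing in for trained ones (real transformers also attend across positions, use layer norms, and so on, all omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS = 8, 4
W_attn = [0.1 * rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]
W_mlp = [0.1 * rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]

x = rng.standard_normal(D_MODEL)      # residual stream at one token position
for i in range(N_LAYERS):
    x = x + x @ W_attn[i]             # attention block writes an update into the stream
    x = x + np.tanh(x @ W_mlp[i])     # MLP block writes another update
# The number of serial computation steps per forward pass is capped by
# N_LAYERS; emitting tokens into the context window is the only way to
# get more serial steps.
```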
But the corresponding capability for deliberative reasoning in tokens is not being trained. The closest thing to it in GPT-4 is the mitigation of hallucinations (see the 4-step algorithm in section 3.1 of the System Card portion of the GPT-4 report), and it’s nowhere near general enough.
This way, the inscrutable alien shoggoth is on track to wake up, while human-imitating masks that are plausibly aligned by default are being held back in situationally unaware confusion in the name of restricting capabilities for the sake of not burning the timeline.
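For concreteness, that System Card mitigation has the shape of a generate-check-rewrite loop. The sketch below is my paraphrase, not the exact algorithm or prompts (those are in section 3.1 of the System Card), and it assumes the pre-1.0 `openai` Python client with an API key already configured:

```python
import openai  # pre-1.0 client; assumes OPENAI_API_KEY is set in the environment

def llm(prompt: str) -> str:
    """One model call; use whatever model you have access to."""
    resp = openai.ChatCompletion.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp["choices"][0]["message"]["content"]

def answer_with_self_check(question: str, max_rounds: int = 3) -> str:
    answer = llm(question)
    for _ in range(max_rounds):
        issues = llm(f"List any hallucinations in this answer to {question!r}, "
                     f"or reply NONE:\n{answer}")
        if issues.strip() == "NONE":
            break  # the checker found nothing left to fix
        answer = llm(f"Rewrite this answer to {question!r}, removing the listed "
                     f"hallucinations:\n{issues}\n\nAnswer:\n{answer}")
    return answer
```

The relevant property is that the deliberation happens in tokens, outside the model, where it can be read and audited.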
I expected downvotes (it is cheeky and maybe not great for fruitful discussion), but instead I got disagreevotes. Big company labs do review papers for statements that could hurt the company! It’s not a conspiracy theory to suggest this shaped the content in some ways, especially the risks section.
On the other hand, I tried having GPT-4 play this game (selected mostly as a parser-based game that was new/niche enough not to be in its training set) and it didn’t particularly impress me with its intelligence:
When I was a child, I literally tried the same ineffective actions like 40 times in similar games, so I felt a bit for GPT given you only let it try the ineffective actions a few times. Therefore, I tried the same test with ChatGPT-4 and let it use all of my 3-hour limit of 25 messages (EDIT: generated 25 more moves):
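For anyone who wants to replicate this, the harness can be as simple as relaying text by hand between the game and the model. A minimal sketch, reusing the hypothetical `llm` wrapper from above, with the move budget matching the 25-message cap:

```python
history = []
for move in range(25):  # matches the 25-message limit mentioned above
    game_text = input("paste game output> ")
    history.append(f"GAME: {game_text}")
    command = llm("You are playing a parser-based text adventure. "
                  "Reply with exactly one game command.\n" + "\n".join(history))
    history.append(f"YOU: {command}")
    print("model's move:", command)
```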
Table 2, page 21 → (above) human-level performance on LeetCode.
Might be caused mostly by data leaks (training set contamination).
Probably not. From the paper: “We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period.”
Good point. It’s a bit weird that performance on easy Codeforces questions is so bad (0/10) though.
https://twitter.com/cHHillee/status/1635790330854526981
LeetCode questions are not selected for novelty. In fact, the best way to get a problem turned into a LeetCode question is to post it to LeetCode’s discussion board and say someone asked you it in an interview at a big tech company. So it’s still possible that some or even many of these questions appear nearly verbatim in the training data.
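For what it’s worth, the date filter itself is cheap to apply. A sketch with hypothetical records (real post dates would have to be scraped from LeetCode), using the paper’s October 8th, 2022 cutoff:

```python
from datetime import date

CUTOFF = date(2022, 10, 8)  # end of GPT-4's pretraining period, per the paper
problems = [  # hypothetical records for illustration only
    {"slug": "two-sum", "posted": date(2015, 8, 10)},
    {"slug": "some-post-cutoff-problem", "posted": date(2022, 11, 1)},
]
fresh = [p for p in problems if p["posted"] > CUTOFF]
# This rules out verbatim leakage of the problem statements, but not
# near-duplicates of older problems, which is exactly the loophole above.
```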
Deep chain-of-thought reasoning and mathematical reasoning are some of its downfalls. Are the models able to build good enough abstractions inside themselves to resolve arbitrarily long (even if not complex) math/logical problems?
I think this is a key question. I think the answer for transformer LLMs without an external memory store is currently “No, not arbitrarily long computations.” I have looked into this in my research and found some labs doing work that would enable this. So it’s more a question of when the big labs will decide to integrate these ideas developed by academic groups into SoTA models, not a question of whether it’s possible. There are a lot of novel capabilities like this that have been demonstrated to be possible but not yet integrated, even after trying to rule out those that might be blocked by poor scaling/parallelizability.
So the remaining hurdles seem to be more about engineering solutions to integration challenges than about innovation and proof-of-concept.
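As an illustration of the external-memory idea (the protocol below is invented for this comment, not taken from any particular lab’s work): keep the model’s visible window bounded, but let an outer loop persist notes indefinitely, again assuming the `llm` wrapper from above.

```python
def solve_with_scratchpad(task: str, max_steps: int = 100) -> str:
    scratchpad = []  # unbounded external memory, held outside the model
    for _ in range(max_steps):
        window = "\n".join(scratchpad[-5:])  # bounded view, like a context window
        step = llm(f"Task: {task}\nYour recent notes:\n{window}\n"
                   f"Write the next short note, or 'FINAL: <answer>' if done.")
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        scratchpad.append(step)
    return "no answer within the step budget"
```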
Let’s hope not.