larger language models may disappoint you [or, an eternally unfinished draft]

what this post is

The following is an incomplete draft, which I’m publishing now because I am unlikely to ever finish writing it.

I no longer fully endorse all the claims in the post. (In a few cases, I’ve added a note to say this explicitly.) However, there are some arguments in the post that I still endorse, and which I have not seen made elsewhere.

This post is the result of me having lots of opinions about LM scaling, at various times in 2021, which were difficult to write down briefly or independently of one another. This post, originally written in July 2021, is the closest I got to writing them all down in one place.

-nost, 11/​26/​21


0. caveat

This post will definitely disappoint you.

Or, anyway, it will definitely disappoint me. I know that even though I haven’t written it yet.

My drafts folder contains several long, abandoned attempts to write (something like) this post. I’ve written (something like) this post many times in my head. I just can’t seem to get it right, though. The drafts always sprawl out of control.

So, if I can’t do it right, why not do it wrong? Here’s the disorganized, incomplete, brain-dump version of the better post I wish I were writing. Caveat lector.

1. polarization

The topic of this post is large language models (LMs) like GPT-3. Specifically, what will happen as we make them larger and larger.

By my lights, everyone else seems either too impressed/​scared by the concept of LM scaling, or not impressed/​scared enough.

On LessWrong and related communities, I see lots of people worrying in earnest about whether the first superhuman AGI will be a GPT-like model. Both here and in the wider world, people often talk about GPT-3 like it’s a far “smarter” being that it seems to me.

On the other hand, the people who aren’t scared often don’t seem like they’re even paying attention. Faced with a sudden leap in machine capabilities, they shrug. Faced with a simple recipe that can make those machines even better—with eerie, physics-like regularity—they . . . still shrug. I wrote about the most infamous of these detractors here.

Meanwhile, I’m here in the middle. What do I think? Something like:

  • The newer (i.e. large transformer) LMs really are a huge advance in NLP over the prior state of the art

  • The prior state of the art was really bad, though. Before the new LMs, neural nets simply couldn’t “do” language the way they could “do” images, something I noted back in 2017.

  • Most of the “huge advance” happened in the smallest of the new models, like BERT-Base and GPT-2-small.

  • The effect of scaling up these models is mostly to “de-noise” capabilities already evident in the small ones. It makes their strengths more robust and easier to access, but doesn’t add fundamentally new strengths.

  • The larger language models of the future will be highly impactful, but banal.

    • They will probably allow us to fully automate all the routine linguistic tasks you could almost imagine automating with GPT-3.

    • People will make wonderful new things using them.

    • They won’t be “smart” in any way that GPT-3 is not, or indeed, really in any way that GPT-2 was not.

    • They will get better at abstract reasoning—in the sense that it will be easier to get them to spit out text that sounds like it is the product of abstract reasoning. (As even GPT-2 does frequently.) They will be weak at this relative to their other capabilities, as they are today, and little will come of it.

    • They might end up as sub-systems in an AGI one day.

The rest of the post will consist of some gestures where I try to make the above feel as natural to you as it does to me.

2. the enthusiast’s argument

First, let’s spell out the argument that has people thinking GPT will lead to AGI.

Roughly the same argument has been made elsewhere by gwern and bmk, among others.

  1. Loss scaling will continue. It will be straightforward to achieve lower and lower language modeling loss simply by using more compute + data.

    We can do this without making any new conceptual advances (except perhaps in hardware). Therefore someone will do it.

  2. Loss scaling could well continue indefinitely. I.e., more compute + data might push the loss asymptotically all the way to to the “intrinsic entropy of text”—the true noise left when all patterns have been accounted for, including arbitrarily hard ones.

    It could be the case that scaling will instead bottom out at some earlier point, when the only patterns left are “too hard” for the models. We don’t have much evidence one way or another on this point, and even if we did, we would have no idea how hard is “too hard.”

  3. Language modeling is AGI-complete. A model that truly understood all patterns in a language modeling dataset would possess a large fraction of all human capabilities taken together.

    Can you write a textbook on Étale cohomology? (If you can’t, then you’re missing out on some language modeling loss.) Can you play both sides of a roundtable between the world’s leading economists, imitating the distinct intellect of each one, the novel arguments they’d concoct of the spot, the subtle flecks of personal bias tainting those arguments? (If you can’t, then...) Can you translate between any pair of human languages that any linguist, anywhere, knows how to translate? (If you can’t, then...) And so on.

  4. Loss scaling makes models “smarter” fast enough to matter. This point is a crucial bridge between the abstract potential from points 2 and 3, and the quantitative near-term predictions from point 1.

    This point is easiest to explain by showing what its negation looks like.

    Suppose that points 2 and 3 are really true—that adding more compute/​data eventually turns a transformer LM into an AGI. That doesn’t tell you anything about how fast the process happens.

    How many orders of magnitude do we need to add to make the model non-negligibly smarter? If the answer is “1 OOM,” then the scaling projections from point 1 are relevant. If the answer is “100 OOM” . . . not so much.

    (Or, consider a variant of this scenario: suppose most of the abilities we care about, when we use the term “AGI,” are locked away in the very last tiny sliver of loss just above the intrinsic entropy of text. In the final 0.00[...many extra zeros...]1 bits/​character, in a loss difference so tiny we’d need vastly larger validation sets to for it to be distinguishable from data-sampling noise.)

I agree with points 1-3. Point 4 is where I and the enthusiasts diverge.

3. are we getting smarter yet?

Why do the enthusiasts believe point 4? That is, why would we expect a feasible, incremental scaling upgrade to yield a meaningful boost in intelligence?

Because it already did: GPT-3 is meaningfully smarter than GPT-2.

The enthusiast’s argument, in its most common form, relies entirely on this premise. The enthusiast knows perfectly well that AGI-completeness in principle is not enough: we need, not just an asymptotic result, but some idea of when we might get close enough.

As gwern puts it [my emphasis]:

The pretraining thesis, while logically impeccable—how is a model supposed to solve all possible trick questions without understanding, just guessing?—never struck me as convincing, an argument admitting neither confutation nor conviction. It feels too much like a magic trick: “here’s some information theory, here’s a human benchmark, here’s how we can encode all tasks as a sequence prediction problem, hey presto—Intelligence!” There are lots of algorithms which are Turing-complete or ‘universal’ in some sense; there are lots of algorithms like AIXI which solve AI in some theoretical sense (Schmidhuber & company have many of these cute algorithms such as ‘the fastest possible algorithm for all problems’, with the minor catch of some constant factors which require computers bigger than the universe).

Why think pretraining or sequence modeling is not another one of them? Sure, if the model got a low enough loss, it’d have to be intelligent, but how could you prove that would happen in practice? [...] It might require more text than exists, countless petabytes of data for all of those subtle factors like logical reasoning to represent enough training signal, amidst all the noise and distractors, to train a model. Or maybe your models are too small to do more than absorb the simple surface-level signals [...]

But apparently, it would’ve worked fine. [...] It just required more compute & data than anyone was willing to risk on it until a few true-believers were able to get their hands on a few million dollars of compute. [...]

If GPT-3 gained so much meta-learning and world knowledge by dropping its absolute loss ~50% when starting from GPT-2’s level, what capabilities would another ~30% improvement over GPT-3 gain?

But … are the GPTs getting meaningfully smarter already, as we scale them?

It’s tempting to casually answer “yes,” pointing to any one of the numerous ways that the bigger models just are better. (But see the section below on “continuity”!)

However, we should not take this question so lightly. A yes answer would “complete the circuit” of the enthusiast’s argument—“turn it on” as a live concern. A no answer would leave the argument in limbo until more evidence comes in.

So, let’s assess the state of the evidence.

4. on ecological evaluation

Consider an organism, say, or a reinforcement learning agent. How do we know whether it has some capability?

Easy. We put it in a situation where it needs to deploy that capability to get what it wants. We put food (or reward) at the end of the maze.

Assessing capabilities by prompting GPT is not like this. GPT does not “want” to show off its capabilities to you, the way a mouse wants food and an RL agent wants reward.

What GPT wants—what it was directly optimized to do—is to guess how a text will continue. This is not the same as “getting the right answer” or even “saying something sensible.”

GPT was trained on the writing of thousands of individual humans, possessed of various flavors and magnitudes of ignorance, and capable of saying all kinds of irrational, inexplicable, or just plain bizarre things on occasion. To put it rather over-dramatically: much of the task of language modeling is figuring out which capabilities you’re not supposed to reveal right now. Figuring out what sorts of mistakes the current writer is likely to make, and making them right on cue.

Thus, prompting tends to vastly underestimate (!) what any LM knows how to do in principle.

What is special about the “food in the maze” type of evaluation: it removes any uncertainty as to whether the model knows it’s supposed to do the thing you want. The model is given a direct signal, in its “native language,” about exactly what you want. This will tend to elicit the capability if it exists at all.

There’s probably a standard term for the “food in the maze” thing, but I don’t know it, so I’ll just make one up: “ecological evaluation.”


4b. the road not taken

It’s totally possible to do ecological evaluation with large LMs. (Indeed, lots of people are doing it.) For example, you can:

  • Take an RL environment with some text in it, and make an agent that uses the LM as its “text understanding module.”

    • If the LM has a capacity, and that capability is helpful for the task, the agent will learn to elicit it from the LM as needed. See e.g. this paper.

  • Just do supervised learning on a capability you want to probe.

Both of these can be done with the LM weights frozen, or with full fine-tuning, or with a frozen LM plus a new “head” on top.

A purist might argue that you have to freeze the LM weights, or else you aren’t really probing what the LM “already” knows. (The gradients from fine-tuning could induce new capabilities that weren’t there before.)

But I doubt it really matters, since it turns out you can get the benefits of full fine-tuning even if you only tune the bias terms—conceptually, just boosting or lowering the salience of patterns the LM could already recognize.

There is a divide—to me, a strange and inexplicable one—in the LM community, as to who does this ecological stuff and who doesn’t.

  • The people who do fine-tuning /​ extra heads /​ etc...

    • … generally don’t care about scaling (an exception: section 3.4 here)

    • ...generally use comparatively “tiny” models like BERT (an exception: T5)

    • are often just trying to get practical things done, not deepen our understanding of LM capabilities (an exception: “probing tasks” in BERTology)

  • The people who care about scaling and huge models...

    • … care about understanding LM capabilities

    • … mostly use non-ecological methods (prompting /​ few-shot), which are vastly unreliable measures of capability

    • … often use purely subjective (and thus bias-prone) measures, like whether samples from an LM “feel smart” or “sound human” to a particular reader

In other words, there are ways to really know what a big LM is capable of—but the GPT enthusiasts aren’t making use of them.


4c. non-ecological evaluation considered harmful

Non-ecological evaluation is epistemically bad. Whatever signal it provides is buried under thick layers of bias and noise, and can only be extracted with great care, if at all.

I don’t think the GPT enthusiasts realize just how bad it is. I think this is one crux of our disagreement.

Let’s survey some of the problems. (The names below are made-up and not meant very seriously—I just need some headings to make this section readable.)

Guess-what-I-mean bias, type 1

As discussed above, the model may not understand what specific thing you want it to do, even if it’s perfectly capable of doing that thing.

Result: a downward bias in capability estimates.

Guess-what-I-mean bias, type 2

The observed signal mixes together two components: “Can the model guess what you’re trying to make it do?”, and “Can the model actually do that thing?”

But when people interpret such results, they tend to round them off to measures only of the latter.

That is, when people see a bigger model do better on a few-shot task, they tend to think, “the model got better at the task!”—not “the model got better at guessing which task I mean!”

But bigger models tend to get better at these two things simultaneously. The better results you get from bigger models reflect some mixture of “true capability gains” and “better guessing of what the prompt writer was trying to measure.”

Result: an upward bias in capability scaling estimates.

Prompt noise and humans-in-the-loop

Guessing-what-you-mean is extremely sensitive to fine details of the prompt, even with huge models. (This is why “prompt programming” is a thing.)

Thus, if you just pick the first reasonable-seeming prompt that comes into your head, you’ll get a horribly noisy measure of the LM’s true abilities. Maybe a slightly different prompt would elicit far better performance.

(As you’d expect, the GPT-3 paper—which took the “first reasonable-seeming prompt that comes into your head” approach—ended up using severely suboptimal prompts for some tasks, like WiC.)

If possible, you want less noisy estimates. So you do prompt programming. You try a bunch of different things.

Even picking one “reasonable-seeming” prompt requires some human linguistic knowledge (to tell you what seems reasonable). Optimizing the prompt introduces more and more human linguistic knowledge, as you use what you know about language and the task to come up with new candidates and diagnose problems.

Now we’re not evaluating a machine anyone. We’re evaluating a (human + machine) super-system.

I don’t want to make too much of this. Like, if you can find some prompt that always works across every variation of the task, surely the LM must “really know how to do the task,” right?

(Although there are dangers even here. Are you doing the same amount of prompt-optimization with bigger models as with smaller ones? What performance might be coaxed out of GPT-2 124M, if you gave it as much attention as you’re giving GPT-3? Probably not much, I agree—but if you haven’t tried, that’s a source of bias.)

The issue I’m raising here is not that big LMs can’t be smart without humans in the loop. (I’m sure they can.) The issue is that, with a human involved, we can’t see clearly which parts would be easy for a machine alone, and hence which parts get us straightforwardly closer to AGI.

For example. In an ecological setting—with no human, only a machine (say an RL agent with an LM sub-system) -- would the machine need to do its own “prompt programming”?

How much worse would it be at this than you are? (The part that operates the LM from the outside knows nothing about language; that’s what the LM is there for.) What algorithms would work for this?

Or maybe that wouldn’t be necessary. Maybe the right information is there in the LM’s inner activations, even when it’s fed a “bad” prompt. Maybe the problem with “bad” prompts is only that they don’t propagate this interior info into the output in a legible way. I don’t know. No one does.

[Addendum 11/​26/​21: prompt/​P-tuning sheds some light on this question, cf. next Addendum]

4d. just how much does prompting suck?

But how much does all that really matter? Are we really missing out on nontrivial knowledge here?

Two case studies.

Case study: BERT in the maze

The GPT-3 paper measured model capabilities with “few-shot” prompting, i.e. filling up a long prompt with solved task examples and letting the model fill in the final-unsolved one. Typically they used 10 to 100 examples.

They compared GPT-3 against strong previous models on the same tasks.

These reference models used fine-tuning, generally with many more than 100 examples—but the gap here is not always very big. On some benchmarks of great academic interest, even the fine-tuned models only get to see a few hundred examples:

SuperGLUE data sizes

Some of the reference models were carefully designed by researchers for one specific task. Let’s ignore those.

In most cases, the paper also compared against a BERT baseline: literally just a vanilla transformer, like GPT-3, hooked up to the task with vanilla fine-tuning. (Fine-tuning BERT is literally so routine that a machine can do the entire process for you, even on a totally novel dataset.)

How well did GPT-3 do? On most tasks, about as well as a fine-tuned BERT-Large. Which is a transformer 500 times smaller than GPT-3.

These are not new feats of intelligence emerging at GPT-3′s vast scale. Apparently they’re already there inside models several orders of magnitude smaller. They’re not hard to see, once you put food at the end of the maze, and give the model a reason to show off its smarts.

(Once again, GPT-3 saw fewer examples than the reference models—but often not by much, and anyway you can make BERT do just fine with only 10-100 examples if you try hard and believe in yourself)

So. If even cute little BERT-Large is capable of all this … then what on earth is GPT-3 really capable of?

Either GPT-3 is far smarter than the few-shot results can possibly convey . . .

. . . or it isn’t—which would be a dramatic failure of scaling, with those 499 extra copies of BERT’s neural infrastructure hardly adding any intelligence!

No one knows, and no amount of prompting can tell you.

As I wrote last summer:

I called GPT-3 a “disappointing paper,” which is not the same thing as calling the model disappointing: the feeling is more like how I’d feel if they found a superintelligent alien and chose only to communicate its abilities by noting that, when the alien is blackout drunk and playing 8 simuntaneous games of chess while also taking an IQ test, it then has an “IQ” of about 100.

[Addendum 11/​26/​21:

“No one knows” here was wrong. The P-tuning paper, from March 2021, described an ecological evaluation method for GPTs that make them competitive with similarly-sized BERTs on SuperGLUE.

I think I had heard of prompt tuning when I wrote this, but I had not read that paper and didn’t appreciate how powerful this family of methods is.

I’m not currently aware of any P-tuning-like results with very large models like GPT-3. End addendum]

Case study: no one knows what few-shot results even mean

There’s an excellent blog post by janus called “Language models are 0-shot interpreters.” Go read it, if you haven’t yet. I’ll summarize parts of it below, but I’ll probably get it a bit wrong.

As stated above, the GPT-3 paper prompted the model with solved task examples. In fact, they compared three variants of this:

  • zero-shot: no examples

  • one-shot: a single example

  • few-shot: many (10-100) examples

Most of the time, more “shots” were better. And the bigger the LM, the more it benefitted from extra shots.

It is not immediately obvious what to make of this.

The GPT-3 paper takes care to be technically 100% agnostic about the underlying mechanism . . . if you read it carefully, including the fine-print (i.e. footnotes and appendices). At the same time, in its choice of words, it gestures suggestively in exciting directions that a casual reader is likely to take at face value.

For example, the paper makes extensive use of the term “meta-learning.” Read casually, it seems to be saying that LMs as big as GPT-3 have a novel capability—they can learn new tasks on the fly, without fine-tuning!

But what the paper means by “meta-learning” is probably not what you mean by “meta-learning.”

The paper’s own definition is provided in a footnote. It is (admirably) precise, non-standard, and almost tautologous. In short, meta-learning is “any mechanism that makes more ‘shots’ work better”:

These terms [“meta-learning” and “zero/​one/​few-shot” -nost] are intended to remain agnostic on the question of whether the model learns new tasks from scratch at inference time or simply recognizes patterns seen during training – this is an important issue which we discuss later in the paper, but “meta-learning” is intended to encompass both possibilities, and simply describes the inner-outer loop structure.

The same passage is quoted by janus in “Language models are 0-shot interpreters,” who goes on to say:

The later discussion is not very extensive, mostly just acknowledging the ambiguity inherent to few-shot [...]

This is the uncertainty that I will investigate in this blog post, expanding on the results published in Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm.

My purpose is also to challenge the ontology introduced by Language Models are Few-Shot Learners. Although the authors are careful to remain agnostic as to the mechanism of few-shot/​meta-learning, what we have found by probing the mechanism suggests that an alternative framework which emphasizes the means by which a task is communicated may be more salient in some contexts.

What does janus mean? The post goes on to describe a number of experiments, whose results suggest that

  • What matters is not the number of task examples (“shots”), but how well the prompt specifies the desired task.

  • Failures of zero- and one-shot are often failures to guess-what-I-mean on the basis of a legitimately ambiguous prompt.

  • The value of additional “shots” may only lie in their value as a proxy for clarity in task communication.

    • Once you have written a sufficiently clear one-shot (or even zero-shot) prompt, the model does not do any better with additional examples—the task has already been communicated.

    • In some cases, one-shot is actually worse than zero-shot—because it adds a new kind of ambiguity.

      (“We noticed that sometimes the model would respond to one-shot prompts as if the semantic content of the example translation was relevant to the new translation. Without multiple examples, it’s less clear that the translation instances are meant to be parallel and independent.”)

  • OpenAI made much of the fact that adding “shots” helps larger models more. (This was the result behind the whole “meta-learning” framing.) However...

    • Larger models are also far better at zero-shot.

    • Comparing a zero-shot prompt to a “control” prompt with no task information, larger models get a larger fraction of their few-shot performance out of the jump from control to zero-shot, and a smaller fraction from the additional examples.

    • In other words: GPT-3′s size lets it extract more information from examples, if you provide them. But its size also lets it extract far more information from the original question. So it doesn’t need the examples as much.


janus’s post delights me for two reasons. First, I enjoyed learning the new experimental evidence it presents. But second, and perhaps more importantly, there was the sense of relief that someone actually did the experiments!

OpenAI’s few-shot “learning” results are full of ambiguity. The GPT-3 paper left me confused on a basic philosophical level, as I noted at the time.

Surely the model isn’t learning French de novo from 100 paired sentences—either it speaks French at the outset, or it doesn’t. So what could it be “learning” from those 100 examples?

Likewise for virtually every result in the paper: grammar, commonsense reasoning, book-learning trivia quizzes… all things it’s clearly possible to learn from reading massive swaths of the internet, and all things it’s clearly impossible to learn from reading 10 to 100 examples. Yet the examples help? And I’m supposed to think that makes the model . . . smarter, somehow?

Well, for French --> English translation at least, it turns out that the examples help in pretty much the only way they possibly could: by informing an already competent translator that you are requesting a translation.

As we were attempting to replicate [OpenAI’s translation] results, we noticed that when the model was failing on the 0-shot prompt, the failures were often of catastrophic nature: the task was not attempted at all, e.g. the model would output a newline, or another (or the same) French phrase instead of an attempt at an English translation.

BLEU assigns a score from 0 to 1 to the accuracy of a translation, and would assign a score close to 0 to a catastrophic failure. The scores reported in the paper, however, are averaged over a large dataset, so the same score could hypothetically correspond to uniformly flawed attempts or a mix of perfect attempts and catastrophic failures.

It seemed possible that 0-shot prompts were much less reliable at getting the models to attempt the translation task, but result in equivalent accuracy in the event that they did attempt it.

[janus does an experiment investigating this hypothesis, and the results seem to confirm it]

How much of the apparent consistent monotonic improvement in performance on tasks relative to number of shots in OpenAI’s results can be attributed to an unhelpful zero-shot prompt? Much more extensive testing is needed to say, but I suspect that this is the case for most of the translation tasks, at least.

If the extra shots are just about clarifying the task, then what should we make of the claim that “larger models benefit more from extra shots?” That it’s . . . easier to clarify tasks to them using this one particular mechanism? When people say GPT-3 displays some new, frightening kind of intelligence, emerging only at its massive scale, surely they can’t mean that?

And that’s not even all. As janus shows, even though it is easier to clarify tasks to GPT-3 through the “shots” mechanism, it’s also easier for GPT-3 to guess what you mean with no shots at all.

“My friend has such sharp hearing. Why, you see see, conditional on her not hearing what you say the first time you say it, she will definitely hear it when you repeat yourself.” Quite probably true, but not a good way to make the point!

What does it even mean that “language models are few-shot learners”? What does that tell us about the model’s capabilities? We don’t know. We haven’t studied it in the level of depth appropriate for something that might actually matter.

After all, janus did a simple and innocuous set of experiments—just trying to figure out which prompts work best—and ended up drawing radically different conclusions about the whole thing than OpenAI did.

Oh, surely GPT-3 is plenty smart, I don’t doubt that. The key question is how much smarter it got from scale, and in which ways. I don’t think we’ll know that until we put the model to the test, ecologically.

5. what scaling does

LMs are trained with a convex loss function. This means they are not min-maxers. They prefer to spread out their capabilities.

Given two areas of skill, Thing A and Thing B, they’ll try to become equally good at both, even if that means not doing especially well at either. Given an extra marginal unit of potential-for-greatness, they’ll smear it out as far as possible over all the Things they know about.

Thanks to the convex loss, they do this in proportion to how bad they are at each Thing to begin with—leveling up their very worst abilities first, making themselves as un-specialized as they can.

As we’ve discussed above, LMs are also trained on very wide-ranging text corpora. Everything from fourth-tier clickbait news to advanced physics preprints to advanced-looking but crackpot physics preprints to badly-written furry porn to astonishingly well-written furry porn to mangled OCR transcripts of 18th-century law texts to et cetera, et cetera. And as far as they can manage, they will do exactly as well at modeling each and every (information-theoretic) bit of it.

Larger LMs achieve lower loss. We know that from the scaling laws. And lower loss means being better at predicting each individual word in that corpus, as uniformly as possible.


What does this imply?

First: that larger LMs are better at everything. It is difficult to find any capability which is present at all in smaller LMs, yet which does not improve with scale.

And second: that LMs abhor a skill vacuum.

Take a tiny LM, so tiny it really can’t make heads or tails of some particular type of text. Now start scaling it up. As its capacity grows, its first priority is to eliminate its greatest weaknesses.

That one type of text, that utterly baffled the tiny LM? Convex loss hates that. Every additional unit of capacity gets invested, disproportionately, in bringing such stragglers up to par. The LM desperately wants to be at least sort-of-decent-I-guess at everything—more than it wants to be a master of anything.

Given any one Thing, it will reach sort-of-decent-I-guess performance at that Thing at the smallest scale it can manage—given the competition from all the other Things it desperately needs to be sort-of-decent-I-guess at.

By subjective standards, to human eyes casually scanning over LM samples, this happens pretty fast.

GPT-3 is great at lots of individual Things. But take any one of those Things, and you can bet a much tinier LM can do it at the sort-of-decent-I-guess level.


Humans, I think, tend to expect intelligence to grow in discontinuous jumps. Stages of child development. They don’t understand that at this age. And then, a year or two later, they do understand—fully.

LMs work the other way around. They never perform a sudden jump into competence where they could instead make a slow, gradual rise from “sort of seeming like they understand 10% of the time” to “sort of seeming like they understand 11% of the time” and so on. And this for every Thing uniformly.

It’s very hard to find any point where scaling suddenly “flips a switch,” and the model didn’t Get It before, but now it Gets It.

(The one example I know of is GPT-3 arithmetic, for some reason. Note that few-shot learning—whether you call it “meta-learning” or not—is as gradual as everything else, not a switch that flips on with GPT-3.)

[Addendum 11/​26/​21:

I wrote this in ignorance of the BIG-Bench project, which is tracking returns to scale for a large and diverse set of tasks.

BIG-Bench has not published results yet, but they livestreamed some preliminary results in May 2021; see also LW discussion here.

In the livestream, they give two examples of tasks with smooth scaling and two examples of tasks with a “sudden switch-flip” around 100B params (this slide). They also show, in the aggregate over all tasks, “switch-flip” to faster scaling around 100B (this slide), although this is tricky to interpret since it depends on the task mixture. End addendum]


This confounds our intuitive assessments of LM scaling.

I have been on this train since the beginning, when tiny lil’ GPT-2 124M blew my mind. I’ve used every new big every model from almost the moment it came out, as excited as a kid on Christmas morning.

I did this with every step of the (in retrospect, rather silly) GPT-2 staged release. My tumblr bot started out as (I think) 774M. Then I jumped to 1.5B.

That was as far as free OpenAI models went, but when EleutherAI came out with a 2.7B model, I finetuned that one for my bot. I was willing to endure the absolute horrors of mesh-tensorflow (don’t ask) to get that 2.7B model up and running. Then, when EleutherAI made a 6.1B model, I got my bot using it in under a week.

I feel a kind of double vision, seeing these scale-ups happen in real time. The bigger models are better, at everything at once. Each increment of scale is barely perceptible, but once you’ve crossed enough of a scale gap, you feel it. When I go back and sample from 124M again, I feel deprived. It just isn’t as good.

And yet, 124M blew my mind the first time I saw it. And no bigger LM, not even GPT-3, has come close to that experience.

Even lil’ 124M is sort-of-decent-I-guess at so many things. It gets all the basics that older LMs missed: a true grasp of the regularity of syntax, the nuances of style, and at least the way meaning sounds if not meaning itself.

124M makes lots of little blunders, littered all over the place. Your probability of running into one increases as you read a sample, token by token.

You can glide along almost thinking “a human wrote this,” but soon enough, you’ll hit a point where the model gives away the whole game. Not just something weird (humans can be weird) but something alien, inherently unfitted to the context, something no one ever would write, even to be weird on purpose.

The bigger models? They smooth out all the trivial failings. They unwrinkle the creases. They babble on for longer and longer stretches without falling flat on their face. But eventually they fall, and they fall just as hard.


Play with GPT-3 for long, and you’ll see it fall hard too.

Here’s a sample where GPT-3 falls on its face. It starts out as a (near-?) verbatim regurgitation of a Wikipedia article on the 6th Harry Potter film. It’s factually perfect up until the plot summary, which immediately goes off the rails into metafiction:

Following a Harry Potter fan’s dream that Harry’s late headmaster Albus Dumbledore is alive, and in a critical condition at the Ministry of Magic, Harry Potter and his friends Ron Weasley and Hermione Granger, decide to rescue him, as the school year comes to a close.

This sort of fanfictional “plot summary” proceeds with barely even an internal kind of coherence, as the tone veers from tense, dark drama to inappropriate anticlimax:

The two engage in a fierce duel in which Snape calls on his master to save him. Harry is unaffected by the curse due to his ability to cast a shield charm. He manages to shield himself and fight back, and in his distraction, Snape accidentally breaks his neck and dies.

Dumbledore explicitly dies, yet is somehow alive later on:

Lucius disarms Dumbledore, and an enraged Bellatrix kills him. [...]


Harry wakes up to find Dumbledore, Sirius, and Remus in the hospital wing, as well as his friends and the rest of the school, and he realizes that he is safe.

At the very end, both the Wikipedia article and the extended plot summary abruptly fall away, and we are suddenly reading an inane film review, by someone who must have a dark sense of humor:

I liked this movie as it is full of action and adventure. The plot is great as well as the dialogue. It is a well made movie and it is very entertaining.

This movie is definitely a must-watch, as it has plenty of action as well as being very humorous.

Advertisements

This sample is a failure. No one would have written this, not even as satire or surrealism or experimental literature. Taken as a joke, it’s a nonsensical one. Taken as a plot for a film, it can’t even keep track of who’s alive and who’s dead. It contains three recognizable genres of writing that would never appear together in this particular way, with no delineations whatsoever.


Remember janus’s point about taking averages over success and “catastrophic failure”? Human reactions to LM samples are like that too. Even the smallest ones do all sorts of things OK, and can string you along for a while. The bigger ones do this for longer. The average consists of stringing people along, and then failing.

When you see a statistic like “humans could distinguish LM samples from human-written text in 52% of cases,” this doesn’t mean people are squinting at every single text like Blade Runner characters, scrutinizing it for the subtle, nearly invisible “tells” of cold, whirring machinery.

It means most of the texts look like generic news stories, and then occasionally you get one where Dumbledore dies, and is alive in the hospital, and I liked this movie as it is full of action and adventure.

When we get GPT-4, Dumbledore will die and be reborn a little less often, but mark my works, it’ll happen.

Subjectively, LM scaling converges pointwise but not uniformly. The bigger models string you along for longer stretches, and then they fall, and they fall exactly as hard. Like the jumps in a Fourier series, striving to fit a square wave and never succeeding, not at infinite scale.

Animation of the Gibbs Phenomenon, from Wikipedia

5b. what curation ratios miss

Bigger models write “better” samples, by our subjective lights. Is that possible to quantify?

There’s a simple way to do this, proposed by gwern and recently expanded upon by janus.

Sample from an LM, which some threshold of perceived quality in mind. Keep only the samples that good enough to pass your threshold, discarding the rest. (This is simply “curation” or “cherry-picking”). Divide the number of samples you keep by the total number you generated, and you get a ratio.

gwern:

Objective metrics hard to interpret. How much better is (un-finetuned base) GPT-3? The likelihood loss is an absolute measure, as are the benchmarks, but it’s hard to say what a decrease of, say, 0.1 bits per character might mean, or a 5% improvement on SQuAD, in terms of real-world use or creative fiction writing. It feels like a large improvement, definitely a larger improvement than going from GPT-2-345M to GPT-2-1.5b, or GPT-2-1.5b to GPT-3-12b, but how much?

Screening gains: 1:100 → 1:5 or 20× better? For fiction, I treat it as a curation problem: how many samples do I have to read to get one worth showing off? [...] With GPT-2-117M poetry, I’d typically read through a few hundred samples to get a good one, with worthwhile improvements coming from 345M→774M→1.5b; by 1.5b, I’d say that for the crowdsourcing experiment⁠, I read through 50–100 ‘poems’ to select one. But for GPT-3, once the prompt is dialed in, the ratio appears to have dropped to closer to 1:5—maybe even as low as 1:3! [...] I would have to read GPT-2 outputs for months and probably surreptitiously edit samples together to get a dataset of samples like this page.

I completely understand why you would want to compute this metric. But it doesn’t capture everything about subjective quality.

Suppose I take a math test. If it’s scored in the simplest way, with no partial credit, then my score on the test is straightforwardly a “curation ratio.” If I got 50% of the questions right, my ratio is 2:1.

But there are a lot of different ways to get 50% of a math test wrong. I could

  1. remember the techniques needed solve a 50% subset of the questions, and be totally stumped by the other 50%, or...

  2. be capable of doing all the problems, but fail to use it 50% of the time perhaps due to getting distracted or tired later in the test, or...

  3. be capable of doing all the problems, but fail to use it 50% of the time in a completely random fashion no one can predict or explain, or...

  4. be capable of doing all the problems, but also be sensitive to question phrasing in a complex and illegible way, so that I flub some questions that say “solve for X” where I could have gotten them if they had said “find X”, or...

  5. ignore the nominal purpose of a math test (“demonstrate what you know”) and instead set myself the task of roleplaying a self-consistent student character of initially unspecified skill, so that after making one stray mistake, the “correct” thing to do is now to make the same mistake everywhere, just like “my character” would, or...

#1 and #2 are the kinds of mistakes we expect humans to make. When we interpret human test scores, we assume we are averaging over these types of mistakes. This is why we feel comfortable making inferences from the scores to the human’s understanding of the mathematical material.

LMs do #3-5 ubiquitously. (In particular, whatever else they are doing, LMs are always doing the role-playing of #5.)

Even as curation ratios trend towards 1:1, I don’t think this distinction shrinks in size. That is, the LMs really are making fewer mistakes, but when they do make mistakes, I don’t think they make them for increasingly “appropriate” reasons.

As I’ll describe next, I’ve read a whole lot of LM samples, across many different model sizes. For better or for worse, this is the subjective sense I come away with.

It is for hard for me, inside my intuitive mental model, to credit any LM with any concrete “capability.” To rely on it to know something, the way I’m relying on you to know some things (or else this post would be incomprehensible).

In my model there are only probabilities of different failures, and the probabilities decline, but the failures themselves remain perfectly devoid of sense and reason, concentrated diamonds of non-thought and non-seeing. Not amendable to Dennett’s intentional stance. Non-even-wrong rather than wrong.

This movie is definitely a must-watch, as it has plenty of action as well as being very humorous.

Advertisements

5c. my own subjective experience

What does it feel like to compare one GPT model to another one that’s just a tad bigger? (You can try this right now on one of the free web APIs, say with 774M and 1.5B.)

To be honest, I find this distinction almost imperceptible. So close to imperceptible that I’d be willing to chalk the remainder up to confirmation bias.

I’ve made this leap three times with my tumblr bot. Add up several nearly-imperceptible changes and, of course, you get a perceptible one. The bot really does feel a little “smarter” at 6.1B than at 774M.

But smarter how, exactly? I can’t point you to one single thing it “learned” or “became capable of” when I went from 774M to 1.5B—like I said, I could hardly notice a difference, much less a discrete one like “learning something new.” Likewise, I cannot point to any one new capability I got from the next two scale-ups. (The model did start occasionally writing non-English text at 2.7B, but I think that was just the change in pre-training corpus.)

It just . . . sounds a little more like it’s making sense, now, on average. Equivalently, it takes more tokens on average for it to stop making sense.

My mental model of these things does not contain abstractions like “knowing a fact” or “mastering a skill.” If it ever did, it doesn’t now.

I don’t even know how many tens of thousands of LM samples I’ve read by now. (Just my bot alone has written 80,138 posts—and counting—and while I no longer read every new one these days, I did for a very long time.)

Read enough, and you will witness the LM both failing and succeeding at anything your mind might want to carve out as a “capability.” You see the semblance of abstract reasoning shimmer across a strings of tokens, only to yield to suddenly to absurd, direct self-contradiction. You see the model getting each fact right, then wrong, then right. I see no single, stable trove of skills being leveraged here and there as needed. I just see stretches of success and failure at imitating ten thousand different kinds of people, all nearly independent of one another, the products of barely-coupled subsystems.

This is hard to refute, but I think this is something you only grok when you read enough LM samples—where “enough” is a pretty big number.

GPT makes many mistakes, but many of these mistakes are of types which it only makes rarely. Some mistake the model makes only every 200 samples, say, is invisible upon one’s first encounter with GPT. You don’t even notice that model is “getting it right,” any more than you would notice a fellow human “failing to forget” that water flows downhill. It’s just part of the floor you think you’re standing on.

The first time you see it, it surprises you, a crack in the floor. By the fourth time, it doesn’t surprise you as much. The fortieth time you see the mistake, you don’t even notice it, because “the model occasionally gets this wrong” has become part of the floor.

Eventually, you no longer picture of a floor with cracks in it. You picture a roiling chaos which randomly, but regularly, coalesces into ephemeral structures possessing randomly selected subsets of the properties of floors.

Once you have this picture, I find, it never goes away. That “bad” Harry Potter sample was exceptional; I did have to dig through plenty of unobjectionable stuff to find it, or stuff that was merely wrong in some more limited way (factually, tonally). Compared to the ones I knew, the model went on for longer stretches not making each mistake—before, at last, it made it.

I’ve played with GPT-3 in the API too, and still, I just don’t see the “phase transition” that some people see. I don’t see a new level of abstract reasoning, just stochastically longer intervals in which the text fails to reveal a lack of abstract reasoning.

5c. presidents

What do the different GPTs “know” about the historical presidents of the U.S.?

Here is a picture from OpenAI’s second scaling paper:

The accompanying text:

[...] we also observe some qualitative “phases of learning”, with small models having difficulty understanding the question being asked of them, larger models showing some rudimentary understanding, and the largest models correctly answering the questions. [...]

Tiny models appear to have trouble understanding the question, and don’t place any significant probability on the correct answer. Larger models understand that we’re requesting a US president, but fail to understand that the “second president” and “first president” are different requests, placing most of their weight for both questions on “George Washington”. Only larger models understand both aspects of the questions, answering both correctly.

Is this information about “what the different models know” about the presidents?

The inset text in the picture seems to say so: “100M-parameter models think every president is George Washington … 3B-parameter models understand ordering or presidents.”

Like many offhand statements about big LMs, this one made me feel kind of offended on behalf of smaller ones. I’ve seen sub-1B models converse about all kinds of more niche topics with factual accuracy far above chance. Do we really need 3B parameters for something as basic as “ordering of presidents”?

So I looked into it a bit.

I wrote a different prompt, not in OpenAI’s Q&A format. (I expect my results don’t generalized well across prompts, as LM “knowledge” is always mysteriously sensitive to details of context.)

My prompt looks like:

List of presidents of the United States
George Washington, April 30, 1789 – March 4, 1797
John Adams, March 4, 1797 – March 4, 1801
Thomas Jefferson, March 4, 1801 – March 4, 1809
James Madison, March 4, 1809 – March 4, 1817
James Monroe, March 4, 1817 – March 4, 1825
John Quincy Adams, March 4, 1825 – March 4, 1829
Andrew Jackson, March 4, 1829 – March 4, 1837
Martin Van Buren, March 4, 1837 – March 4, 1841
William Henry Harrison, March 4, 1841 – April 4, 1841
John Tyler, April 4, 1841 – March 4, 1845
James K. Polk, March 4, 1845 – March 4, 1849
Zachary Taylor, March 4, 1849 – July 9, 1850

and so on, ending with Trump. I feed the entire thing into the model, once, and read off its predictions for the first token in each new president’s name.

(In terms of shots, the problem is 0-shot for the first president, 1-shot for the second, 2-shot for the third, etc.)

Here are the probabilities you get from the different GPT-2 models:

What if we look at the rank of the right answer—whether it’s the model’s first guess (rank 0), or its runner-up (rank 1), etc.?

This lens reveals knowledge in the smaller models that’s invisible if you look only at top-1 guesses or probabilities. These models correctly assign low probabilities to all answers—they’re not great at this task, and they know it—but press them for a top-10 list, and you’ll often find the right answer in it.

And here’s the four models from the OpenAI API:

[11/​26/​21: Unfinished. I don’t remember what the rest of this section was going to be about. Probably I was going to say that we intuitively assume models learn things like “ordering of presidents” in stages, and that we tend to interpret coarse-grained evidence in this light, when finer-grained evidence would show continuity with no stages]

6. objective metrics, or: are you smarter than a 5th-grader?

[11/​26/​21: Never started writing this section. It and the next section were intended to call into question the sense that “we have evidence showing GPT-3 is smarter than GPT-2,” by reviewing this evidence.

This section was about changes in numeric metrics. It was going to argue, using LAMBADA as an example, that lot of the standard NLP benchmarks are testing capabilities that seem easy relative to what we know about GPT-2/​3, and thus that they are not good probes of further capability growth.

The BIG-Bench project is now addressing this point, with a suite of diverse tasks, many of them very hard.]

7. getting smarter: subjective evidence?

[11/​26/​21: Never started writing this section. It and the previous section were intended to call into question the sense that “we have evidence showing GPT-3 is smarter than GPT-2,” by reviewing this evidence.

This section was about the subjective feeling that GPT-3 is smarter. It was going to make several arguments, including:

  • Subjective impressions of GPT models are distorted by an anchoring effect.

    • People have a “wow moment” the first time they witness a GPT model.

    • The “wow” moment looks similar in people who had it with GPT-2, vs. people who missed GPT-2 and only became aware of these models with GPT-3.

    • People who see GPT-3 for the first time, without having seen GPT-2, are impressed by many of the things GPT-2 can already do, and only secondarily by new traits of GPT-3; they are also unable to tell which of these traits are which.

    • In both cases, the model was advertised as being importantly bigger than earlier models; this advertisement was more aggressive in the case of GPT-3.

    • So, people anchor to whichever model size they saw first, and attribute the “wow effect” to that size.

    • GPT-2 was once considered scarily big and impressive, maybe even dangerously so; a year later, GPT-2 became “small” and “dumb” and the role of model-that-wows-you transferred to GPT-3; yet the impressive traits are in fact largely the same ones across models.

  • Subjective impressions of GPT models are distorted by framing and salesmanship.

    • Public discussion of GPT models has tended to uncritically accept OpenAI’s choices of framing, e.g.

      • Discussion of GPT-2 during the staged release focused heavily on the ethics of the staged release. GPT-3 is ostensibly much more powerful, but I don’t remember many (any?) people asking why they didn’t do a staged release.

      • Discussion of GPT-3 has focused heavily on the idea that GPT-3 is not just “smarter” but somehow “differently smart,” by exhibiting “meta-learning.” This corresponds to OpenAI’s framing but, as shown above, does not survive a close read of the evidence.

]

8. pancakes

[11/​26/​21: Never started writing this section. It was going to discuss the following idea mentioned in Sharma and Kaplan 2020:

Scaling could also end for other, more interesting reasons. For example, perhaps beyond a certain point the loss can only improve by exploring a higher dimensional data manifold. This is possible if the data manifold has a pancake-like structure, with a small width that can only be dissected by models with very large capacity.

In string theory, spacetime lives in 11 dimensions, but looks mostly like a 4-dimensional slice; orthogonal to the slice, there are tiny but importantly structured protrusions into the other dimensions.

The data manifold might look like this too, and the “tiny extra dimensions” might encode a lot of the high-level capabilities we care about for AGI. As argued in the paper, the scaling exponents are determined by the (effective) dimension of the data manifold. So, scaling might slow down when the model has mastered the lower-dimensional “width axis of the pancake,” and remaining gains lie only in the higher-dimensional “height axis of the pancake.”

]