“Textbooks Are All You Need” was published yesterday by Microsoft Research. It’s the worst-named paper I’ve seen recently: it’s not about textbooks, it’s not all you need, and gratuitously imitating the title of a paper that introduced a different type of thing is dumb. But there’s a reason I’m writing about it.
What they did was basically this:
1. started with The Stack (a 3 TB collection of code) and text from StackOverflow
2. used an LLM to select 6B “high-quality” tokens from (1)
3. used GPT-3.5 to generate 1B tokens of text similar to textbooks
4. trained a small (1.3B-parameter) model (“phi-1”) on (2) and (3)
5. used GPT-3.5 to generate text similar to textbook exercises
6. fine-tuned phi-1 on (5)
7. tested phi-1 on HumanEval to evaluate its programming ability
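In pseudocode-ish Python, the recipe reads something like this (every function name here is a hypothetical stand-in, not the authors’ actual tooling):

```python
# Hypothetical paraphrase of the phi-1 recipe above; nothing here is the
# authors' real code, just the shape of the pipeline.

def build_phi1(stack_corpus, judge_llm, gpt35, train, finetune):
    filtered = [doc for doc in stack_corpus if judge_llm(doc)]  # (2) select high-quality tokens
    textbooks = gpt35("write textbook-style coding text")       # (3) synthetic textbook text
    exercises = gpt35("write textbook-style exercises")         # (5) synthetic exercises
    model = train(filtered + textbooks, params=1.3e9)           # (4) pretrain the small model
    return finetune(model, exercises)                           # (6) fine-tune on exercises
```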
The results were pretty good, better than models 10x the size trained on 100x the data. So, it seems that scaling up isn’t the only thing that matters, and data quality can be more important than data quantity or parameter count. (You hear that, gwern?)
Going by the listed OpenAI API prices, running GPT-3.5 on The Stack to evaluate quality would’ve been maybe ~$6M. What the authors did instead was:
Use GPT-4 to evaluate a small fraction of it.
Use a much smaller code-specific model to generate embeddings.
Train a classifier on the embeddings to predict which documents GPT-4 would rate as high-quality.
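A toy version of that trick, with made-up embeddings and a deliberately simple nearest-centroid classifier standing in for whatever the authors actually used:

```python
import numpy as np

# Toy illustration of the cost-saving trick: score a small sample with an
# expensive judge, embed everything cheaply, then train a simple classifier
# on the embeddings to extrapolate the judge. All numbers are invented.

rng = np.random.default_rng(0)

# Pretend embeddings: "good" docs cluster around +1, "bad" around -1.
good = rng.normal(+1.0, 0.5, size=(500, 16))
bad = rng.normal(-1.0, 0.5, size=(500, 16))

# The expensive judge (stand-in for GPT-4) labels only a small sample.
sample_good, sample_bad = good[:25], bad[:25]

# Nearest-centroid classifier trained on the judged sample.
c_good = sample_good.mean(axis=0)
c_bad = sample_bad.mean(axis=0)

def predict(x):
    # True where an embedding is closer to the "good" centroid.
    return np.linalg.norm(x - c_good, axis=1) < np.linalg.norm(x - c_bad, axis=1)

# Extrapolate to the full (unjudged) corpus.
acc = (predict(good).mean() + (~predict(bad)).mean()) / 2
print(f"accuracy on unjudged docs: {acc:.2f}")
```

The point is just that 50 expensive judgments can stand in for millions, as long as the cheap embeddings separate the classes.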
How about if you bootstrap a model using its own evaluation for filtering? One of the authors says “I’m almost sure you can beat the teacher model” and I agree. That can give you recursive self-improvement of a type you see in both individual people and the culture of societies. People develop better taste and consume better content which makes them smarter so they develop better taste, and so on. Children hear the stories their grandfathers like, and culture develops.
That’s a weak sort of self-improvement, and in people it tends to plateau: humans do other things besides consuming better content, so for them it’s weaker than it appears. This is a technique I previously spent some time thinking about, so there are some other reasons I think it tends to plateau by itself. But still—recursive self-improvement!
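The loop, and why it might plateau, can be shown with a deliberately crude toy model (all the numbers, and the rule that better data shrinks the model’s scoring noise, are invented):

```python
import numpy as np

# Toy bootstrap loop: documents have a latent quality q; the model scores
# them with noise; training on higher-quality data shrinks that noise.
# Selected-data quality improves each round, then levels off.

rng = np.random.default_rng(0)
q = rng.uniform(0, 1, size=10_000)  # latent quality of each document

sigma = 0.5  # initial scoring noise of the model
means = []
for _ in range(6):
    scores = q + rng.normal(0, sigma, size=q.size)
    selected = q[np.argsort(scores)[-q.size // 2:]]  # keep top half by score
    means.append(selected.mean())
    sigma = 1 - means[-1]  # invented rule: better data -> sharper model

print([round(m, 3) for m in means])
```

Even with perfect feedback the loop converges to a fixed point well short of selecting only the best data, which matches the intuition that this kind of self-filtering plateaus on its own.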
Yes, in theory, if you have a much bigger model trained on a bigger dataset including the good selected data, and you can engineer prompts such that you get into a mode that models the good data specifically, then you can get the same results. In that sense, the performance reachable with this method is limited to what’s possible from model scaling plus prompt engineering. The amount of scaling needed for that seems to be potentially 100x, and getting into exactly the right mode with prompt engineering might be impractical, but still, that provides some rough limits on potential here.
gwern’s take on a similar paper (Tinystories), in case anyone was wondering. Notable part for me:
Apparently someone didn’t actually read my scaling hypothesis essay (specifically, the parts about why pretraining works and the varieties of blessings of scale). I have been pointing out for a long time that NNs are overparameterized and almost all training data is useless (which is a big part of why RL will be important, because RL lets you make the right data, or see meta-learning or dataset distillation or active-learning or core-sets...), and the reason scaling works is that all the prior approaches throw the baby out with the bathwater—the bitter lesson of Internet scrapes working so well is that we were too stupid to handengineer selecting only the right data, so if you admit defeat and train on all the data, at least you can’t screw that up. The optimal amount of data, whether natural or synthetic, you need to train an AGI will be many orders of magnitude smaller than the amount the first training run will actually use; this is one of the most important overhangs.
We are getting smarter about choosing the right data, but it’s far from obvious that we’re smart enough yet...
I would mention “The False Promise of Imitating Proprietary LLMs” as an example of this: people thought they could take a quick cheap shortcut and clone GPT-3.5/4 by behavior-cloning its outputs, but they were operating under false premises—RLHF doesn’t magically teach a model lots of things it didn’t know before; all RLHF does is collapse the POMDP by hardwiring a few specific latent values, i.e., specialize a model down to things it already knows & provide a more user-friendly interface to what a model could already do. So, you get a sugar-rush from imitation-learning on RLHFed models which optimizes them for the benchmarks, but you don’t get the general broad capabilities you actually wanted, and that’s why—for all the goldrush hype declaring small open models king and how OA has no moat etc.—you don’t see people using them all that much or in interesting ways outside of the benchmark tasks which no one cares about.
(This is the same mistake people made with GPT-3: “what benchmarks miss” is the broad flexible generalization and universal knowledge, and so if you say “GPT-3 isn’t important because smaller specialized finetuned models match its zero-shot”, or if you say “LLaMA is important because it’s a small specialized finetuned model which matches GPT-3.5’s zero-shot”, those benchmark numbers may be true, but they’re still irrelevant to what users really want.)
So, we’ll see. But at least training on textbooks is more plausible in terms of eliciting ‘what benchmarks miss’ than training on kindergarten-level fiction stories!
Your “scaling hypothesis essay”? I was thinking of stuff like this thread.
https://gwern.net/scaling-hypothesis
I know how to use google. The linked tweet is gwern liking MLP-Mixer even more than transformers, because it’s simpler, and all you need is a further-simplified architecture plus even more scale. But a couple years later, ViTs seem to be a lot better than that, and hybrid convolutional-transformer systems are more efficient than those.
I don’t know how that’s relevant. Liking MLP-Mixers doesn’t show that I think that datasets right now are optimal-sized and cannot be made much smaller, nor does it show that I didn’t argue the latter when this was a big part of my Tool AI essay and my explanation for why GPT-3 pretraining could work.
But, since you want to bring it up: I stand by that tweet. What I said then remains true today, as far as I know:
Arguments from silence are only compelling if there ought to be a lot of noise. Nor am I particularly worried that it’s been all of 2 years and we haven’t thrown out all the Transformers in favor of some more MLP-esque architecture:
architecture changes, as obvious and simple as they may seem in hindsight, can take an awful long time.
For example, the architectural tweaks that made deep fully-connected archs work and brought stuff like MLP-Mixer back to the mainstream, despite being trivial on the level of ‘divide by a constant’, nevertheless took something like 7 years to be invented after the early studies showing ‘fully-connected layers don’t scale’. This is pretty quick compared to many things—residual layers have been around since ~1988 before their 2014 reinvention, and most of the Bitter Lesson examples took decades. So, I’ll start worrying in about, oh say, a decade. (A better counterargument here would be, ‘perhaps they’ll win in the long run, but in the long run, we’re all dead’.)
there is no strong evidence against MLP-style approaches thus far; there have been no airtight theoretical proofs nor large-scale empirical benchmarkings showing them flatlining.
The available scaling laws, in fact, look pretty similar, like in Tay et al 2022. Considering how vastly less effort has gone into MLP-Mixers, to the point where Tay et al 2022 has to benchmark it only on encoders “since Mixers have not been used in autoregressive decoding”, and how we know that bad hyperparameters & simple arch flaws can destroy scaling, I consider it quite promising that its scaling curve already looks reasonable and beats some of the others.
I would also note that this is what you would expect of a more general architecture: starting off with a worse constant, but similar or better scaling, which means people neglect it in favor of what currently scores the highest on benchmarks (like ViT). Nothing new there! That’s how Transformers for images worked back when I started heretically suggesting CNNs might be eventually beaten by Transformer-CNN hybrids and someday soon, even all-Transformer image models, as the CNN inductive bias will wash out and we eventually start using multi-billion-parameter image models trained on billions of images. So if you are going to cite ViT, I would remind you that the idea of a ViT could’ve been criticized for exactly the same reasons c. 2019: “it’s been 2 years after Vaswani, but CNNs, and hybrid-convolutional-Transformer systems, seem to be a lot better and more efficient than Transformers”...
there is extensive lockin and lack of experimentation with variants at scale.
Look at how everyone is still using BPEs, despite the disaster they have been for text2image models and their increasingly bizarre consequences, like GPT-3 models that cannot genuinely rhyme but also will refuse to write non-rhyming poems. If OA can’t even move away from BPEs, then I’m not holding my breath for bigger novelties. (Stuff doesn’t just happen on its own, you know. Someone has to do it.)
as Transformers scale, they become ever more just ‘MLP archs with some self-attention’, and yet, they continue to work well and scale as usual. This should trouble people who believe self-attention is magic!
Roller points out that in OPT, the share of FLOPs spent in a model goes from 40% MLP at GPT-2 scale to 80% MLP at GPT-3 scale. Presumably if you kept scaling up Transformers, the self-attention would just keep getting smaller… which raises the question of, in the long run, do you really need any explicit self-attention, or is whatever self-attention does mostly useful at small model sizes, and can be done by a more uniform large architecture?
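A back-of-the-envelope FLOP count shows the direction of that trend (the constants here are simplified, so the exact 40%/80% figures won’t be reproduced):

```python
# Rough per-token, per-layer multiply-accumulate counts for a standard
# Transformer block, d = model width, n = context length:
#   MLP: two matmuls d -> 4d -> d              ~ 8*d*d
#   attention projections (Q, K, V, out)       ~ 4*d*d
#   attention scores + weighted sum            ~ 2*n*d
# Since d grows much faster than n as models scale, the d*d terms dominate
# and the MLP fraction rises.

def mlp_share(d, n):
    mlp = 8 * d * d
    attn = 4 * d * d + 2 * n * d
    return mlp / (mlp + attn)

print(mlp_share(768, 1024))    # GPT-2-ish config
print(mlp_share(12288, 2048))  # GPT-3-ish config
```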
I’d also point out the scaling studies of Nie et al 2021, where self-attention outperforms MLPs… but not by much, decreasing by scale, and even a tiny amount of self-attention essentially closes the gap. (And Liu et al 2021.)
In the other direction, MLP layers can be trained to imitate realistic self-attention layers with high accuracy: “Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers”, Bozic et al 2023:
I’m not impressed by that paper:
...
Such short lengths make simple memorization work better, reducing architectural advantages.
Their Transformer baseline BLEU score for English-to-French translation is 0.276, vs 0.43 for NLLB-200. Yet their modified designs were strictly worse than their Transformer baselines.
It’s increasing architecture complexity.
People have tried them. You just don’t get published unless you show progress.
You think you know something about tokenizers that OpenAI et al don’t, huh? Yes, current tokenizers have some problems, but I can tell you why they were used instead of something simpler: because the overall performance was better. Perhaps something like Meta’s MegaByte will replace them, but that’s not a design you’d suggested.
I know what the self-attention does and the answer is “no”. I will not be posting an explanation until something close enough and not too obscure is published.
ViTs aren’t increased architecture complexity compared to what they replaced.
I see.
Yep. I know from talking to OAers that they did not know the consequences of choosing BPEs on things like rhyming or anagrams. Other people are ignorant too; even computer poetry people don’t know it, e.g. in April Cynthia Rudin’s comments on her old GPT poetry research show she still doesn’t understand why GPT-2 wouldn’t rhyme or why she’s also wrong when she claims ChatGPT can rhyme (or, for that matter, why the oddities of GPT poetry do not provide a good justification of the need for her interpretability work: because I didn’t need any interpretability research to figure out BPEs were the problem, her methods have not figured it out yet, and it’s far from obvious that any of her interpretability techniques would’ve diagnosed the problem given more work put into them).
And this is normal. Inventing something doesn’t mean you know everything about it. You shouldn’t be surprised that users of OA’s stuff learn things before OA does; why would OA know as much about GPT poetry as I do? No one there researches GPT poetry. So it’s no surprise that they weren’t looking for reasons why GPT-2/3 couldn’t rhyme. More importantly, OA has been as surprised as anyone by things like inner-monologue. OA doesn’t know many things about its models, like the reason for the ‘unspeakable tokens’ or why DALL-E 2 couldn’t do anime*. Or consider how CLIP or ChatGPT took off. No, OA is certainly not omniscient. (Which is part of why OA has been talking about revisiting tokenization in order to better support foreign languages, which are harshly punished by current BPEs: even if you massively expand the BPE vocab to a million, you’re still handling synthetic languages poorly.)
I couldn’t’ve suggested MegaByte because it’s not meaningfully different from many other local->global Transformer variants, other than tailoring it slightly to byte encoding, and those have all proven to be flawed on LRA and/or scale poorly (eg Performer in that Tay paper before). If I wanted to point to a good proof of concept, I’d point to ByT5 still and also the text2image work showing how the economically-valuable task of generating images with arbitrary text is sabotaged by non-byte encodings even at PaLM scale. Which Google didn’t know even though they invented the image models in question. :thinking_face:
Byte encoding is my favored encoding in the long run and which I do expect to eventually take over, but I don’t know what the architecture is going to look like there or if byte-level tokenization will even require a ‘solution’ to giant context windows. Something else may win there, like a retrieval mechanism or going back to recurrency in a Perceiver-esque way.
I don’t think they have ablated tokenizations, and I definitely do not think they have proper benchmarks which would even tell them the answer if they did.
I see.
* DALL-E 2 anime was another case where OA insiders blew me off until I compiled enough cases that they had to admit there was something anomalous there and seem to have done a little bit about it given an early upgrade and then DALL-2-exp. Unfortunately, they never confirmed whether my CLIP-censoring hypothesis was correct. On the bright side, at least the DALL-E 2 paper acknowledged the damage done by BPEs.
Sorry, I wasn’t clear enough, or maybe I misunderstood your position.
I saw you liked MLP-mixer type designs because they’re simpler, and per your tweets and comment above, you seem to think larger models should need less complexity.
I consider complex network designs for training and for inference to be the same type of complexity, and “textbooks are all you need” is then evidence for more structural complexity being helpful.
You clarified things a bit here: apparently the basis for your position is that “architectural complexity is generally just inductive bias that becomes less important as more-flexible capabilities increase”. I disagree with that model...or at least think you’re taking it too far, misapplying heuristics you developed from seeing eg AlphaZero beating hardcoded chess heuristics.
Depending on the details of your position, “textbooks are all you need” may or may not be evidence against it—do you consider such a data → evaluation → training structure “inductive bias obviated by scaling”?
Karpathy seemed to understand the issues. He went to OpenAI from Tesla somewhat recently, but I’d include Tesla in “OpenAI et al” and you were implying that all of those major AI labs (“everyone”) didn’t understand that.
This Tay paper? I see a bunch of sparse attention approaches in that survey, but I don’t see what MegaByte does in there. Maybe I missed an earlier paper, but the Performer design is completely different. MegaByte is certainly simpler than much of that stuff, but I guess people were looking in the wrong direction.
I do think byte encoding works well with MegaByte type designs—as people have already found in testing.
I guess we disagree about that.
Speaking of MLPs and how supposedly they don’t scale & silently fail whenever anyone tries, this just dropped on Arxiv: “Scaling MLPs: A Tale of Inductive Bias”, Bachmann et al 2023 (excerpts)
Much like the others, MLPs just need more regularization and/or data because of their greater capacity/less hardwired inductive bias. (By the way, note that they investigate Chinchilla scaling to see if 1:1 is optimal like for Transformers; it is not, and MLPs require more data per parameter because they are more powerful. Good news for MLPs...)
Oh no, there are many more examples than ‘just’ ViTs and MuZero (and machine translation). ‘The curves cross’ goes back at least to Highleyman in the 1960s with nearest-neighbors image classification, and you could add the original Breiman tabular ML movement in the 1990s which focused on beating logistic regression. (I would also highlight my prediction that despite an almost unquestioned consensus post-diffusion models asserting GANs had failed due to intrinsic and possibly unfixable flaws, that they would do fine if scaled up.)
This isn’t a point in favor of architectures and ‘structural complexity’, but data. The question, of course, is to what extent it’s the right data and is not building in covertly (the way so many past failed methods do) expert hand-engineered priors/inductive-biases which earns buzz-clicks-cites right now but will ultimately hold back models compared to just ‘learn all the things!’ approaches to base models. It’s never really worked before, but people greet every new ‘our model beats GPT-3 with 1/100th the parameters’ research paper as the messiah—not that you remember any of those from 2020, or 2021, and 2022-2023 aren’t looking so hot either...
(It’s worth remembering that people, and especially academics, want it to be true that small cheap complicated models+data+algorithms, which require loving human expertise and curation and lots of published papers while costing little $, can solve all their problems; there are very few people who genuinely prefer large expensive simple models with hilariously-dumb architectures but which cost $$$. It is an acquired taste, to be sure. Like being kicked in the nuts. “Please sir, Mr Bitter Lesson, may I have another lesson that renders 2 more years of my life a complete waste of time?”)
Yes. Consider how an arch like MuZero turns compute into performance: it does so via data. Or consider what you spend extra compute on, as Bottou has been pointing out since the mid-2000s: you spend it on processing more data. SGD go brrrr.
You linked a 2023 tweet by Karpathy who has been reading me rant regularly about BPEs for at least 3 years now (note the first reply, BTW), and I’ve even discussed it with him in person! He didn’t think that before I did, so your example shows the opposite of what you take it to mean.
My point there is that all of the linear-attention Transformer variants seem to fail in some way and show bad scaling, like Performer does when explicitly investigated, and I expect the rest to also show bad scaling because none of them dominate Performer.
Byte encoding works well with non MegaByte type designs too...
The “baseline” comparisons in that paper are pretty funny; they messed with existing models in such a dumb way I think it was intentional. Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage gets bigger with higher resolution than 64x64. Do you read the stuff you link to?
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person. Then there are some people who just don’t want to think about design anymore and consider that an excuse to stop thinking about it...which I guess I’m fine with. Yeah, you do that.
I see; somehow I thought he was smarter than that. The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
You are again missing the point: as I already explained, we expect the constant to be worse. (If the constant was better in addition to a similar exponent, we would not be debating ‘will MLPs someday replace Transformers’, because it would already be happening.)
I will also point out that there is a very big difference between ‘it literally doesn’t work, everyone’s tried it and it doesn’t work, it doesn’t work so badly they won’t even publish anything I can cite to prove that MLPs are idiotic and have zero chance of ever going anywhere and and and -’ and ‘OK fine yeah they work and scale smoothly in the usual way but look this paper’s basic implementation is slower & lower accuracy at small scale than good baselines from NNs with more hardwired inductive biases zomg do you even read the stuff you link’. (That whoosh you hear is the much-abused goalposts moving at the speed of submitting a comment.)
Anyway, I’ve fleshed out one idea I have of how a scaled-up MLP architecture could surpass Transformers.
You are seeing a selected sample. Don’t look at your social media feed, look at what people do. Look at the allocations of supercomputer time. Look at a single day of Arxiv papers (not the AKs, the actual site). There is not much research that takes it seriously or does research that will have long-term value, like running scaling law sweeps; it’s almost all stuff which relies on a fundamental denial of scaling & Bitter-Lesson-like logic, full of special-case tweaking or proving irrelevancies or creating a complicated architecture which saves a small constant-factor etc. Look at the Best Paper awards. Look at the surveys where the overwhelming majority deny scaling works even in principle (while also, interestingly, believing they are the brave minority, which they are not).
(Wow, way to just throw Karpathy under the bus.)
I wonder if anyone other than Karpathy has been reading my observations about BPEs all these years...
It’s not strictly linear in memory, but it’s doing the same thing of reducing the quadratic by some degree using a local->global hierarchy, as so many attention variants have done, and could reduce it to effectively linear (since stuff like n log n is effectively just n). So MegaByte is stuck: either it continues being mostly vanilla over fixed-sized patches and eventually that sub-quadratic becomes expensive as the global sequence length & model must scale up, or it reduces the growth to linear by expanding the hierarchy further (adding in additional layers of ‘patches’) and just becomes another variant of local->global like Swin Transformer etc. (I thought they should’ve done a better job contextualizing it, in addition to benchmarking on LRA.) Maybe MegaByte somehow threads the needle just right with its particular set of tweaks, and will be the attention variant which finally cracks the nut, but I’m not too optimistic. /shrug. What would get my attention is wins on LRA benchmarking or exponents in scaling law sweeps like Tay. We’ll see.
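A schematic cost model of that local->global tradeoff (not MegaByte’s actual accounting, just the generic patch-hierarchy arithmetic):

```python
# Split a length-n sequence into n/p patches of length p, run quadratic
# attention globally over the patches and locally within each patch.

def hierarchical_cost(n, p):
    global_cost = (n // p) ** 2    # attention over n/p patch embeddings
    local_cost = (n // p) * p * p  # attention within each length-p patch
    return global_cost + local_cost

n = 2 ** 20                        # byte-level sequence length
flat = n ** 2                      # vanilla quadratic attention
best = min(hierarchical_cost(n, 2 ** k) for k in range(1, 20))
print(best / flat)
```

At the optimal patch size the total lands around n^(4/3): far below quadratic, but still growing faster than linear, which is the "stuck between sub-quadratic and linear" point above.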
Have fun playing with MLPs. I’m not trying to stop you, I’m just stating my position for audience members who understand it.
Might be good to post a hashed claim.
I think it would also be interesting if you could factor the models into
smaller models that represent reliably known knowledge well, such as this textbook model, and
models that sample far and wide but wouldn’t need to reproduce all the details in the more optimized models.
Something I have not seen yet but hope exists and I just missed it: some notion of how much structure a given type of data can hold. For example, it seems obvious that in a comparison of math versus English, math would have much more structure available. Or maybe, since in principle you can describe all of math in English, what I really mean is structure density, because math is basically nothing else and English is a lossier map to the same thing and more besides.
I feel like if we had this concept, we could milk some gains from making predictions or at least assumptions about how much structure we should expect in unstructured data of various kinds.
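One crude stand-in for that notion is compressibility: more internal structure and redundancy means a better compression ratio. This is only a Kolmogorov-complexity-flavored proxy, not a real measure of mathematical structure, but it at least orders obvious cases correctly:

```python
import os
import zlib

# Fraction of bytes saved by zlib at max compression: a rough proxy for
# how much exploitable structure a blob of data contains.

def compressibility(data: bytes) -> float:
    return 1 - len(zlib.compress(data, 9)) / len(data)

structured = b"0123456789" * 200            # highly regular "data"
prose = (b"the quick brown fox jumps over the lazy dog and then "
         b"wanders off to think about something else entirely ") * 20
noise = os.urandom(2000)                    # no structure at all

print(compressibility(structured), compressibility(prose), compressibility(noise))
```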