How Big a Deal are MatMul-Free Transformers?

If you’re already familiar with the technical side of LLMs, you can skip the first section.
The story so far
Modern Large Language Models—your ChatGPTs, your Geminis—are a particular kind of transformer, a deep learning architecture invented about seven years ago. Without getting into the weeds, transformers basically work by turning an input into numbers, and then doing tons and tons of matrix operations on those numbers. Matrix operations, and in particular matrix multiplication (henceforth MatMul), are computationally expensive. How expensive? Well, graphics cards are unusually good at matrix multiplication, and NVIDIA, the main company making these, was the most valuable company on Earth earlier this month.
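To make that expense a bit more concrete, here's a rough illustration of my own (not from the paper): multiplying an n×k matrix by a k×m matrix costs on the order of 2·n·k·m floating point operations, and a transformer does this over and over, for every layer and every token.

```python
# Rough illustration (my own, not from the paper): FLOPs in one dense layer's MatMul.
# Multiplying an (n x k) activation matrix by a (k x m) weight matrix costs roughly
# 2*n*k*m floating point operations (one multiply plus one add per term).

def matmul_flops(n: int, k: int, m: int) -> int:
    return 2 * n * k * m

# Example: a batch of 2048 tokens hitting one 4096 x 4096 weight matrix
# (sizes chosen purely for illustration).
print(f"{matmul_flops(2048, 4096, 4096):.3e} FLOPs for one layer's projection")
# ~6.9e10 FLOPs -- and a large model has dozens of layers, each with several
# such projections, applied to trillions of training tokens.
```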
Over the last few years, spurred on by extreme investment, transformers have gotten larger and stronger. How good transformers are is multidimensional, and is roughly captured by scaling laws: basically, models get better when you give them more (high quality) data, make them bigger, or train them for longer.
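For intuition, the scaling laws in question are usually written as a parametric fit something like the Chinchilla-style form below (my summary of the standard formulation, not anything specific to this paper):

$$L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where L is the model's loss (lower is better), N is the number of parameters, D is the number of training tokens, and E, A, B, α, β are constants fitted empirically: more parameters or more data pushes the loss down, with diminishing returns.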
I’ve written before about the data wall, the hypothesis that we’re running out of new data to train cutting edge AI systems on. But another path to much stronger AI would be if we trained them more efficiently: if you have to do way fewer (or way easier) mathematical operations when training an AI, you can do a lot more training on the same (gigantic) budget.
Basically, holding training data constant, if you can train a model twice as efficiently, you can also make it twice as big.[1] Which is a big deal in a world where there may be bottlenecks for other ways to make better AI: if it isn’t the data wall, it may well be a wall of regulation preventing the insane power consumption requirements of a trillion-dollar cluster.
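As a back-of-the-envelope sketch (mine, and only a first-order approximation, per the footnote): a common rule of thumb is that training compute is roughly 6 × parameters × tokens, so halving the cost per operation buys you roughly twice the parameters on the same data and budget.

```python
# Back-of-the-envelope only (my illustration, not from the paper):
# a common approximation is training FLOPs ~= 6 * N * D,
# where N = parameter count and D = training tokens.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

budget = train_flops(10e9, 2e12)   # e.g. a 10B-parameter model on 2T tokens
print(f"baseline budget: {budget:.2e} FLOPs")   # 1.20e+23

# If a new method halves the effective cost per operation, the same budget
# covers roughly twice the parameters at fixed data (to first order only):
print(f"params affordable at 2x efficiency: {budget / (0.5 * 6 * 2e12):.2e}")
# 2.00e+10, i.e. roughly a 20B model instead of 10B
```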
Cutting edge labs are in an intense race to make transformative AI, so we don’t know what kinds of efficiency advances they’ve been making for the past few years. But there has been hubbub the last few weeks about a new kind of model, which avoids the need for MatMul.
So, what’s the deal? Is the new research a flash in the pan, a small incremental win, or a bold new paradigm we’ll all be citing (and making birthday posts for) in seven years?
I’ll make a brief examination of the paper’s claims and why they’re exciting, then give reasons for restraint.
What’s the new architecture?
The new paper, by Rui-Jie Zhu et al., is Scalable MatMul-free Language Modeling. It came out on June 4th. In the places where a typical transformer would do MatMul, the paper instead does something different and more akin to addition. The technical details are pretty complicated, but the intuition that addition is easier/simpler than multiplication is spot on.
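To give a flavor of that intuition, here's a simplified sketch of my own of the general ternary-weight idea this line of work builds on (not the authors' exact method): if every weight is constrained to be -1, 0, or +1, a matrix multiplication collapses into selective additions and subtractions, with no multiplies at all.

```python
import numpy as np

# Simplified sketch (mine, not the paper's exact method): with weights constrained
# to {-1, 0, +1}, the inner product needs no multiplications -- each weight just
# adds, subtracts, or skips the corresponding activation.

def ternary_matvec(W_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    out = np.zeros(W_ternary.shape[0])
    for i, row in enumerate(W_ternary):
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:
                acc += xj      # add
            elif w == -1:
                acc -= xj      # subtract
            # w == 0: skip entirely
        out[i] = acc
    return out

rng = np.random.default_rng(0)
W = rng.choice([-1, 0, 1], size=(4, 8))
x = rng.standard_normal(8)

# Matches the ordinary MatMul result, using only additions and subtractions.
assert np.allclose(ternary_matvec(W, x), W @ x)
```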
How much better is their new method? Here’s the relevant graph from their paper:
The star is the hypothetical point where you’d get equal bang for your buck from their MatMul-free approach and the current (public) SOTA[2]: a little under 10^23 floating point operations (FLOPs). To quote them on the significance of this number:
Interestingly, the scaling projection for the MatMul-free LM exhibits a steeper descent compared to that of Transformer++. This suggests that the MatMul-free LM is more efficient in leveraging additional compute resources to improve performance. As a result, the scaling curve of the MatMul-free LM is projected to intersect with the scaling curve of Transformer++ at approximately 10^23 FLOPs. This compute scale is roughly equivalent to the training FLOPs required for Llama-3 8B (trained with 15 trillion tokens) and Llama-2 70B (trained with 2 trillion tokens), suggesting that MatMul-free LM not only outperforms in efficiency, but can also outperform in terms of loss when scaled up.
Basically, if they’re right about their scaling laws, their proposed architecture would become both more efficient and more effective at current high-end industrial levels of investment, and strongly more efficient in the future.
So yes, this is a pretty big deal. It doesn’t seem to be a hoax or strongly overhyped. If I were a top lab, and my own secret sauce wasn’t obviously better than this, I’d want to look into it.
What are the limitations?
There are several. Broadly:
The current architectural paradigm is complicated and expensive to change
The new approach hasn’t been tested (publicly) with very large model sizes
The new approach hasn’t been tested against the actual state of the art
We’ll take it from the top.
The transformer architecture is sticky
Cutting edge AI is a dance between software and hardware. Some particular software process gets good results. Whatever hardware happens to run that process best is now in demand, spurring investment both in whoever came up with the software process and whoever manufactures the relevant hardware. The hardware manufacturers optimize their hardware even better for the software, and the software developers optimize their software to leverage the new-and-improved hardware even better.[3]
For several years now, MatMul has been favored, which means GPUs that are good at MatMul are favored, and those GPUs are optimized to be extra good at—you guessed it—MatMul. Even if this research result is totally correct and a MatMul-free architecture would perform better in the abstract, many different stakeholders would have to get together to make it happen.
And it’s not just hardware! The qualified engineers, too—of whom there is a rather limited supply on the entire planet, fueling their very high salaries—have honed their instincts on the current paradigm. There’s a lot of new math to learn, then master, then make absolutely second nature. Math like this (from the paper):
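For a taste: the ternary weight quantization at the heart of the BitLinear-style layers this architecture builds on looks roughly like the following (my paraphrase of the standard formulation, not a verbatim excerpt from the paper):

$$\widetilde{W}_{ij} = \mathrm{RoundClip}\!\left(\frac{W_{ij}}{\gamma + \epsilon},\, -1,\, 1\right), \qquad \gamma = \frac{1}{nm}\sum_{i,j} \lvert W_{ij} \rvert$$

where RoundClip(x, a, b) = max(a, min(b, round(x))), so that every weight ends up as -1, 0, or +1.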
Nothing the most talented machine learning engineers in the world can’t figure out, but there’s a difference between understanding and deep understanding. The actual state of the art sometimes advances by yolo runs, and the powerfully honed instinct behind these runs isn’t developed overnight.
So even if this new architecture is a superior alternative, it might take a long time before its benefits outweigh its costs in practice. At least until we start hitting some walls, and architectural shifts are the only way forward.
Not tested at scale
The authors list this limitation in their conclusion, along with an exhortation for top labs to give their new architecture a try, like so:
However, one limitation of our work is that the MatMul-free LM has not been tested on extremely large-scale models (e.g., 100B+ parameters) due to computational constraints. This work serves as a call to action for institutions and organizations that have the resources to build the largest language models to invest in accelerating lightweight models.
Fair enough! And indeed, outside of major labs (which keep their lips zipped), you’re not going to find the compute you need to test this stuff at 100B scale. But part of the reason transformer scaling laws are so important is that they have been proven to work over quite a large range. With a new architecture, you can’t totally assume that new, different scaling laws will be equally ironclad.
Nor would it be that easy to test, even for top labs. Training huge models is expensive, and doing it right would likely require new hardware and lots of retraining for many of the most in-demand employees in the world. It’s not easy to see if or when that’ll be a priority.[4]
Not tested vs. cutting edge
This objection feels a little mean, but while the MatMul-free paper does test its new architecture against a strong open source implementation, it doesn’t benchmark itself against the cutting edge in absolute terms, because the real cutting edge is under lock and key at top labs.
Of course, you have to measure against what you have, and it’s not like they are pumping up their numbers by comparing to obviously outdated architecture. But they aren’t proving that their new methods do well against the current best AI models. Which, again, is a reason for some doubt.
So… sell NVIDIA?
This is so not investment advice, but I mean, I haven’t.
I do think it’s an exciting sign (or a scary one) that we’re seeing research, in public, that suggests fundamental improvements to the transformer architecture. Whether they truly are improvements, and whether those improvements can overcome years of inertia, remains to be seen.
In the meantime, provisionally, it’s time for new Anki cards.
[1] Okay, not literally, since computational complexity doesn’t scale linearly with model size. But if you thought of that, why didn’t you skip the introduction?
[2] They represent the current state of the art with the architecture of Llama 2, a not-exactly-cutting-edge but pretty good (and relatively open source) model.
[3] Notably, one of the things Zhu et al. do in the paper is build custom hardware to better support their new architecture. Doing this as a proof of concept is really exciting, but very different from changing industrial processes at scale!
[4] Or maybe it was already a priority two years ago, in secret, and all the top players are already doing it! Simply no way to know, though we might have seen reverberations in NVIDIA chip design, if that were true.