The ‘Bitter Lesson’ is Wrong
There are serious problems with the idea of the ‘Bitter Lesson’ in AI. In most cases, things other than scale prove extremely useful for a time, and are then promptly abandoned as soon as scaling catches up to their level, when we could just as easily combine the two and get still better performance. Hybrid algorithms are good for all sorts of things in the real world.
For instance, in computer science, quicksort is easily the most common sorting algorithm, but who uses a pure quicksort? Instead, implementations add on an algorithm that changes the base case, handles lists with small numbers of entries, and so on. People could have learned the lesson that quicksort is simply better than these small-list algorithms once you reach any significant size, but that would have prevented improving quicksort.
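As a concrete illustration (my own sketch, not any particular library’s code), here is roughly what such a hybrid looks like: quicksort that hands small ranges to insertion sort and picks a median-of-three pivot. The cutoff of 16 is illustrative, not tuned.

```python
# A minimal sketch of a hybrid quicksort: recurse with quicksort on large
# ranges, but hand small ranges to insertion sort, which is faster there.

SMALL_CUTOFF = 16  # illustrative threshold, not a tuned value


def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place; cheap for short ranges."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def hybrid_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        if hi - lo + 1 <= SMALL_CUTOFF:
            insertion_sort(a, lo, hi)      # the "impure" small-list helper
            return
        # Median-of-three pivot choice, another common non-pure addition.
        mid = (lo + hi) // 2
        pivot = sorted((a[lo], a[mid], a[hi]))[1]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Recurse on the smaller side, loop on the larger (limits stack depth).
        if j - lo < hi - i:
            hybrid_quicksort(a, lo, j)
            lo = i
        else:
            hybrid_quicksort(a, i, hi)
            hi = j
```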
Another, unrelated, personal example. When I started listening to Korean music, it bothered me that I couldn’t understand what they were singing, so for some of my favorite songs I searched fairly extensively for translations.
When I couldn’t find translations for some songs, I decided to translate them myself. I didn’t know more than a phrase or two of Korean at the time, so I gave them to an AI (in this case, Google Translate, which had already transitioned to deep-learning methods by then).
Google Translate’s translations were of unacceptably low quality, of course, so I used it only as a reference for words and short phrases. Crucially, I didn’t know Korean grammar at the time either (and Korean word order differs from English). It took me a long time to translate those first songs, but nowhere near long enough to count as much training on the matter.
So how did my translations of a language I didn’t know compare to the deep learning used by Google? They were as different as night and day, in favor of my translations. It’s not because I’m special, but because I used the methods people discarded in favor of ‘just scale’. What’s a noun? What’s a verb? How do concepts fit together? Crucially, what kind of thing would a person be trying to say in a song, and how would they say it? Plus, checking it backwards: if they were trying to say this thing, would they have said it in that way? I do admit to a bit of prompt engineering, even though I didn’t know what that was at the time, but that just means I instinctively knew that a search through similar inputs would give a better result.
I have since improved massively: by learning how Korean grammar works (I started studying it 8 months later), by learning more about the concepts used in English grammar, by learning what the words mean, and through simple practice, of course. My progress leveled off after a while, but it started improving again when I consulted an actual dictionary; why don’t we give these LLMs real dictionaries? But even without any of that, I could improve on the machine output by using simple concepts and approaches we refuse to supply these AIs with.
Google Translate has since improved quite a lot, but it is still far behind even those first translations, where all I added was a structure of language and thought even simpler than grammar, along with actual grounding in English. Notably, that’s despite the fact that Google Translate is far ahead of more general models like GPT-J. Sure, my scale dwarfs theirs, obviously, but I was trained on almost no Korean data at all; because I had these concepts, and a reference, I could do much better.
Even the example of something like AlphaGo wasn’t so much a triumph of deep learning over everything as a demonstration that if you combine insights like search algorithms (Monte Carlo tree search) with trained heuristics, things go much better.
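To make the shape of that combination concrete, here is a hedged sketch (not DeepMind’s actual code) of a generic tree search whose move priors and leaf evaluations come from a learned model, in the PUCT style. `policy_value(state)`, `state.copy()`, `state.play(move)`, and `state.legal_moves()` are assumed interfaces for illustration only.

```python
import math


class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the learned policy
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}        # move -> Node

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(parent, child, c_puct=1.5):
    # Learned prior biases exploration; visit counts balance it (PUCT rule).
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.value() + u


def mcts(root_state, policy_value, num_simulations=100):
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        node, state, path = root, root_state.copy(), [root]
        # Selection: walk down the tree by the PUCT rule.
        while node.children:
            move, node = max(node.children.items(),
                             key=lambda kv: puct_score(path[-1], kv[1]))
            state.play(move)
            path.append(node)
        # Expansion + evaluation: the learned model replaces random rollouts.
        # (Terminal-state handling is omitted in this sketch.)
        priors, value = policy_value(state)
        for move in state.legal_moves():
            node.children[move] = Node(prior=priors.get(move, 0.0))
        # Backup: propagate the network's value estimate up the path.
        for n in reversed(path):
            n.visits += 1
            n.value_sum += value
            value = -value        # alternate perspective in a two-player game
    # Act by visit count, the standard AlphaGo-style move selection.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

The search algorithm is a human insight; the learned model is the scaled component. Neither alone plays as well as the two together.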
The real bitter lesson is that we give up on improvements out of some misguided purity.
I think you’ve misunderstood the lesson, and mis-generalized from your experience of manually improving results.
First, I don’t believe that you could write a generally-useful program to improve translations—maybe for Korean song lyrics, yes, but investing lots of human time and knowledge in solving a specific problem is exactly the class of mistakes the bitter lesson warns against.
Second, the techniques that were useful before capable models are usually different to the techniques that are useful to ‘amplify’ models—for example, “Let’s think step by step” would be completely useless to combine with previous question-answering techniques.
Third, the bitter lesson is not about deep learning; it’s about methods which leverage large amounts of compute. AlphaGo combining MCTS with learned heuristics is a perfect example of this.
I think your response shows I understood it pretty well. My primary example was one that you directly admit goes against what the bitter lesson tries to teach. I also never said anything about being able to program something directly better.
I pointed out that I used the things people decided to let go of so that I could improve the results massively, for my own uses, over the current state of machine translation, and then implied we should do things like give language models dictionaries and information about parts of speech that they can use as a reference or starting point. We can still use these things as an improvement over pure deep learning, simply by letting the machine use them as a reference. It would have to be trained to do so, of course, but that seems relatively easy.
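A hedged sketch of what ‘give the model a dictionary’ might look like in practice: the model is allowed to emit lookup requests, and we answer them from a real dictionary before asking for the final translation. `ask_model`, the `LOOKUP(...)` convention, and the toy entries are hypothetical placeholders, not any particular API.

```python
import re

# Toy bilingual entries; a real system would load a full dictionary.
DICTIONARY = {
    "노래": "song",
    "마음": "heart; mind",
}


def lookup(word: str) -> str:
    return DICTIONARY.get(word, "(no entry)")


def translate_with_dictionary(lyric: str, ask_model) -> str:
    prompt = (
        "Translate the Korean lyric below. If you are unsure of a word, "
        "write LOOKUP(word) on its own line and wait for the definition.\n\n"
        + lyric
    )
    for _ in range(5):                           # bounded tool-use loop
        reply = ask_model(prompt)
        requests = re.findall(r"LOOKUP\((.+?)\)", reply)
        if not requests:
            return reply                         # model produced a translation
        # Append dictionary entries as reference material and ask again.
        for word in requests:
            prompt += f"\nDictionary: {word} = {lookup(word)}"
    return ask_model(prompt + "\nGive your best final translation.")
```

The point is only the shape: the learned model does the fluent part, and the dictionary supplies the grounding it lacks.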
The bitter lesson is about ‘scale is everything,’ but AlphaGo and its follow-ups use massively less compute to get up to those levels! Their search is not an exhaustive one, but a heuristic one that requires very little compute comparatively. Heuristic searches are less general, not more. It should be noted that I only mentioned AlphaGo to show that even it wasn’t a victory of scale like some people commonly seem to believe. It involved taking advantage of the fact that we know the structure of the game to give it a leg up.
AlphaGo isn’t an example of the bitter lesson. The bitter lesson is that AlphaZero, which was trained using pure self-play, was able to defeat AlphaGo, with all of its careful optimization and foreknowledge of heuristics.
The entire point of what I wrote is that marrying human insights, tools, etc. with scale increases leads to higher performance and shouldn’t be discarded; not that you can’t do better with a crazy amount of resources than with a small amount of resources plus human insight.
Much later, with much more advancement, things improve. Two years after the things AlphaGo was famous for, they used scale to surpass it, without changing any of the insights. Generating games against itself is not a change to the fundamental approach in a well-defined game like Go. Simple games like Go are very well suited to the brute-force approach. It isn’t in the post, but this is more akin to using programs to generate math data for a network you want to learn math. We could train a network on an effectively infinite amount of symbolic math, because we have easy generative programs, limited only by the cost we were willing to pay. We could also just give those programs to an AI and train it to use them. This is identical to what they did for AlphaZero. AlphaZero still uses the approach humans decided upon, not reinventing things on its own.
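For the math-data analogy, here is a minimal sketch of the idea: because we already have programs that know the rules of arithmetic, we can emit an effectively unlimited stream of (problem, answer) pairs, limited only by how much we want to spend. The format and depth parameters are illustrative.

```python
import random
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def random_expression(depth=2):
    """Build a random arithmetic expression string along with its value."""
    if depth == 0 or random.random() < 0.3:
        n = random.randint(0, 99)
        return str(n), n
    op = random.choice(list(OPS))
    left_s, left_v = random_expression(depth - 1)
    right_s, right_v = random_expression(depth - 1)
    return f"({left_s} {op} {right_s})", OPS[op](left_v, right_v)


def generate_dataset(n_examples):
    # Yields training pairs; the generator, not human data, is the bottleneck.
    for _ in range(n_examples):
        expr, value = random_expression(depth=3)
        yield {"input": f"Evaluate: {expr}", "target": str(value)}


if __name__ == "__main__":
    for example in generate_dataset(3):
        print(example)
```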
Massive scale increases surpassing the achievements of earlier things is not something I argue against in the above post. Not using human data is hardly the same thing as not using human domain insights.
It wasn’t until two years after AlphaZero that they managed to make a version that actually learned how to play on its own, with MuZero. Given the rate of scale increases in the field during that time, it’s hardly interesting that it eventually happened, but the scaling required an immense increase in money flowing into the field, in addition to algorithmic improvements.
I’m not sure how any of what you said actually disproves the Bitter Lesson. Maybe AlphaZero isn’t the best example of the Bitter Lesson, and MuZero is a better example. So what? Scale caught up eventually, though we may bicker about the exact timing.
AlphaZero didn’t use any human domain insights. It used a tree search algorithm that’s generic across a number of different games. The entire reason that AlphaZero was so impressive when it was released was that it used an algorithm that did not encode any domain specific insights, but was still able to exceed state-of-the-art AI performance across multiple domains (in this case, chess and Go).
I was pretty explicit that scale improves things and eventually surpasses any particular level that you get to earlier with the help of domain knowledge...my point is that you can keep helping it, and it will still be better than it would be with just scale. MuZero is just evidence that scale eventually gets you to the place you already were, because they were trying very hard to get there and it eventually worked.
AlphaZero did use domain insights, just like AlphaGo. It wasn’t self-directed. It was told the rules. It was given a direct way to play games, and told to. It was told how to search. Domain insights in the real world are often simply being told which general strategies will work best. Domain insights aren’t just things like ‘a knight is worth this many points’ in chess, or whatever the human-scoring equivalent is in Go (which I haven’t played). Humans tweaked and altered things until they got the results they wanted from training. If they understood that they were doing so, and accepted it, they could get better results sooner, and much more cheaply.
Also, state of the art isn’t the best that can be done.
No, that’s not what domain insights are. Domain insights are just that, insights which are limited to a specific domain. So something like, “Trying to control the center of the board,” is a domain insight for chess. Another example of chess-specific domain insights is the large set of pre-computed opening and endgame books that engines like Stockfish are equipped with. These are specific to the domain of chess, and are not applicable to other games, such as Go.
An AI that can use more general algorithms, such as tree search, to effectively come up with new domain insights is more general than an AI that has been trained with domain specific insights. AlphaZero is such an AI. The rules of chess are not insights. They are constraints. Insights, in this context, are ideas about which moves one can make within the constraints imposed by the rules in order to reach the objective most efficiently. They are heuristics that allow you to evaluate positions and strategies without having to calculate all the way out to the final move (a task that may be computationally infeasible).
AlphaZero did not have any such insights. No one gave AlphaZero any heuristics about how to evaluate board positions. No one told it any tips or tricks about strategies that would make it more likely to end up in a winning position. It figured out everything on its own and did so at a level that was better than similar AIs that had been seeded with those heuristics. That is the true essence of the Bitter Lesson: human insights often make things worse. They slow the AI down. The best way to progress is just to add more scale, add more compute, and let the neural net figure things out on its own within the constraints that it’s been given.
No. That’s a foolish interpretation of domain insight. We have a massive number of highly general strategies that nonetheless work better for some things than others. A domain insight is simply some kind of understanding involving the domain being put to use. Something as simple as whether to use a linked list or an array can rest on a minor domain insight. Whether to use a Monte Carlo search or a depth-limited search, and so on, are definitely insights. Most advances in AI to this point have in fact been based on domain insights, and only a small amount on scaling within an approach (though more so recently). Even the ‘bitter lesson’ is an attempted insight into the domain (one that is wrong, because it is a severe overreaction to previous failure).
Also, most domain insights are in fact an understanding of constraints. ‘This path will never have a reward’ is both an insight and a constraint. ‘Dying doesn’t allow me to get the reward later’ is both a constraint and a domain insight. So is ‘the lists I sort will never have numbers that aren’t between 143 and 987’ (which is useful for an O(n) kind of sorting). We are, in fact, trying to automate the process of getting domain insights via machine with this whole enterprise in AI, especially in whatever we have trained them for.
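A small sketch of that bounded-values example: if every value is promised to fall in a fixed range (143 to 987 here), counting sort gives O(n + k) behavior, beating comparison sorts that cannot assume this. The bounds are taken from the example above; everything else is illustrative.

```python
LOW, HIGH = 143, 987   # the promised bounds from the example


def counting_sort_bounded(values):
    """Sort integers known to lie in [LOW, HIGH] in O(n + k) time."""
    counts = [0] * (HIGH - LOW + 1)
    for v in values:
        counts[v - LOW] += 1          # relies entirely on the domain promise
    out = []
    for offset, c in enumerate(counts):
        out.extend([offset + LOW] * c)
    return out


# Example: counting_sort_bounded([987, 143, 500, 500]) -> [143, 500, 500, 987]
```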
Even ‘should we scale via parameters or data?’ is a domain insight. They recently found out they had gotten that wrong (Chinchilla), too, because they focused too much on just scaling.
AlphaZero was given some minor domain insights (how to search and how to play the game), came years later, and ended up slightly beating a much earlier approach, because they were trying to do that specifically. I specifically said that sort of thing happens. It’s just not as good as it could have been (probably).
And now we have the same algorithms that were used to conquer Go and chess being used to conquer matrix multiplication.
Are you still sure that AlphaZero is “domain specific”? And if so, what definition of “domain” covers board games, Atari video games, and matrix multiplication? At what point does the “domain” in question just become, “Thinking?”
A link for those who haven’t read it before: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
I probably should have included that or an explicit definition in what I wrote.
This illustrates how sample efficiency is another way of framing robust off-distribution behavior, the core problem of alignment. Higher sample efficiency means that, on-distribution, fewer data points are needed to fill the gaps; data points can be more spaced out while still conveying their intent. This suggests that at the border of the training distribution, you can get further away without breaking robustness if there’s higher sample efficiency.
And the ultimate problem of alignment is unbounded reflection/extrapolation off-distribution, which corresponds to unbounded sample efficiency. I mean this in the somewhat noncentral sense where AlphaZero has unbounded sample efficiency, as it doesn’t need to look at any external examples to get good at playing. But it also doesn’t need to remain aligned with anything, so not a complete analogy. IDA seeks to make the analogy stronger.
This bitter lesson? http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Yes, though I’m obviously arguing against what I think it means in practice, and how it is used, not necessarily how it was originally formulated. I’ve always thought it was the wrong take on AI history, tacking much too hard toward scale-based approaches and forgetting that the successes of other methods could be useful too, as an over-correction from when people made the opposite mistake.