How much chess engine progress is about adapting to bigger computers?
(This question comes from a discussion with Carl Shulman.)
In this post I describe an experiment that I’d like to see run. I’m posting a $1,000 - $10,000 prize for a convincing implementation of these experiments. I also post a number of smaller prizes for relevant desk research or important corrections to this request.
Motivation
In order to understand the dynamics of the singularity, I’d like to understand how easy it is to improve algorithms and software.
We can learn something about this from looking at chess engines. It’s not the most relevant domain to future AI, but it’s one with an unusually long history and unusually clear (and consistent) performance metrics.
In order to quantify the quality of a chess engine, we can fix a level of play and ask “How much compute is needed for the engine to play at that level?”
One complication in evaluating the rate of progress is that it depends on what level of play we use for evaluation. In particular, newer algorithms are generally designed to play at a much higher level than older algorithms. So if we quantify the compute needed to reach modern levels of play, we will capture both absolute improvements and also “adaptation” to the new higher amounts of compute.
So we’d like to attribute progress in chess engines to three factors:
Better software.
Bigger computers.
Software that is better-adapted to new, bigger computers.
Understanding the size of factor #1 is important for extrapolating progress given massive R&D investments in software. While it is easy to separate factors #1 and #2 from publicly available information, it is not easy to evaluate factor #3.
Experiment description
Pick two (or more) software engines from very different times. They should both be roughly state of the art, running on “typical” machines from the era (i.e. the machines for which R&D is mostly targeted).
We then carry out two matches:
Run the old engine on its “native” hardware (the “old hardware”). Then evaluate: how little compute does the new engine need in order to beat the old engine?
Run the new engine on its “native” hardware (the “new hardware”). Then evaluate: how much compute does the old engine need in order to beat the new engine?
With some effort, we can estimate a quantitative ratio of “ops needed” for each of these experiments. For example, we may find that the new engine is able to beat the old engine using only 1% of the “old hardware.” Whereas we may find that the old engine would require 10,000x the “new hardware” in order to compete with the new engine.
The first experiment tells us about the absolute improvements in chess engines on the task for which the old engine was optimized. (This understates the rate of software progress to the extent that people stopped working on this task.) The second experiment gives us the combination of absolute improvements + adaptation to new hardware. Typical measures of “rate of software progress” will be somewhere in between, and are sensitive to the hardware on which the evaluation is carried out.
I believe that understanding these two numbers would give us a significantly clearer picture of what’s really going on with software progress in chess engines.
Experiment details
Here’s some guesses about how to run this experiment well. I don’t know much about computer chess, so you may be able to make a better proposal.
Old engine, old hardware: my default proposal is the version of Fritz that won the 1995 world computer chess championship, using the same amount of hardware (and time controls) as in that championship. This algorithm seems like a particularly reasonable “best effort” at making full use of available computing resources. I don’t want to compare an engine running on a very expensive old machine to an engine running on a cheap modern machine. You may have to be opportunistic about what kind of thing you can actually run.
New engine, new hardware: my default proposal is the version of Stockfish that won TCEC Season 20, on the same hardware+time used in that competition.
Running a new engine on old hardware: We should use whatever modern engine works best with teensy computers. It’s not important it be the same as the modern engine on modern hardware. We’d prefer if there was a dedicated team continuing to work on this problem, but absent that we want to use the best thing that exists.
Memory use. When running on the old hardware we need to match the memory use of the old machine. I’m not sure how to handle scaling of memory. One possibility is to hold the ratio fixed and scale them both up/down together, but something more realistic would be welcome.
Matchup vs ELO: I proposed experiments organized around 1:1 contests. I think there are lots of ways that could go wrong. It would be really good to at least sanity-check the results by comparing against a third reference engine. Quantification of this factor is welcome.
More engines, more hardware: you could do the same experiment with additional software engines, and could evaluate at more levels of compute. I think you get a lot of the benefits from the first few measurements, but more data helps and it might be a lot cheaper (especially if you need more engines to get good ELO estimates anyway).
Even older engines. I’m interested in results even older than Fritz, but the further back we go the more uncertainty I have about how to actually do the comparison.
Endgame tables, opening book, learned heuristics: some of the knowledge produced by chess engines was produced using large computers (by playing or observing very large numbers of games, brute-forcing endgames, and so on). We’d prefer exclude these factors if they use more compute than would have been affordable for the old engine. If we can’t, we at least want to quantify their influence. Including these factors could significantly overstate the returns of human R&D; it’s an important thing to know about, but for the purpose of forecasting the impacts of AI we really want to separate it out. This may be a major constraint for considering new engines.
What counts as an “operation”? I don’t think that making “hardware” comparisons between new and old computers will be straightforward. I think it’s easier if we restrict attention to consumer microprocessors, and I’m hoping that there’s only a little bit of uncertainty (e.g. a factor of 2). I think you can do the experiments before figuring this out, and then try to clarify relevant issues by seeing how fast the computers can run simple relevant backgrounds. (The “old hardware” run should probably be on just one processor.)
Timing out. It may be prohibitively expensive to give the old engine enough compute to beat the new engine. I think you should do a little bit of that work, to try to figure out the basic picture (how expensive it would be, rough bounds on the numbers, some very noisy estimates from playing just a few games). We can figure out where to go from there.
Different time controls. I’m expecting time controls to be comparable between the old and new competitions. If old competitions ran on much longer time controls, I’d prefer scale them down to something comparable (to try to better match what would have been realistic to experiment with during R&D / profitable for non-competition purposes). And similarly if new competitions are longer.
Ponder. Thinking during the opponent’s turn could change the answer a tiny amount (up to a factor of 2), but it really doesn’t seem worth dealing with, even if some of the engines are optimized to use it.
Inefficiencies from running on new computers. You might have to mess things up a lot to run old engines on new computers (or to deal with weird amounts of memory or so on). Ideally it would be possible to abstract out some of the details of “how long it actually takes” to talk about what the computation or memory costs ought to be if implemented without overhead. We’ll have to see how that goes.
Prize structure
I’m not spending long evaluating any of this and my decisions are going to be completely unaccountable. Dealing with me may be a waste of time. Your only recourse will be to make an angry comment on this post.
I know that all makes it less appealing to participate. But please have it in mind before spending any time on this project, and consider yourself warned!
I’m planning to give away at least $1,000 sometime in the next 3 months if anyone runs any plausible version of this experiment, or even points me to public information from which it’s possible to back out a similar number. I’m fine if that means I have to give $1,000 to a random LW comment.
I’d give away $10,000 if someone ran a clean version of the experiment, I found it convincing, and they were transparent about their methods. Before giving away the full prize I’d likely have a waiting period where I offered a separate prize for people to post replications or caveats or so on.
I’m generally expecting/hoping to wrap this up over the next couple of months, but will adjust based on responses.
If multiple people do experiments I’ll make some arbitrary call about how to allocate prize money. Timing will matter a bit but it’s a secondary consideration to quality of experiment. Earlier submissions can get paid based if they helped pave the way for later submissions.
If you are planning to spend time on this I encourage you to make a comment. I’m very happy to provide thoughts on a proposed experiment, e.g. to tell you what size of prize I’d expect to give it or what concerns I’d have about a proposal. None of this is binding.
If receiving a prize would be weird or problematic for some reason, I’m still interested in the results and you can opt to receive a comparable quantity of gratitude instead of $.
Desk research prizes
I’d also pay out $100-$1000 at my discretion for any of the following:
Plausible (or better yet convincing) estimates of the total investment in chess engine R&D over time.
Good analysis of the relative importance of hardware/software using public information (must at least improve over the analysis here). Or pointers to other similar experiments that have already been run.
Any consideration that I feel should significantly change the experimental setup, e.g. quantifying the importance of endgame tables, or noting a critical difference between old and new chess engines, or suggesting a reason that Fritz is a bad engine to compare to.
Any contributions that make it significantly easier to run this experiment (for example tracking down a usable implementation of an old chess engine).
https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-hardware-overhang ?
Thanks for the link (and thanks to hippke for doing the experiments), that’s great.
To clarify my stance on prizes:
I will probably offer Gwern a $100 small prize for the link.
I will probably offer hippke a $1000 prize for the prior work.
I would probably have offered hippke something like a $3000 prize if the experiment hadn’t already been done.
The main thing to make the prize bigger would have been (i) doing the other half, of evaluating old engines on new hardware, (ii) more clarity about the numbers including publishing the raw data and ideally sufficiently detailed instructions for reproducing, (iii) more careful controls for memory, endgame tables, (iv) I would post a call for critiques to highlight reservations with the numbers before awarding the rest of the prize.
Someone could still earn a $10,000 prize for closing all of those gaps (and hippke could earn some large fraction of this).
Thank you for your interest: It’s good to see people asking similar questions! Also thank-you for incentivizing research with rewards. Yes, I think closing the gaps will be straightforward. I still have the raw data, scripts, etc. to pick it up.
i) old engines on new hardware—can be done; needs definition of which engines/hardware
ii) raw data + reproduction—perhaps everything can be scripted and put on GitHub
iii) controls for memory + endgame tables—can be done, needs definition of requirements
iv) Perhaps the community can already agree on a set of experiments before they are performed, e.g. memory? I mean, I can look up “typical” values of past years, but I’m open for other values.
i) I’m interested in any good+scalable old engine. I think it’s reasonable to focus on something easy, the most important constraint is that it is really state of the art and scales up pretty gracefully. I’d prefer 2000 or earlier.
ii) It would be great if where was at least a complete description (stuff like: these numbers were looked up from this source with links, the population was made of the following engines with implementations from this link, here’s the big table of game results and the elo calculation, here was the code that was run to estimate nodes/sec).
iii) For the “old” experiment I’d like to use memory from the reference machine from the old period. I’d prefer basically remove endgame tables and opening book.
My ideal would be to pick a particular “old” year as the focus. Ideally that would be a year for which we (a) have an implementation of the engine, (b) have representative hardware from the period that we can use to compute nodes/sec for each of our engines. Then I’m interested in:
Compute nodes/sec for the old and new engine on both the old and new hardware. This gives us 4 numbers.
Evaluate elos both of those engines, running on both “old memory” and “new memory,” as a function of nodes/turn. This gives us 4 graphs.
(I assume that memory affects performance slightly independently of nodes/turn, at least for the new engine? If nodes/turn is the wrong measure, whatever other measure of computational cost makes sense, the important thing is that the cost is linear in the measurement.)
i) To pick a reference year, it seems reasonable to take the mid/late 1990s:
- Almost all chess engines before ~1996 lacked (or had serious inefficiencies) using multi-cores (very lengthy discussion here).
- Chess protocols became available, so that the engine and the GUI separated. That makes it straightforward to automate games for benchmarking.
- Modern engines should work on machines of that age, considering RAM constraints.
- The most famous human-computer games took place in 1997: Kasparov-Deep Blue. That’s almost a quarter of a century ago (nice round number...). Also, at the time, commercial algorithms were considerably below human-level play.
ii) Sounds good
iii) The influence of endgames tables and opening books is typically small. It is reasonable to neglect it in our experiments.
iv) Yes, the 4-case-test is a good idea:
- 1997 PC with 1997 engine: ELO XXXX
- 1997 PC with 2021 engine: ELO XXXX
- 2021 PC with 1997 engine: ELO XXXX
- 2021 PC with 2021 engine: ELO XXXX
One main result of these experiments will be the split: Where does the ELO gain come from? Is it the compute, or the algo improvement? And the answer will be about 70% compute, 30% algo (give or take 10 percentage points) over the last 25 years. Without serious experiments, have a look at the Stockfish evolution at constant compute. That’s a gain of +700 ELO points over ~8 years (on the high side, historically). For comparison, you gain ~70 ELO per double compute. Over 8 years one has on average gained ~400x compute, yielding +375 ELO. That’s 700:375 ELO for compute:algo, or a rounded 70%-30% (SF has improved rather fast).
To baseline the old machine, we don’t need to boot up old hardware. There is plenty of trustworthy old benchmarking still available that has these numbers.
As the modern baseline, I would certainly recommend Stockfish:
- It is the best (or amongst the very top) for the last decade or so
- It is open source and has a very large dev community. Steps in improvements can be explained.
- Open source means it can be compiled on any machine that has a C++ compiler
Other modern engines will perform similarly, because they use similar methods. After all, SF is open source.
As a bonus, one could benchmark a Neural Network-based engine like LC0. There will be issues when using it without a GPU, however.
As for the old engine, it is more difficult to choose. Most engines were commercial programs, not open source. There is an old version of Fritz 5 (from 1998) freely available that supports protocols. I got it installed on a modern Windows with some headache. Perhaps that could be used. Fritz was, at the time of the Kasparov-Deep Blue match, the strongest commercial engine.
I like using Fritz.
It sounds like we are on basically the same page about what experiments would be interesting.
Isn’t that 70:30 algo:compute?
Yes, sorry, I got that the wrong way around. 70%=algo
Stockfish 12 and newer have neural network (NNUE)-based evaluation enabled by default so I wouldn’t say that Stockfish is similar to other non-NN modern engines.
https://nextchessmove.com/dev-builds is based on playing various versions of Stockfish against each other. However, it is known that this overestimates the ELO gain. I believe +70 ELO for doubling compute is also on the high side, even on single-core computers.
I was imagining using Stockfish from before the introduction of NNUE (I think that’s August 2020?). Seems worth being careful about.
I am very interested in the extent to which “play against copies of yourself” overstates the elo gains.
I am hoping to get some mileage / robustness out of the direct comparison—how much do we have to scale up/down the old/new engine for them to be well-matched with each other? Hopefully that will look similar to the numbers from looking directly at Elo.
(But point taken about the claimed degree of algorithmic progress above.)
Good point: SF12+ profit from NNs indirectly.
Regarding the ELO gain with compute: That’s a function of diminishing returns. At very small compute, you gain +300 ELO; after ~10 doublings that reduces to +30 ELO. In between is the region with ~70 ELO; that’s where engines usually operate on present hardware with minutes of think time. I currently run a set of benchmarks to plot a nice graph of this.
Very tangential to the discussion so feel free to ignore, but given that you have put some though before on prize structures I am curious about the reasoning for why you would award a different prize for something done in the past versus something done in the future
(Preface: I know about go engines, less about chess ones). I don’t think this experiment will forecast the impact of AI without further addressing neural networks. In particular,
The strongest engines are MuZero-like engines that use neural network heuristics trained on self-play. Training such large networks on commodity CPUs is implausible, much less on 20 year old hardware.
Given trained networks, new engines will almost always beat old ones. For example in go, the open source engine KataGo running on a single core CPU, doing only one playout per move has a 5d rank (2300 ELO) on the Online Go Server. Old engines can’t reach this rank even on large computing clusters.
The large improvement in strength is mainly attributable to neural networks being practical with new hardware, not new algorithms. The algorithms for both new and old engines are based on comparatively old literature (Monte Carlo Tree Search, decision theory, etc.)
Conclusion: Most of what you want to measure comes down to neural network training. The training framework is not directly comparable or backwards-compatible with old techniques, so the experiment formulation has to address this.
This seems right if “the dynamics of ML R&D are unrelated to other software R&D—you can’t learn about neural net efficiency improvements by looking at efficiency improvements in other domains.” But I’m not so sure about that (and haven’t seen any evidence for it).
ETA: to clarify, I’m mostly interested in how much future AI will get improved as we massively scale up R&D investment (by applying AI to AI development). This includes e.g. “Tweaking neural net architectures” or “Better optimization algorithms for neural networks” or “better ways to integrate neural networks with search” or whatever. Those improvements are indeed different from “better forms of tree search” or “better position evaluations” and so on. But I still think they are related—if I learn that for a few different domains “doubling R&D doubles performance,” then that gives me evidence that neural net performance will be similar, and if I learn that this kind of return is very rare then I’ll be more skeptical about that kind of extrapolation holding up even if I observe it for the first few orders of magnitude for neural networks.
As you can see in my Figure in this post (https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-hardware-overhang), Leela (Neural Network based chess engine) has a very similar log-linear ELO-FLOPs scaling as traditional algorithms. At least in this case, Neutral Networks scale slightly better for more compute, and worse for less compute. It would be interesting to determine if the bad scaling to old machines is a universal feature of NNs. Perhaps it is: NNs require a certain amount of memory, etc., which gives stronger constraints. The conclusion would be that the hardware overhang is reduced: Older hardware is less suitable for NNs.
I’d recommend posting about your challenge on http://talkchess.com/forum3/index.php. You will find people who are experienced in testing old and new chess programs and some might be interested in the prizes. If you don’t have an account there, I can post a link for you.
I don’t think this research, if done, would give you strong information about the field of AI as a whole.
I think that, of the many topics researched by AI researchers, chess playing is far from the typical case.
An unusually long history implies unusually slow progress. There are problems that computers couldn’t do at all a few years ago that they can do fairly efficiently now. Are there problems where people basically figured out how to do that decades ago and no significant progress has been made since?
The consistency of chess performance looks like more selection bias. You aren’t choosing a problem domain where there was one huge breakthrough that. You are choosing a problem domain that has had slow consistent progress.
For most of the development of chess AI (All the way from Alpha Beta pruning to Alpha Zero) Chess AI’s improved by an accumulation of narrow, chess specific tricks. (And more compute) How to represent chess states in memory in a fast and efficient manor. Better evaluation functions. Tables of starting and ending games. Progress on chess AI’s contained no breakthroughs, no fundamental insights, only a slow accumulation of little tricks.
There are cases of problems that we basically knew how to solve from the early days of computers, any performance improvements are almost purely hardware improvements.
There are problems where one paper reduces the compute requirements by 20 orders of magnitude. Or gets us from couldn’t do X at all, to able to do X easily.
The pattern of which algorithms are considered AI and which are considered maths and which are considered just programming is somewhat arbitrary. A chess playing algorithm is AI, a prime factoring algorithm is maths, a sorting algorithm is programming or computer science. Why? Well those are the names of the academic departments that work on them.
You have a spectrum of possible reference classes for transformative AI that range from the almost purely software driven progress, to the almost totally hardware driven progress.
To gain more info about transformative AI, someone would have to make either a good case for why it should be at a particular position on the scale, or a good case for why its position on the scale should be similar to the position of some previous piece of past research. In the latter case, we can gain from examining the position of that research topic. If hypothetically that topic was chess, then the research you propose would be useful. If the reason you chose chess was purely that you thought it was easier to measure, then the results are likely useless.
Is your prediction that e.g. the behavior of chess will be unrelated to the behavior of SAT solving, or to factoring? Or that “those kinds of things” can be related to each other but not to image classification? Or is your prediction that the “new regime” for chess (now that ML is involved) will look qualitatively different than the old regime?
I’m aware of very few examples of that occurring for problems that anyone cared about (i.e. in all such cases we found the breakthroughs before they mattered, not after). Are you aware of any?
Factoring algorithms, or primality checking, seem like fine domains to study to me. I’m also interested in those and would be happy to offer similar bounties for similar analyses.
I think it’s pretty easy to talk about what distinguishes chess, SAT, classification, or factoring from multiplication. And I’m very comfortable predicting that the kind of AI that helps with R&D is more like the first four than like the last (though these things are surely on a spectrum).
You may have different intuitions, I think that’s fine, in which case this explains part of why this data is more interesting to me than you.
Can you point to a domain where increasing R&D led to big insights that improved performance?
Perhaps more importantly, machine learning is also “a slow accumulation of little tricks,” so the analogy seems fine to me. (You might think that future AI is totally different, which is fine and not something I want to argue about here.)
If Alice says this and so never learns about anything, and Bob instead learns a bunch of facts about a bunch of domains, I’m pretty comfortable betting on Bob being more accurate about most topics.
I think the general point is: different domains differ from one another. You want to learn about a bunch of them and see what’s going on, in order to reason about a new domain.
I agree with the basic point that board games are selected to be domains where there is an obvious simple thing to do, and so progress started early. I think in that way they are similar to SAT solving and factoring, and (slightly) different from image classification, for which it’s arguably hard to measure progress before the 90s.
I think that the more important difference between chess and image classification (in terms of making the chess data clean) is that there is a homogeneous measure of performance for chess, whereas image classification has moved on to harder and harder tasks. I think this does slightly change the nature of the task, but mostly it just makes the data clean.
I think that the main difference between chess and SAT solving is mostly that chess is more naturally interesting to people so they’ve been working on it longer, and that is a real factor that makes the data cleaner (without making it less useful as an analogy). SAT solving also has some of the image classification problem, of depending a lot on the distribution of instances.
(With respect to “aren’t choosing a domain where there was one huge breakthrough,” I’m definitely interested in domains with such breakthroughs.)
The oldest Chess engine you can find for free online is cray blitz https://craftychess.com/downloads/crayblitz/ which was the world computer chess champion in 1983. Unfortunately A: it is not UCI compatible and B: the oldest version available is the 1989 version. I asked Harry Lewis Nelson himself yesterday and he said that he doesn’t have the source code for the 1983 version anymore. Unfortunately any farther back and the original author doesn’t appear to be alive anymore. (harry himself is 89)
The oldest chess engine that can be called a champion is kaissa https://www.chessprogramming.org/Kaissa however the best I can do is give you a rewrite in turbo C http://greko.su/index_en.html I figure 1977 is as old as you’re going to get. Hippke can probably do most of the heavy lifting from here.
I agree with the use of stockfish 11 as it appears to be the best engine with poor hardware, NNUE requires more RAM than actually exists on a 1997 machine.
A Time-Leap Challenge for SAT Solving seems related:
I think you can compare modern chess programs with each other to evaluate this.
Some comparisons have been made between different modern chess engines in TCEC.
Stockfish is particularly well adapted to using lots of cores. i.e. Stockfish has a much larger advantage over the other modern programs when lots of CPU cores a available as they have optimized hash table contention very well.
If you compare NNUE stockfish to classic stockfish is also the question of how much strength stockfish NNUE loses when playing on hardware that does not support SIMD.
Similarly you can look at compare Leela Chess with and without GPU.
On the old vs new:
Old chess engines will have been optimized to use much less memory, where as modern chess engines use a very large hash table.
i.e. 3. Software that is better-adapted to new, bigger computers. is quite a large component.
I’m pretty interested in understanding the size of this effect by scaling down the memory use as well as compute to historical levels. (This is one of my concerns about hippke’s experiment, though it seems like they think it’s not a big factor.)