Benchmarking an old chess engine on new hardware
I previously explored the performance of a modern chess engine on old hardware (1, 2). Paul Christiano asked for the case of an old engine running on modern hardware. This is the topic of the present post.
State of the art
Through an online search, I found the CCRL Blitz Rating list. It is run on an i7-4770k at 9.2 MNodes/s. The time controls are 2min+1s per move, i.e. 160s per 40 moves, or 4s per move. On the 4770k, that’s 36.8 MNodes/move. The current number one on that list is Stockfish 14 at 3745 ELO. The list includes Fritz 5.32 from 1997, but on old hardware (Pentium 90. Over the years, CCRL moved through P90-P200-K62/450-Athlon1200-Athlon4600-4770k). I screened the list, but found no case of old-engine on modern hardware.
One solution may be to reach out to the CCRL and ask for a special test run, calibrated against several modern engines.
I can also imagine that the Swedish Chess Computer Association is able to perform the test directly, using their Rating List procedure. It covers 155,019 games played by 397 computers, going back to 2 MHz machines from the year 1984. They may be able to transplant an old version onto a new machine for cross-calibration. The right person to contact would be Lars Sandin.
But the devil is in the details...
Making own experiments
In principle, the experiment should be trivial. The standard tool to compare chess engines is the command-line interface cutechess-cli. Set up old and new and get it running.
Problem 1: Find an old engine with a working interface
Paul and I independently favored the program Fritz from the most famous chess year 1997, because it was at the time well respected, had seen serious development effort over many years, and won competitions. The problem is that it only supports today’s standard interface, UCI, since version 7 from the year 2001. So with that we can go back 20 years, but no more. Earlier versions used a proprietary interface (to connect to its GUI and to ChessBase), for which I found no converter.
I was then adviced by Stefan Pohl to try Rebel 6.0 instead. Rebel was amongst the strongest engines between 1980 and 2005. For example, it won the WCCC 1992 in Madrid. It was converted to UCI by its author Ed Schröder. I believe that old Rebel makes for a similarly good comparison as old Fritz.
A series of other old engines that support UCI are available for download. The following experiments should be possible with any of these.
Problem 2: Configuration
We can download Rebel 6 and set up cutechess like so:
cutechess-cli
-engine cmd="C:\SF14.exe" proto=uci option.Hash=128 tc=40/60+0.6 ponder=off
-engine cmd="rebeluci.exe" dir="C:\Rebel 6.0" timemargin=1000 proto=uci tc=40/60+0.6 ponder=off -rounds 20
One can define hash (RAM), pondering, and other settings through UCI in cutechess. However, Rebel does not accept these settings through the UCI interface. Instead, they must be defined in its config file wb2uci.eng:
Ponder = false
Set InitString = BookOff/n
Program = rebel6.exe w7 rebel6.eng
The “wX” parameter sets hash/RAM: w0=2 MB, w1=4 MB,.. w9=512 MB
So, we can test RAM settings between 2 MB and 512 MB for the old engine. Stockfish is typically run at its default of 128 MB. That amount would have been possible on old machines (486 era), although it would have been not common.
For Rebel 6, the interface runs through an adaptor, which takes time. If we would ignore the fact, it would simply lose due to time control violations. So, we need to give it 1000ms of slack with “timemargin=1000”. Now, let’s go!
Problem 3: Measuring the quality of a player that losses every single game
I ran this experiment over night. The result after 1,000 matches? Stockfish won all of them. So it appears that Rebel is worse, but by how much? You can’t tell if it loses almost every game.
Rebel 6.0 has ELO 2415 on a P90, Stockfish 13 is 3544 (SF14 is not in that list yet). That’s a difference of 1129 ELO, with expectations to draw only one game in 2199; and lose the rest. Of course, Rebel 6 will have more than 2415 ELO when running on a modern machine—that’s what we want to measure. Measuring the gap costs a lot of compute, because many games need to be played.
OK, can’t we just make SF run slower until it is more equal? Sure, we can do that, but that’s a different experiment: The one of my previous post. So, let’s keep the time controls to something sensible, at least at Blitz level for SF. With time controls of 30s, a game takes about a minute. We expect to play for at most 2199 games (worth 36 hours) until the first draw occurs. If we want to collect 10 draws to fight small number statistics, that’s worth 15 days of compute. On a 16-core machine, 1 day.
Unfortunately, I’m currently on vacation with no access to a good computer. If somebody out there has the resources to execute the experiment, let me know—happy to assist in setting it up!
Experiment: SF3 versus SF13
There is a similar experiment with less old software that can be done on smaller computers: Going back 8 years between new and old SF versions.
The Stockfish team self-tests new versions against old. This self-testing of Stockfish inflates the ELO score:
Between their SF3-SF13 is a difference of 631 ELO.
In the CCRL Rating list, which compares many engines, the difference is only 379 ELO (the list doesn’t have SF14 yet).
Thus, self-testing inflates scores.
Let us compare these versions (SF3 vs SF13) more rigorously. The timegap between 2013 and 2021 is ~8 years. Let us choose a good, but not ridiculous computer for both epochs (something like <1000 USD for the CPU). That would buy us an Intel Core i7 4770K in 2013, and an AMD 5950X in 2021. Their SF multicore speed is 10 vs. 78 MN/s; a compute factor of 8x.
We can now test:
How much more compute does SF3 require to match SF13?
Answer: 32x (uncertainty: 30-35x)How much of the ELO gap does SF3 close at 8x compute?
Answer: 189 i.e. 50% of 379 ELO, or 30% of 631 ELO.
Interpretation: If we accept SF as amongst the very best chess programs in the last decade, we can make a more general assessment of chess compute vs. algorithm. Compute explains 30-50% of the computer chess ELO progress; algorithm improvements explain 50-70%.
Experiment method:
cutechess-cli -engine cmd="C:\sf3.exe" proto=uci tc=40/10 -engine cmd="C:\sf14.exe" proto=uci tc=40/20 -rounds 100
This runs a set of 100 rounds, changing colors each round, between SF3 and SF13 with their default parameters. The time controls in this example are 40 moves in 20 vs. 10 seconds etc. The time values must be explored in a wide range to determine consistency: The result may be valid for “fast” time controls, but not for longer games.
From my experience, it is required to play at least 100 games for useful uncertainties.
Due to time constraints, I have only explored the blitz regime so far (30s per game and less). Yet, the results are consistent at these short time controls. I strongly assume that it also holds for longer time controls. Then, algorithms explain ~50% of the ELO gain for Stockfish over the last 8 years. Others are invited to execute the experiment at longer time settings.
Minor influence factors
So far, we have used no pondering (default in cutechess), no endgame tables, no opening books, default RAM (128 MB for Stockfish).
Endgame databases: A classical space-compute trade-off. Decades ago, these were small; constrained by disk space limitations. Today, we have 7-stone endgame databases through the cloud (they weigh in at 140 TB). They seem to be worth about 50 ELO.
The influence of opening books is small. I also suspect that its influence diminished as engines got better.
Pondering: If the engine continues to calculate with opponents time, it can use ~50% more time. Typical “hit rates” are about 60%. Thus, the advantage should be similar to 30% longer time control. ELO with time is not linear, thus no fixed ELO gain can be given.
RAM sizes (hash table sizes) have a very small influence (source 1, 2). For Stockfish, it appears to be only a few ELO.
- Voting Results for the 2021 Review by 1 Feb 2023 8:02 UTC; 66 points) (
- 3 Dec 2021 4:18 UTC; 6 points) 's comment on Biology-Inspired AGI Timelines: The Trick That Never Works by (
I ended up referring back to this post multiple times when trying to understand the empirical data on takeoff speeds and in-particular for trying to estimate the speed of algorithmic progress independent of hardware progress.
I also was quite interested in the details here in order to understand the absolute returns to more intelligence/compute in different domains.
One particular follow-up post I would love to see is to run the same study, but this time with material disadvantage. In-particular, I would really like to see, in both chess and go, how much material advantage AIs could make up for with greater intelligence/compute. This one seems particularly relevant for a bunch of takeoff dynamics in my current model of the world.