To clarify my stance on prizes:
I will probably offer Gwern a $100 small prize for the link.
I will probably offer hippke a $1000 prize for the prior work.
I would probably have offered hippke something like a $3000 prize if the experiment hadn’t already been done.
The main things to make the prize bigger would have been: (i) doing the other half, evaluating old engines on new hardware; (ii) more clarity about the numbers, including publishing the raw data and ideally sufficiently detailed instructions for reproducing; (iii) more careful controls for memory and endgame tables; and (iv) I would post a call for critiques to highlight reservations with the numbers before awarding the rest of the prize.
Someone could still earn a $10,000 prize for closing all of those gaps (and hippke could earn some large fraction of this).
Thank you for your interest: It’s good to see people asking similar questions! Also, thank you for incentivizing research with rewards. Yes, I think closing the gaps will be straightforward. I still have the raw data, scripts, etc. needed to pick this up.
i) old engines on new hardware—can be done; needs definition of which engines/hardware
ii) raw data + reproduction—perhaps everything can be scripted and put on GitHub
iii) controls for memory + endgame tables—can be done, needs definition of requirements
iv) Perhaps the community can agree on the set of experiments before they are performed, e.g. memory sizes? I can look up “typical” values from past years, but I’m open to other values.
i) I’m interested in any good+scalable old engine. I think it’s reasonable to focus on something easy; the most important constraint is that it is really state of the art and scales up pretty gracefully. I’d prefer 2000 or earlier.
ii) It would be great if there was at least a complete description (stuff like: these numbers were looked up from this source with links, the population was made of the following engines with implementations from this link, here’s the big table of game results and the Elo calculation, here is the code that was run to estimate nodes/sec).
iii) For the “old” experiment I’d like to use the memory of the reference machine from the old period. I’d prefer to basically remove endgame tables and opening books.
My ideal would be to pick a particular “old” year as the focus. Ideally that would be a year for which we (a) have an implementation of the engine, (b) have representative hardware from the period that we can use to compute nodes/sec for each of our engines. Then I’m interested in:
Compute nodes/sec for the old and new engine on both the old and new hardware. This gives us 4 numbers.
Evaluate the Elo of both of those engines, running with both “old memory” and “new memory,” as a function of nodes/turn. This gives us 4 graphs (a rough sketch of both measurements follows below).
(I assume that memory affects performance somewhat independently of nodes/turn, at least for the new engine? If nodes/turn is the wrong measure, use whatever other measure of computational cost makes sense; the important thing is that the cost is linear in the measurement.)
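For concreteness, a minimal sketch of both measurements, assuming both engines speak UCI (an old engine may need a protocol adapter) and using the python-chess library; the engine paths, hash sizes, and node budgets are placeholders rather than the actual setup:

```python
# Rough sketch of the two measurements above (pip install chess).
import chess
import chess.engine

OLD_ENGINE = "/path/to/old_engine"   # hypothetical path
NEW_ENGINE = "/path/to/stockfish"    # hypothetical path

def nodes_per_second(engine_path, seconds=30):
    """Measurement 1: nodes/sec from a fixed-time search of the start position."""
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        info = engine.analyse(chess.Board(), chess.engine.Limit(time=seconds))
        return info.get("nodes", 0) / seconds

def play_game(white_path, black_path, nodes_per_move, hash_mb=16):
    """Measurement 2: one game at a fixed node budget per move.
    'Old memory' vs 'new memory' is just the Hash setting (if the engine has one)."""
    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci(white_path) as white, \
         chess.engine.SimpleEngine.popen_uci(black_path) as black:
        for eng in (white, black):
            eng.configure({"Hash": hash_mb})
        while not board.is_game_over():
            eng = white if board.turn == chess.WHITE else black
            board.push(eng.play(board, chess.engine.Limit(nodes=nodes_per_move)).move)
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

if __name__ == "__main__":
    # The "4 numbers": run this on both the old and the new machine.
    print("old engine nps:", nodes_per_second(OLD_ENGINE))
    print("new engine nps:", nodes_per_second(NEW_ENGINE))
    # One point per node budget on the "Elo vs nodes/turn" curves; a real run
    # needs many games per budget, alternating colours and varying openings.
    for nodes in (10_000, 100_000, 1_000_000):
        print(nodes, play_game(OLD_ENGINE, NEW_ENGINE, nodes))
```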
i) To pick a reference year, it seems reasonable to take the mid/late 1990s:
- Almost all chess engines before ~1996 lacked multi-core support, or used it very inefficiently (very lengthy discussion here).
- Chess protocols became available, separating the engine from the GUI. That makes it straightforward to automate games for benchmarking.
- Modern engines should work on machines of that age, considering RAM constraints.
- The most famous human-computer games took place in 1997: Kasparov-Deep Blue. That’s almost a quarter of a century ago (nice round number...). Also, at the time, commercial algorithms were considerably below human-level play.
ii) Sounds good
iii) The influence of endgame tables and opening books is typically small. It is reasonable to neglect them in our experiments.
iv) Yes, the 4-case-test is a good idea:
- 1997 PC with 1997 engine: ELO XXXX
- 1997 PC with 2021 engine: ELO XXXX
- 2021 PC with 1997 engine: ELO XXXX
- 2021 PC with 2021 engine: ELO XXXX
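For filling in those XXXX values from head-to-head results, the relevant conversion is the standard logistic Elo model; a minimal sketch (ignoring draw models, rating anchors, and error bars):

```python
# Convert a match score into an Elo difference (standard logistic model).
import math

def elo_diff(wins, draws, losses):
    score = (wins + 0.5 * draws) / (wins + draws + losses)  # assumes 0 < score < 1
    return 400 * math.log10(score / (1 - score))

# Example: a +60 =30 -10 match is roughly a 190-point Elo edge.
print(round(elo_diff(60, 30, 10)))  # ~191
```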
One main result of these experiments will be the split: Where does the ELO gain come from? Is it the compute, or the algorithmic improvement? The answer will be about 70% compute, 30% algo (give or take 10 percentage points) over the last 25 years. Even without serious experiments, have a look at the Stockfish evolution at constant compute: that’s a gain of +700 ELO points over ~8 years (on the high side, historically). For comparison, you gain ~70 ELO per doubling of compute. Over 8 years one has on average gained ~400x compute, yielding +375 ELO. That’s 700:375 ELO for compute:algo, or a rounded 70%-30% (SF has improved rather fast).
To baseline the old machine, we don’t need to boot up old hardware. There is plenty of trustworthy old benchmark data still available with these numbers.
As the modern baseline, I would certainly recommend Stockfish:
- It has been the best engine (or among the very top) for the last decade or so
- It is open source and has a very large dev community, so individual improvement steps can be explained
- Being open source also means it can be compiled on any machine that has a C++ compiler
Other modern engines will perform similarly, because they use similar methods. After all, SF is open source.
As a bonus, one could benchmark a Neural Network-based engine like LC0. There will be issues when using it without a GPU, however.
As for the old engine, it is more difficult to choose. Most engines were commercial programs, not open source. There is an old version of Fritz 5 (from 1998) freely available that supports protocols. I got it installed on modern Windows with some headaches. Perhaps that could be used. Fritz was, at the time of the Kasparov-Deep Blue match, the strongest commercial engine.
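If the old engine can only be driven through the XBoard/WinBoard protocol rather than UCI (whether Fritz 5 can be exposed that way, e.g. via an adapter, is an assumption on my part), python-chess can talk to that protocol too; a minimal sketch with a placeholder path:

```python
# Driving an XBoard/WinBoard-protocol engine; the path is a placeholder and an
# adapter may be needed for old commercial engines.
import chess
import chess.engine

with chess.engine.SimpleEngine.popen_xboard("/path/to/old_xboard_engine") as engine:
    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(time=5.0))  # 5 seconds per move
    print("old engine plays", board.san(result.move))
```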
I like using Fritz.
It sounds like we are on basically the same page about what experiments would be interesting.
Isn’t that 70:30 algo:compute?
Yes, sorry, I got that the wrong way around. 70%=algo
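(With the corrected labels, the back-of-the-envelope split from the figures quoted above works out as follows; the inputs are the quoted estimates, not independently measured numbers.)

```python
# Back-of-the-envelope split using the figures quoted above.
elo_from_engine = 700   # Stockfish gain at constant compute over ~8 years
elo_from_compute = 375  # quoted estimate for the ~400x hardware speedup
total = elo_from_engine + elo_from_compute
print(f"algo:    {100 * elo_from_engine / total:.0f}%")   # ~65%
print(f"compute: {100 * elo_from_compute / total:.0f}%")  # ~35%
```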
Stockfish 12 and newer have neural network (NNUE)-based evaluation enabled by default, so I wouldn’t say that Stockfish is similar to other non-NN modern engines.
https://nextchessmove.com/dev-builds is based on playing various versions of Stockfish against each other. However, it is known that this overestimates the ELO gain. I believe +70 ELO for doubling compute is also on the high side, even on single-core computers.
I was imagining using Stockfish from before the introduction of NNUE (I think that’s August 2020?). Seems worth being careful about.
I am very interested in the extent to which “play against copies of yourself” overstates the elo gains.
I am hoping to get some mileage / robustness out of the direct comparison—how much do we have to scale up/down the old/new engine for them to be well-matched with each other? Hopefully that will look similar to the numbers from looking directly at Elo.
(But point taken about the claimed degree of algorithmic progress above.)
Good point: SF12+ profit from NNs indirectly.
Regarding the ELO gain with compute: that’s a function of diminishing returns. At very small compute, you gain +300 ELO per doubling; after ~10 doublings, that reduces to +30 ELO. In between is the region with ~70 ELO per doubling; that’s where engines usually operate on present hardware with minutes of think time. I’m currently running a set of benchmarks to plot a nice graph of this.
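A toy sketch of how one point on that curve could be measured: self-play between the same engine at N and 2N nodes per move, converted into an Elo-per-doubling estimate. The engine path and game count are placeholders; a real run needs hundreds of games per budget and a varied opening set.

```python
# Toy estimate of Elo gain per doubling of nodes, via self-play at N vs 2N nodes.
import math
import chess
import chess.engine

ENGINE = "/path/to/stockfish"  # hypothetical path

def score_of_double_nodes(nodes, games=20):
    """Fraction of points scored by the 2N-node side against the N-node side."""
    points = 0.0
    for g in range(games):
        board = chess.Board()
        big_is_white = (g % 2 == 0)  # alternate colours
        with chess.engine.SimpleEngine.popen_uci(ENGINE) as small, \
             chess.engine.SimpleEngine.popen_uci(ENGINE) as big:
            while not board.is_game_over():
                mover_is_big = (board.turn == chess.WHITE) == big_is_white
                limit = chess.engine.Limit(nodes=2 * nodes if mover_is_big else nodes)
                board.push((big if mover_is_big else small).play(board, limit).move)
        result = board.result()
        if result == "1/2-1/2":
            points += 0.5
        elif (result == "1-0") == big_is_white:
            points += 1.0
    return points / games

for nodes in (10_000, 100_000, 1_000_000):
    s = score_of_double_nodes(nodes)  # assumes 0 < s < 1
    print(nodes, "->", round(400 * math.log10(s / (1 - s))), "Elo per doubling")
```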
Very tangential to the discussion, so feel free to ignore, but given that you have put some thought into prize structures before, I am curious about the reasoning for why you would award a different prize for something done in the past versus something done in the future.