hippke

Karma: 431

hippke Jul 12, 2021, 6:22 PM
1 point
in reply to: magfrump’s comment on: How much chess engine progress is about adapting to bigger computers?
Yes, sorry, I got that the wrong way around. 70%=algo

hippke Jul 9, 2021, 7:29 PM
LW: 10 AF: 1
AF
in reply to: paulfchristiano’s comment on: How much chess engine progress is about adapting to bigger computers?
i) To pick a reference year, it seems reasonable to take the mid/late 1990s:
- Almost all chess engines before ~1996 either lacked multi-core support or used multiple cores very inefficiently (very lengthy discussion here).
- Chess protocols became available, so that the engine and the GUI separated. That makes it straightforward to automate games for benchmarking.
- Modern engines should work on machines of that age, considering RAM constraints.
- The most famous human-computer games took place in 1997: Kasparov-Deep Blue. That’s almost a quarter of a century ago (nice round number...). Also, at the time, commercial algorithms were considerably below human-level play.

ii) Sounds good

iii) The influence of endgame tables and opening books is typically small. It is reasonable to neglect them in our experiments.

iv) Yes, the 4-case-test is a good idea:
- 1997 PC with 1997 engine: ELO XXXX
- 1997 PC with 2021 engine: ELO XXXX
- 2021 PC with 1997 engine: ELO XXXX
- 2021 PC with 2021 engine: ELO XXXX
One main result of these experiments will be the split: Where does the ELO gain come from? Is it the compute, or the algo improvement? And the answer will be about 70% compute, 30% algo (give or take 10 percentage points) over the last 25 years. Even without serious experiments, you can have a look at the Stockfish evolution at constant compute. That’s a gain of +700 ELO points over ~8 years (on the high side, historically). For comparison, you gain ~70 ELO per doubling of compute. Over 8 years one has on average gained ~40x compute, yielding +375 ELO. That’s 700:375 ELO for compute:algo, or a rounded 70%-30% (SF has improved rather fast).
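The back-of-envelope split above can be reproduced with a few lines of Python. This is only a sketch: the 18-month doubling period is an assumption (the Moore's-law figure used elsewhere in this thread), and the function and parameter names are made up for illustration. Note that the +700 ELO is the gain at constant compute, i.e. the algorithmic part, as the follow-up reply (70% = algo) points out.

```python
def split_elo_gain(algo_gain_elo, years, elo_per_doubling=70.0,
                   doubling_period_years=1.5):
    """Split a total ELO gain into a compute share and an algorithm share.

    algo_gain_elo: ELO gained at constant compute (software only).
    doubling_period_years: assumed compute doubling time (18 months here).
    """
    doublings = years / doubling_period_years
    compute_factor = 2.0 ** doublings              # total growth in compute
    compute_gain_elo = elo_per_doubling * doublings
    total = algo_gain_elo + compute_gain_elo
    return {
        "compute_factor": compute_factor,
        "compute_gain_elo": compute_gain_elo,
        "algo_share": algo_gain_elo / total,
        "compute_share": compute_gain_elo / total,
    }


# Numbers from the comment: +700 ELO at constant compute over ~8 years.
result = split_elo_gain(algo_gain_elo=700, years=8)
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

With these inputs the compute gain comes out near +373 ELO and the split near 65:35, which rounds to the 70%-30% quoted.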

To baseline the old machine, we don’t need to boot up old hardware. There is plenty of trustworthy old benchmarking still available that has these numbers.

As the modern baseline, I would certainly recommend Stockfish:
- It has been the best (or among the very top) for the last decade or so
- It is open source and has a very large dev community. Steps in improvements can be explained.
- Open source means it can be compiled on any machine that has a C++ compiler

Other modern engines will perform similarly, because they use similar methods. After all, SF is open source.

As a bonus, one could benchmark a Neural Network-based engine like LC0. There will be issues when using it without a GPU, however.

As for the old engine, it is more difficult to choose. Most engines were commercial programs, not open source. There is an old version of Fritz 5 (from 1998) freely available that supports protocols. I got it installed on a modern Windows with some headache. Perhaps that could be used. Fritz was, at the time of the Kasparov-Deep Blue match, the strongest commercial engine.

hippke Jul 9, 2021, 6:39 PM
1 point
in reply to: paulfchristiano’s comment on: Measuring hardware overhang
The MIPS are only a “lookup table” from the year, based on a CPU list. It’s for the reader’s convenience to show the year (linear), plus a rough measure of compute (exponential).

The nodes/s measure has the problem that it is engine-dependent.

The real math was done by scaling down one engine (SF8) by time-per-move, and then calibrating the time to the computers of that era (e.g., a Quad i7 from 2009 has 200x the nodes/s of a PII-300 from 1999).
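As a minimal sketch of that calibration: if a modern CPU searches N times more nodes/s than an old one, giving the engine 1/N of the time per move on the modern CPU roughly emulates the old machine. The function name and the 120 s example time control are hypothetical; only the 200x ratio comes from the comment.

```python
def emulated_time_per_move(old_time_s, speed_ratio):
    """Time per move on the fast machine that matches the node budget
    the slow machine would get in old_time_s seconds.

    speed_ratio: nodes/s of the fast machine divided by nodes/s of the
    slow machine, measured with the same engine.
    """
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    return old_time_s / speed_ratio


# Ratio from the comment: Quad i7 (2009) vs. PII-300 (1999).
SPEED_RATIO_I7_VS_PII300 = 200.0

# Hypothetical time control: 120 s per move on the PII-300.
print(emulated_time_per_move(120.0, SPEED_RATIO_I7_VS_PII300))  # 0.6 seconds
```

This ignores memory size and bandwidth effects, which the thread treats as a small correction.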

hippke Jul 9, 2021, 9:55 AM
1 point
in reply to: paulfchristiano’s comment on: Measuring hardware overhang
Yes, that is a correct interpretation. The SF8 numbers are:

MIPS = [139814.4, 69907, 17476, 8738, 4369, 2184, 1092, 546.2, 273.1, 136.5, 68.3, 34.1, 17.1, 8.5, 4.3]

ELO = [3407,3375,3318,3290,3260,3225,3181,3125,3051,2955,2831,2671,2470,2219,1910]

Note that the range of values is larger than the plotted range in the Figure. The Figure cuts off at an 80486DX 33 MHz, 27 MIPS, introduced May 7, 1990.

To derive an analytical result, it is reasonable to interpolate with a spline and then subtract. Let me know if you have a specific question (e.g. for the year 2000).
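A rough sketch of the interpolate-then-subtract step, using the SF8 numbers listed above. To stay within the standard library it uses piecewise-linear interpolation in log2(MIPS) as a simpler stand-in for the spline; the function names are made up for illustration.

```python
import math

# SF8 data from this comment (highest compute first).
MIPS = [139814.4, 69907, 17476, 8738, 4369, 2184, 1092, 546.2,
        273.1, 136.5, 68.3, 34.1, 17.1, 8.5, 4.3]
ELO = [3407, 3375, 3318, 3290, 3260, 3225, 3181, 3125,
       3051, 2955, 2831, 2671, 2470, 2219, 1910]

# Sort ascending in log2(MIPS) so that interpolation is straightforward.
_XS = [math.log2(m) for m in reversed(MIPS)]
_YS = list(reversed(ELO))


def elo_at(mips):
    """Interpolate the SF8 ELO at a given MIPS value (linear in log2 MIPS)."""
    x = math.log2(mips)
    if not _XS[0] <= x <= _XS[-1]:
        raise ValueError("MIPS value outside the measured range")
    for i in range(len(_XS) - 1):
        if _XS[i] <= x <= _XS[i + 1]:
            t = (x - _XS[i]) / (_XS[i + 1] - _XS[i])
            return _YS[i] + t * (_YS[i + 1] - _YS[i])
    raise AssertionError("unreachable: x lies within the checked range")


def elo_gap(mips_a, mips_b):
    """ELO difference between two compute levels for the same engine."""
    return elo_at(mips_a) - elo_at(mips_b)


print(elo_gap(8738, 4369))  # one doubling near the top of the range
```

A spline would smooth the kinks between data points, but the differences between nearby compute levels should be similar.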

hippke Jul 9, 2021, 6:28 AM
1 point
in reply to: paulfchristiano’s comment on: How much chess engine progress is about adapting to bigger computers?
As you can see in my Figure in this post (https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-hardware-overhang), Leela (a Neural Network-based chess engine) shows a log-linear ELO-FLOPs scaling very similar to that of traditional algorithms. At least in this case, Neural Networks scale slightly better for more compute, and worse for less compute. It would be interesting to determine if the bad scaling to old machines is a universal feature of NNs. Perhaps it is: NNs require a certain amount of memory, etc., which gives stronger constraints. The conclusion would be that the hardware overhang is reduced: Older hardware is less suitable for NNs.

hippke Jul 9, 2021, 6:07 AM
LW: 8 AF: 3
AF
in reply to: paulfchristiano’s comment on: How much chess engine progress is about adapting to bigger computers?
Thank you for your interest: It’s good to see people asking similar questions! Also, thank you for incentivizing research with rewards. Yes, I think closing the gaps will be straightforward. I still have the raw data, scripts, etc. to pick it up.

i) old engines on new hardware—can be done; needs definition of which engines/hardware

ii) raw data + reproduction—perhaps everything can be scripted and put on GitHub

iii) controls for memory + endgame tables—can be done, needs definition of requirements

iv) Perhaps the community can already agree on a set of experiments before they are performed, e.g. memory? I mean, I can look up “typical” values of past years, but I’m open to other values.

hippke Mar 6, 2021, 10:48 AM
5 points
on: Fun with +12 OOMs of Compute
At the Landauer kT limit, you need $10^{14}$ kWh to perform your $10^{35}$ FLOPs. That’s 10,000x the yearly US electricity production. You’d need a quantum computer or a Dyson sphere to solve that problem.

hippke Dec 1, 2020, 6:43 PM
3 points
in reply to: Tim Liptrot’s comment on: SETI Predictions
Like AGI winter, a time of reduced funding

SETI Predictions

hippke Nov 30, 2020, 8:09 PM

23 points

8 comments, 1 min read

The next AI winter will be due to energy costs

hippke Nov 24, 2020, 4:53 PM

70 points

8 comments, 2 min read

How GPT-N will escape from its AI-box

hippke Aug 12, 2020, 7:34 PM

7 points

9 comments, 1 min read

hippke Aug 6, 2020, 5:33 PM
LW: 1 AF: 1
AF
in reply to: gwern’s comment on: Measuring hardware overhang
Right. My experiment used 1 GB for Stockfish, which would also work on a 486 machine (although at the time, it was almost unheard of...)

hippke Aug 6, 2020, 5:31 PM
LW: 7 AF: 3
AF
in reply to: Peter Jin’s comment on: Measuring hardware overhang
(a) The most recent data points are from CCRL. They use an i7-4770k and the listed tournament conditions. With this setup, SF11 has an ELO of about 3500. That’s what I used as the baseline to calibrate my own machine (an i7-7700k).

(b) I used the SF8 default which is 1 GB.

(c) Yes. However, the hardware details (RAM, memory bandwidth) are not all that important. You can use these SF9 benchmarks on various CPUs. For example, the AMD Ryzen 1800 is listed with 304,510 MIPS and gets 14,377,000 nodes/sec on Stockfish (i.e., 47.2 nodes per MIPS). The oldest CPU in the list, the Pentium-150, has 282 MIPS and reaches 5,626 nodes/sec (i.e., 19.9 nodes per MIPS). That’s about a factor of two difference, due to memory and related advantages of the newer machine. As we’re getting that much every 18 months due to Moore’s law, it’s a small (but relevant) detail, and decreases the hardware overhang slightly. Thanks for bringing that up!
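The nodes-per-MIPS comparison is a one-line division per CPU. A minimal sketch using the two benchmark entries quoted above (the dictionary layout and function name are just for illustration):

```python
# (MIPS, Stockfish nodes/sec) for the two CPUs quoted in the comment.
BENCHMARKS = {
    "AMD Ryzen 1800": (304_510, 14_377_000),
    "Pentium-150": (282, 5_626),
}


def nodes_per_mips(mips, nodes_per_sec):
    """How many search nodes the engine gets out of each MIPS of raw compute."""
    return nodes_per_sec / mips


efficiency = {name: nodes_per_mips(*values) for name, values in BENCHMARKS.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} nodes per MIPS")

# Ratio between the modern and the old CPU (roughly a factor of two).
ratio = efficiency["AMD Ryzen 1800"] / efficiency["Pentium-150"]
print(f"ratio: {ratio:.2f}")
```

The newer CPU extracts more nodes per MIPS, which is why this effect slightly reduces the hardware overhang.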

Giving Stockfish more memory also helps, but not a lot. Also, you can’t give 128 GB of RAM to a 486 CPU. The 1 GB is probably already stretching it. This is another small detail that likely reduces the overhang by less than one year.

There are a few more subtle details like endgame databases. Back then, these were small, constrained by disk space limitations. Today, we have 7-piece endgame databases through the cloud (they weigh in at 140 TB). That seems to be worth about 50 ELO.

Measuring hardware overhang

hippke 5 Aug 2020 19:59 UTC

116 points

14 comments, 4 min read

hippke 29 Jul 2020 13:25 UTC
1 point
in reply to: oceaninthemiddleofanisland’s comment on: Predictions for GPT-N
Regarding (1): Of course a step is possible; you never know. But for arithmetic, it is not a step. It may appear so from their poor Figure, but the data indicates otherwise.

hippke 29 Jul 2020 13:22 UTC
2 points
in reply to: Ricardo Meneghin’s comment on: Predictions for GPT-N
True. Do these tests scale out to super-human performance or are they capped at 100%?

hippke 29 Jul 2020 13:22 UTC
10 points
in reply to: romeostevensit’s comment on: Predictions for GPT-N
Except if you have an idea to monetize one of these sub-tasks? An investment on the order of 10 million USD in compute is not very large if you can create a Pokemon Comedy TV channel out of it, or something like that.

Predictions for GPT-N

hippke 29 Jul 2020 1:16 UTC

36 points

31 comments, 1 min read
