ELO itself is a relative system, defined by “If [your rating] - [their rating] is X, then we can compute your expected score [where win=1, draw=0.5, loss=0] as a function of X (specifically 11+10−X/400).”
that is detached from the FIDE ELO
Looking at the Wiki, one of the complaints is actually that, as the population of rated human players changes, the meaning of a given rating may change. If you could time-teleport an ELO 2400 player from 1950 into today, they might be significantly different from today’s ELO 2400 players. Whereas if you have a copy of Version N of a given chess engine, and you’re consistent about the time (or, I guess, machine cycles or instructions executed or something) that you allow it, then it will perform at the same level eternally. Now, that being the case, if you want to keep the predictions of “how do these fare against humans” up to date, you do want to periodically take a certain chess engine (or maybe several) and have a bunch of humans play against it to reestablish the correspondence.
Also, I’m sure that the underlying model with ELO isn’t exactly correct. It asserts that, if player A beats player B 64% of the time, and player B beats player C 64% of the time, then player A must beat player C 76% of the time; and if we throw D into the mix, who C beats 64% of the time, then A and B must beat D 85% and 76% of the time, respectively. It would be a miracle if that turned out to be exactly and always true in practice. So it’s more of a kludge that’s meant to work “well enough”.
… Actually, as I read more, the underlying validity of the ELO model does seem like a serious problem. Apparently FIDE rules say that any rating difference exceeding 400 (91% chance of victory) is to be treated as a difference of 400. So even among humans in practice, the model is acknowledged to break down.
This is expensive to calculate
Far less expensive to make computers play 100 games than to make humans play 100 games. Unless you’re using a supercomputer. Which is a valid choice, but it probably makes more sense in most cases to focus on chess engines that run on your laptop, and maybe do a few tests against supercomputers at the end if you feel like it.
and the error bar likely increases as you use more intermediate engines.
It does, though to what degree depends on what the errors are like. If you’re talking about uncorrelated errors due to measurement noise, then adding up N errors of the same size (i.e. standard deviation) would give you an error of √N times that size. And if you want to lower the error, you can always run more games.
However, if there are correlated errors, due to substantial underlying wrongness of the Elo model (or of its application to this scenario), then the total error may get pretty big. … I found a thread talking about FIDE rating vs human online chess ratings, wherein it seems that 1 online chess ELO point (from a weighted average of online classical and blitz ratings) = 0.86 FIDE ELO points, which would imply that e.g. if you beat someone 64% of the time in FIDE tournaments, then you’d beat them 66% of the time in online chess. I think tournaments tend to give players more time to think, which tends to lead to more draws, so that makes some sense...
But it also raises possibilities like, “Perhaps computers make mistakes in different ways”—actually, this is certainly true; a paper (which was attempting to correspond FIDE to CCRL ratings by analyzing the frequency and severity of mistakes, which is one dimension of chess expertise) indicates that the expected mistakes humans make are about 2x as bad as those chess engines make at similar rating levels. Anyway, it seems plausible that that would lead to different … mechanics.
Here are the problems with computer chess ELO ratings that Wiki talks about. Some come from the drawishness of high-level play, which is also felt at high-level human play:
Human–computer chess matches between 1997 (Deep Blue versus Garry Kasparov) and 2006 demonstrated that chess computers are capable of defeating even the strongest human players. However, chess engine ratings are difficult to quantify, due to variable factors such as the time control and the hardware the program runs on, and also the fact that chess is not a fair game. The existence and magnitude of the first-move advantage in chess becomes very important at the computer level. Beyond some skill threshold, an engine with White should be able to force a draw on demand from the starting position even against perfect play, simply because White begins with too big an advantage to lose compared to the small magnitude of the errors it is likely to make. Consequently, such an engine is more or less guaranteed to score at least 25% even against perfect play. Differences in skill beyond a certain point could only be picked up if one does not begin from the usual starting position, but instead chooses a starting position that is only barely not lost for one side. Because of these factors, ratings depend on pairings and the openings selected.[48] Published engine rating lists such as CCRL are based on engine-only games on standard hardware configurations and are not directly comparable to FIDE ratings.
Thanks for adding a much more detailed/factual context! This added more concrete evidence to my mental model of “ELO is not very accurate in multiple ways” too. I did already know some of the inaccuracies in how I presented it, but I wanted to write something rather than nothing, and converting vague intuitions into words is difficult.
ELO itself is a relative system, defined by “If [your rating] - [their rating] is X, then we can compute your expected score [where win=1, draw=0.5, loss=0] as a function of X (specifically 11+10−X/400).”
Looking at the Wiki, one of the complaints is actually that, as the population of rated human players changes, the meaning of a given rating may change. If you could time-teleport an ELO 2400 player from 1950 into today, they might be significantly different from today’s ELO 2400 players. Whereas if you have a copy of Version N of a given chess engine, and you’re consistent about the time (or, I guess, machine cycles or instructions executed or something) that you allow it, then it will perform at the same level eternally. Now, that being the case, if you want to keep the predictions of “how do these fare against humans” up to date, you do want to periodically take a certain chess engine (or maybe several) and have a bunch of humans play against it to reestablish the correspondence.
Also, I’m sure that the underlying model with ELO isn’t exactly correct. It asserts that, if player A beats player B 64% of the time, and player B beats player C 64% of the time, then player A must beat player C 76% of the time; and if we throw D into the mix, who C beats 64% of the time, then A and B must beat D 85% and 76% of the time, respectively. It would be a miracle if that turned out to be exactly and always true in practice. So it’s more of a kludge that’s meant to work “well enough”.
… Actually, as I read more, the underlying validity of the ELO model does seem like a serious problem. Apparently FIDE rules say that any rating difference exceeding 400 (91% chance of victory) is to be treated as a difference of 400. So even among humans in practice, the model is acknowledged to break down.
Far less expensive to make computers play 100 games than to make humans play 100 games. Unless you’re using a supercomputer. Which is a valid choice, but it probably makes more sense in most cases to focus on chess engines that run on your laptop, and maybe do a few tests against supercomputers at the end if you feel like it.
It does, though to what degree depends on what the errors are like. If you’re talking about uncorrelated errors due to measurement noise, then adding up N errors of the same size (i.e. standard deviation) would give you an error of √N times that size. And if you want to lower the error, you can always run more games.
However, if there are correlated errors, due to substantial underlying wrongness of the Elo model (or of its application to this scenario), then the total error may get pretty big. … I found a thread talking about FIDE rating vs human online chess ratings, wherein it seems that 1 online chess ELO point (from a weighted average of online classical and blitz ratings) = 0.86 FIDE ELO points, which would imply that e.g. if you beat someone 64% of the time in FIDE tournaments, then you’d beat them 66% of the time in online chess. I think tournaments tend to give players more time to think, which tends to lead to more draws, so that makes some sense...
But it also raises possibilities like, “Perhaps computers make mistakes in different ways”—actually, this is certainly true; a paper (which was attempting to correspond FIDE to CCRL ratings by analyzing the frequency and severity of mistakes, which is one dimension of chess expertise) indicates that the expected mistakes humans make are about 2x as bad as those chess engines make at similar rating levels. Anyway, it seems plausible that that would lead to different … mechanics.
Here are the problems with computer chess ELO ratings that Wiki talks about. Some come from the drawishness of high-level play, which is also felt at high-level human play:
Thanks for adding a much more detailed/factual context! This added more concrete evidence to my mental model of “ELO is not very accurate in multiple ways” too. I did already know some of the inaccuracies in how I presented it, but I wanted to write something rather than nothing, and converting vague intuitions into words is difficult.