This works for corn plants because the underlying measurement “amount of protein” is something that we can quantify (in grams or whatever) in addition to comparing two different corn plants to see which one has more protein. IQ tests don’t do this in any meaningful sense; think of an IQ test more like the Mohs hardness scale, where you can figure out a new material’s position on the scale by comparing it to a few with similar hardness and seeing which are harder and which are softer. If it’s harder than all of the previously tested materials, it just goes at the top of the scale.
IQ tests include sub-tests which can be cardinal, with absolute variables. For example, simple & complex reaction time; forwards & backwards digit span; and vocabulary size. (You could also consider tests of factual knowledge.) It would be entirely possible to ask, ‘given that reaction time follows a log-normalish distribution in milliseconds and loads on g by r = 0.X and assuming invariance, what would be the predicted lower reaction time of someone Y SDs higher than the mean on g?’ Or ‘given that backwards digit span is normally distributed...’ This is as concrete and meaningful as grams of protein in maize. (There are others, like naming synonyms or telling different stories or inventing different uses of an object etc, where there is a clear count you could use, beyond just relative comparisons of ‘A got an item right and B got an item wrong’.)
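To make the reaction-time question concrete, here is a minimal sketch of how such a prediction could be computed. It assumes log reaction time and g are jointly normal and that the relationship holds unchanged outside the observed range (the "invariance" assumption above). The function name and every number in it (a 300 ms median, a log-scale SD of 0.2, a loading of 0.3) are hypothetical placeholders, not values from this thread or from any study.

```python
import math

def predicted_median_rt_ms(g_sd, median_rt_ms, sd_log_rt, loading):
    """Predicted median reaction time (ms) for someone g_sd standard
    deviations above the mean on g.

    Assumes log(RT) and g are bivariate normal with correlation -loading
    (higher g means faster responses), so the conditional mean of log(RT)
    shifts by -loading * g_sd * sd_log_rt. All inputs are hypothetical.
    """
    return median_rt_ms * math.exp(-loading * g_sd * sd_log_rt)

# Hypothetical population: median 300 ms, SD of log(RT) 0.2, loading 0.3.
for g_sd in (0, 2, 5, 10):
    rt = predicted_median_rt_ms(g_sd, 300.0, 0.2, 0.3)
    print(f"{g_sd:>2} SD above mean: about {rt:.0f} ms")
```

The output is in milliseconds, an absolute unit, which is the sense in which the sub-test is cardinal. Whether the extrapolation is any good depends entirely on the invariance assumption.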
Psychometrics has many ways to make tests harder or deal with ceilings. You could speed them up, for example, and allot someone 30 seconds to solve a problem that takes a very smart person 30 minutes. Or you could set a problem so hard that no one can reliably solve it, and see how many attempts it takes to get it right (the more wrong guesses you make and are corrected on, the worse). Or you could make problems more difficult by removing information from them, and see how many hints it takes. (Similar to handicapping in Go.) Or you could remove tools and references, like going from an open-book test to a closed-book test. For some tests, like Raven matrices, you can define a generating process to create new problems by combining a set of rules, so you have a natural objective level of difficulty there. A long time ago there was an attempt to create an ‘objective IQ test’ usable for any AI system by testing them on predicting small randomly-sampled Turing machines; it never got anywhere AFAIK, but I still think this is a viable idea.
(And you increasingly see all of these approaches being taken to try to create benchmarks that can meaningfully measure LLM capabilities for just the next year or two...)
I think these are good ideas. I still agree with Erick’s core objection that once you’re outside of “normal” human range + some buffer, IQ as classically understood is no longer a directly meaningful concept so we’ll have to redefine it somehow, and there are a lot of free parameters for how to define it (eg somebody’s 250 can be another person’s 600).
You can definitely extrapolate out of distribution on tests where the baseline is human performance. We do this with chess ELO ratings all the time.
Yeah, something along the lines of an ELO-style rating would probably work better for this. You could put lots of hard questions on the test and then instead of just ranking people you compare which questions they missed, etc.
I follow chess engines very casually as a hobby. Calibrating chess engines’ computer-vs-computer ELO against human ELO is a real problem. I doubt extrapolating IQ over 300 will provide accurate predictions.
Can you explain in more detail what the problems are?
Take with a grain of salt.
Observation:
Chess engines during development only play against themselves, so they use a relative ELO system that is detached from the FIDE ELO. https://github.com/official-stockfish/Stockfish/wiki/Regression-Tests#normalized-elo-progression https://training.lczero.org/?full_elo=1 https://nextchessmove.com/dev-builds/sf14
It is very hard to find chess engines confidently telling you what their FIDE ELO is.
Interpretation / Guess: Modern chess engines probably need some intermediate engines to transitively calculate their ELO. (Engine A is 200 ELO better than players at 2200, Engine B is again 200 ELO better than A...) This is expensive to calculate and the error bar likely increases as you use more intermediate engines.
ELO itself is a relative system, defined by “If [your rating] - [their rating] is X, then we can compute your expected score [where win=1, draw=0.5, loss=0] as a function of X (specifically $\frac{1}{1 + 10^{-X/400}}$).”
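For reference, the expected-score function is the standard logistic curve 1 / (1 + 10^(−X/400)). A minimal sketch in Python (the function name is just illustrative):

```python
def expected_score(rating_diff: float) -> float:
    """Expected score (win=1, draw=0.5, loss=0) for a player rated
    rating_diff points above their opponent, under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# A few rating gaps and the expected score each one implies.
for diff in (0, 100, 200, 400):
    print(diff, round(expected_score(diff), 3))
```

Note that only the difference X enters the formula, so adding a constant to everyone's rating changes nothing. That is the sense in which the scale is relative.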
Looking at the Wiki, one of the complaints is actually that, as the population of rated human players changes, the meaning of a given rating may change. If you could time-teleport an ELO 2400 player from 1950 into today, they might be significantly different from today’s ELO 2400 players. Whereas if you have a copy of Version N of a given chess engine, and you’re consistent about the time (or, I guess, machine cycles or instructions executed or something) that you allow it, then it will perform at the same level eternally. Now, that being the case, if you want to keep the predictions of “how do these fare against humans” up to date, you do want to periodically take a certain chess engine (or maybe several) and have a bunch of humans play against it to reestablish the correspondence.
Also, I’m sure that the underlying model with ELO isn’t exactly correct. It asserts that, if player A beats player B 64% of the time, and player B beats player C 64% of the time, then player A must beat player C 76% of the time; and if we throw D into the mix, who C beats 64% of the time, then A and B must beat D 85% and 76% of the time, respectively. It would be a miracle if that turned out to be exactly and always true in practice. So it’s more of a kludge that’s meant to work “well enough”.
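The 64% / 76% / 85% chain can be checked directly: under the model, rating gaps add, so you convert 64% into a gap, multiply it, and convert back. A small sketch, assuming the standard logistic form of the model (helper names are illustrative):

```python
import math

def expected_score(rating_diff: float) -> float:
    """Expected score for a player rated rating_diff points higher."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def rating_diff(score: float) -> float:
    """Inverse of expected_score: the rating gap implied by a given score."""
    return 400.0 * math.log10(score / (1.0 - score))

step = rating_diff(0.64)  # gap implied by scoring 64% (roughly 100 points)

# A vs B is one step, A vs C is two steps, A vs D is three steps.
for steps, pairing in ((1, "A vs B"), (2, "A vs C"), (3, "A vs D")):
    print(pairing, round(expected_score(steps * step), 2))
```

Equivalently, the model says the odds multiply at each step: (0.64/0.36) squared for A vs C, cubed for A vs D. Nothing guarantees real players behave that way.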
… Actually, as I read more, the underlying validity of the ELO model does seem like a serious problem. Apparently FIDE rules say that any rating difference exceeding 400 (a 91% expected score) is to be treated as a difference of 400. So even among humans in practice, the model is acknowledged to break down.
“This is expensive to calculate”
Far less expensive to make computers play 100 games than to make humans play 100 games. Unless you’re using a supercomputer. Which is a valid choice, but it probably makes more sense in most cases to focus on chess engines that run on your laptop, and maybe do a few tests against supercomputers at the end if you feel like it.
“and the error bar likely increases as you use more intermediate engines.”
It does, though to what degree depends on what the errors are like. If you’re talking about uncorrelated errors due to measurement noise, then adding up N errors of the same size (i.e. standard deviation) would give you an error of √N times that size. And if you want to lower the error, you can always run more games.
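A rough sketch of the uncorrelated case, to put numbers on it. This assumes each game is an independent win/loss with no draws (draws would shrink the noise somewhat), uses a first-order (delta-method) approximation to turn score noise into rating noise, and treats each link in the chain as independent. The function names are illustrative.

```python
import math

def elo_diff_stderr(score: float, n_games: int) -> float:
    """Approximate standard error (in rating points) of a rating gap
    estimated from n_games independent decisive games with true score
    `score`. Delta method on D = 400 * log10(p / (1 - p))."""
    return 400.0 / (math.log(10) * math.sqrt(n_games * score * (1.0 - score)))

def chained_stderr(per_link_stderr: float, n_links: int) -> float:
    """Standard error after summing n_links independent, equally noisy gaps."""
    return math.sqrt(n_links) * per_link_stderr

one_link = elo_diff_stderr(0.64, 100)
print("one link, 100 games:   ", round(one_link, 1))
print("four links, 100 games: ", round(chained_stderr(one_link, 4), 1))
print("four links, 400 games: ", round(chained_stderr(elo_diff_stderr(0.64, 400), 4), 1))
```

Under these assumptions, quadrupling the number of links doubles the error, and quadrupling the games per link halves it again. Correlated model error, discussed next, does not shrink this way.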
However, if there are correlated errors, due to substantial underlying wrongness of the Elo model (or of its application to this scenario), then the total error may get pretty big. … I found a thread talking about FIDE rating vs human online chess ratings, wherein it seems that 1 online chess ELO point (from a weighted average of online classical and blitz ratings) = 0.86 FIDE ELO points, which would imply that e.g. if you beat someone 64% of the time in FIDE tournaments, then you’d beat them 66% of the time in online chess. I think tournaments tend to give players more time to think, which tends to lead to more draws, so that makes some sense...
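The 64% to 66% figure follows from rescaling the rating gap. A quick check, assuming the same logistic formula applies on both scales and taking the 0.86 conversion factor from that thread at face value (helper names are illustrative):

```python
import math

def expected_score(rating_diff: float) -> float:
    """Expected score for a player rated rating_diff points higher."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def rating_diff(score: float) -> float:
    """Rating gap implied by a given expected score."""
    return 400.0 * math.log10(score / (1.0 - score))

FIDE_POINTS_PER_ONLINE_POINT = 0.86  # figure quoted from the thread, not verified

fide_gap = rating_diff(0.64)                        # gap on the FIDE scale
online_gap = fide_gap / FIDE_POINTS_PER_ONLINE_POINT  # same pair, online scale
print(round(fide_gap), round(online_gap), round(expected_score(online_gap), 2))
```

So a gap of about 100 FIDE points becomes about 116 online points, and the same pair of players is predicted to score differently depending on which pool they are rated in.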
But it also raises possibilities like, “Perhaps computers make mistakes in different ways.” Actually, this is certainly true: a paper (which attempted to map FIDE ratings to CCRL ratings by analyzing the frequency and severity of mistakes, which is one dimension of chess expertise) indicates that the expected mistakes humans make are about 2x as bad as those chess engines make at similar rating levels. Anyway, it seems plausible that that would lead to different … mechanics.
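Here are the problems with computer chess ELO ratings that Wiki talks about. Some come from the drawishness of high-level play, which is also felt at high-level human play:
“Human–computer chess matches between 1997 (Deep Blue versus Garry Kasparov) and 2006 demonstrated that chess computers are capable of defeating even the strongest human players. However, chess engine ratings are difficult to quantify, due to variable factors such as the time control and the hardware the program runs on, and also the fact that chess is not a fair game. The existence and magnitude of the first-move advantage in chess becomes very important at the computer level. Beyond some skill threshold, an engine with White should be able to force a draw on demand from the starting position even against perfect play, simply because White begins with too big an advantage to lose compared to the small magnitude of the errors it is likely to make. Consequently, such an engine is more or less guaranteed to score at least 25% even against perfect play. Differences in skill beyond a certain point could only be picked up if one does not begin from the usual starting position, but instead chooses a starting position that is only barely not lost for one side. Because of these factors, ratings depend on pairings and the openings selected.[48] Published engine rating lists such as CCRL are based on engine-only games on standard hardware configurations and are not directly comparable to FIDE ratings.”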
Thanks for adding a much more detailed/factual context! This added more concrete evidence to my mental model of “ELO is not very accurate in multiple ways” too. I did already know some of the inaccuracies in how I presented it, but I wanted to write something rather than nothing, and converting vague intuitions into words is difficult.