Sorry, maybe I’m too dumb, but I don’t understand how this explains away the phenomenon. This seems to describe the thing that’s happening, but doesn’t explain why it happens. Saying X happens is not an explanation for why X happens.
Why is it that for many cognitive tasks, the best in the world are so far better than the median? Stating that they are is not an explanation of the phenomenon.
My point is that while intelligence is well approximated by a normal distribution (not perfectly; it may even be mildly log-normal), the other variables aren’t well approximated by a normal distribution at all. That means the controlling variable, intelligence, has very small variance, while the variables it controls have very large deviations: power laws or very heavy-tailed log-normals, whose distributions have very high variance, often spanning multiple orders of magnitude.
Intelligence being on a normal distribution is entirely unconnected to the magnitude of cognitive differences between humans: the units of the standard deviation are IQ points (1 SD = 15 IQ points), but IQ points aren’t a linear measure of intelligence as applied to optimization power, learning ability, etc. For that reason, it’s meaningless to consider how large a percentage of people fits into how many SDs.
Still, it’s very hard to generate orders of magnitude differences, because normal distributions have very thin tails.
That doesn’t follow. They have thin tails (in some well-specified mathematical sense), but that’s unconnected to them generating or not generating orders of magnitude differences.
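To make this concrete, here is a minimal sketch (Python), assuming a purely hypothetical exponential mapping from IQ to raw ability; the mapping and the factor of 3 are invented for illustration, since nothing in this thread pins down the real relationship:

```python
# Illustrative only: the exponential mapping below is an assumption, not a claim
# about how raw ability actually relates to IQ. The point is that a thin-tailed
# (normal) score distribution by itself says nothing about how large the
# underlying raw-ability ratios are, because IQ is a normalized scale.
import numpy as np

rng = np.random.default_rng(0)
iq = rng.normal(loc=100, scale=15, size=1_000_000)  # thin-tailed IQ scores

# Hypothetical nonlinear mapping: each 15 IQ points multiplies raw ability by k.
k = 3.0  # arbitrary illustrative factor
raw = k ** ((iq - 100) / 15)

median_raw = np.median(raw)      # ~1, i.e. the IQ-100 level
plus4sd_raw = k ** 4             # raw ability at IQ 160
print(f"+4 SD vs median raw-ability ratio: {plus4sd_raw / median_raw:.0f}x")  # ~81x

# The IQ histogram is still perfectly normal; any orders-of-magnitude gap lives
# entirely in the (unobserved) mapping from normalized scores to raw ability.
```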
Uh, I was under the impression that IQ was somewhat fat tailed.
[Epistemic status: pretty low confidence. That’s just a meme I’ve absorbed and not something I’ve independently verified. I wouldn’t be surprised to learn that I’m mistaken in an important aspect here.]
Self reported IQ is pretty fat tailed.
Aren’t there selection effects there? People who received higher IQ test scores are more likely to report their IQ score?
Indeed.
Does raw g factor show a normal distribution? Or is that just an artifact of the normalisation that is performed when computing IQ test scores?
And even with a normal distribution, do we know that it is not fat tailed? How large a difference in raw scores is there between +4 SD humans and median? What about −2 SD humans and median?
These can be calculated directly: +4 SD is 60% better than the median and −2 SD is 30% worse, i.e. 1.6x for the +4 SD case and 0.7x for the −2 SD case.
I think this is probably pretty accurate, though normalization may be a big problem here.
I remember reading a Gwern post that surveys a lot of studies on human ability, and they show similar or even stronger support for my theory that human abilities have a very narrow range.
My cruxes on this are the following, such that if I changed my mind on this, I’d agree with a broad range theory:
1. The normal or very thin-tailed log-normal distribution is not perfect, but it approximates the actual distribution of abilities well. That is, there aren’t large systematic errors in how we collect our data.
2. The normal or very thin-tailed log-normal distributions don’t approximate the tasks we actually do; that is, the top 1% do contribute 10-20% or more to success (a rough numerical check of this is sketched below).
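Here is one rough way to put a number on that second crux, with distribution parameters that are purely illustrative (not estimates from any of the studies discussed here):

```python
# What fraction of total output does the top 1% contribute under a thin-tailed
# (normal) vs a heavy-tailed (log-normal) output distribution? Parameters are
# arbitrary illustrations, not fitted to real data.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def top1_share(samples):
    samples = np.sort(samples)
    cutoff = int(0.99 * len(samples))
    return samples[cutoff:].sum() / samples.sum()

normal_output = rng.normal(100, 15, n)                         # thin-tailed
lognormal_output = rng.lognormal(mean=0.0, sigma=1.5, size=n)  # heavy-tailed

print(f"top-1% share, normal:     {top1_share(normal_output):.1%}")    # ~1.4%
print(f"top-1% share, log-normal: {top1_share(lognormal_output):.1%}") # ~20%
```

On these made-up numbers, the thin-tailed case puts the top 1% at barely above their headcount share, while the heavy-tailed case clears the 10-20% threshold, which is the distinction the crux turns on.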
You are probably thinking of my mentions of Wechsler 1935 that if you compare the extremes (defined as best/worst out of 1000, ie. ±3 SD) of human capabilities (defined as broadly as possible, including eg running) where the capability has a cardinal scale, the absolute range is surprisingly often around 2-3x. There’s no obvious reason that it should be 2-3x rather than 10x or 100x or lots of other numbers*, so it certainly seems like the human range is quite narrow and we are, from a big picture view going from viruses to hypothetical galaxy-spanning superintelligences, stamped out from the same mold. (There is probably some sort of normality + evolution + mutation-load justification for this but I continue to wait for someone to propose any quantitative argument which can explain why it’s 2-3x.)
You could also look at parts of cognitive tests which do allow absolute, not merely relative, measures, like vocabulary or digit span. If you look at, say, backwards digit span and note that most people have a backwards digit span of only ~4.5 and the range is pretty narrow (±<1 digit SD?), obviously there’s “plenty of room at the top” and mnemonists can train to achieve digit spans of hundreds and computers go to digit spans of trillions (at least in the sense of storing on hard drives as an upper bound). Similarly, vocabularies or reaction time: English has millions of words, of which most people will know maybe 25k or closer to 1% than 100% while a neural net like GPT-3 probably knows several times that and has no real barrier to being trained to the point where it just memorizes the OED & other dictionaries; or reaction time tests like reacting to a bright light will take 20-100ms across all humans no matter how greased-lightning their reflexes while if (for some reason) you designed an electronic circuit optimized for that task it’d be more like 0.000000001ms (terahertz circuits on the order of picoseconds, and there’s also more exotic stuff like photonics).
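For concreteness, here are the “room at the top” ratios implied by the numbers above, taking the loosely stated figures (digit spans “of hundreds”, “several times” the vocabulary, ~1 picosecond circuits) as rough placeholders:

```python
# Back-of-the-envelope "room at the top" ratios using the figures quoted above.
# The specific values (200 digits, 3x vocabulary) are placeholder readings of
# "hundreds" and "several times that", so treat these as order-of-magnitude only.
median_digit_span = 4.5
mnemonist_digit_span = 200            # "digit spans of hundreds"
median_vocab = 25_000
gpt3_vocab = 3 * median_vocab         # "several times that"
human_reaction_s = 20e-3              # 20 ms, fast end of the 20-100 ms range
circuit_reaction_s = 1e-12            # ~1 picosecond for a dedicated circuit

print(f"digit span headroom:    ~{mnemonist_digit_span / median_digit_span:.0f}x")  # ~44x
print(f"vocabulary headroom:    ~{gpt3_vocab / median_vocab:.0f}x")                 # ~3x
print(f"reaction-time headroom: ~{human_reaction_s / circuit_reaction_s:.0e}x")     # ~2e+10x
```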
* for example, in what you might call ‘compound’ capabilities like ‘number of papers published’, the range will probably be much larger than ‘2-3x’ (most people publish 0 papers and the most prolific author out of 1000 people probably publishes 100+), so it’s not like there’s any a priori physical limit on most of these. But these could just break down into atomic capabilities: if paper publishing is log-normal because it’s intelligence X ideas X work X … = publications, then a range of 2-3x in each one would quickly give you the observed skewed range. But the question is where that consistent 2-3x comes from: why couldn’t it be utterly dominated by one step where there’s a range of 1-10,000, say?
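A quick simulation of the footnote’s multiplicative point, with the number of factors and the per-factor spread chosen arbitrarily for illustration:

```python
# Multiply a few factors that are each confined to a ~3x best-to-worst range
# (out of 1000 people) and the compound range grows well past 3x. The factor
# count and spreads here are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 1000          # "best/worst out of 1000", matching the ±3 SD framing above
n_factors = 4     # e.g. intelligence x ideas x work x luck (hypothetical split)

# Each factor is log-normal, tuned so ±3 SD spans roughly a factor of 3.
sigma = np.log(3) / 6
factors = rng.lognormal(mean=0.0, sigma=sigma, size=(n, n_factors))
compound = factors.prod(axis=1)

single_range = factors[:, 0].max() / factors[:, 0].min()
compound_range = compound.max() / compound.min()
print(f"single-factor best/worst range: ~{single_range:.1f}x")    # ~3x
print(f"compound best/worst range:      ~{compound_range:.1f}x")  # ~10x with these choices
```

That reproduces the skewed compound range from moderate per-factor ranges, though it still doesn’t answer why each factor sits at 2-3x rather than 1-10,000.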
That’s what I was thinking about. Do you still have it on gwern.net? And can you link it please?
Some important implications here:
Eliezer’s spectrum is far more right than Dragon god’s spectrum of intelligence, and the claim of a broad spectrum needs to be reframed more narrowly.
This does suggest that AI intelligence could be much better than RL humans, even with limitations. That is, we should expect AI-to-human capability differentials to be quite large compared to human-to-human capability differentials.
Sorry, can you please walk me through these calculations.
Do you remember the post?
Basically, the standard deviation here is 15 and the median is 100, so what I did was multiply the standard deviation by the number of SDs, then add the result to or subtract it from 100, depending on whether the SD count is positive or negative.
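Spelled out as a tiny snippet, this is the arithmetic being described; it treats IQ points as a raw linear scale with median 100, which is exactly the assumption the replies below question (the function name is just for illustration):

```python
# Transcription of the calculation above: treat IQ points as a linear scale.
def iq_as_linear_ratio(sd_offset, median=100.0, sd=15.0):
    """Score at `sd_offset` standard deviations, as a multiple of the median."""
    return (median + sd_offset * sd) / median

print(iq_as_linear_ratio(+4))   # 1.6 -> "60% better"
print(iq_as_linear_ratio(-2))   # 0.7 -> "30% worse"
```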
I wish I did, but I don’t right now.
But 15 isn’t a difference in raw test scores. The raw underlying test scores are (re?)normalised to a distribution with a mean of 100 and a standard deviation of 15.
We don’t know what percentage difference in underlying cognitive ability/g factor 15 represents.
Yeah, this is probably a big question mark here, and an important area to study.