And even with a normal distribution for the test scores, do we know that the underlying trait is not fat tailed? How large a difference in raw scores is there between +4 SD humans and the median? What about −2 SD humans and the median?
These can be calculated directly: roughly 60% better and 30% worse respectively, i.e. 1.6x the median score for the +4 SD case and 0.7x for the −2 SD case.
I think this is probably pretty accurate, though normalization may be a big problem here.
I remember reading a Gwern post that surveys a lot of studies on human ability, and they show similar or even stronger support for my theory that human abilities have a very narrow range.
My cruxes on this are the following; if I changed my mind on them, I’d agree with a broad-range theory:
The normal (or very thin-tailed log-normal) distribution is not perfect, but it approximates the actual distribution of abilities well. That is, there aren’t large systematic errors in how we collect our data.
The normal or very thin-tailed log-normal also approximates the tasks we actually do; that is, it is not the case that the top 1% contribute 10-20% or more of total success (see the sketch below).
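To make that crux concrete, here is a minimal sketch (the distribution parameters are made up for illustration, not taken from any study) of how much of the total “output” the top 1% accounts for under a thin-tailed normal versus a fat-tailed log-normal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def top_share(samples, top_frac=0.01):
    """Fraction of the total sum contributed by the top `top_frac` of samples."""
    cutoff = np.quantile(samples, 1 - top_frac)
    return samples[samples >= cutoff].sum() / samples.sum()

# Thin-tailed: normal with mean 100, SD 15 (IQ-like scale, illustrative only).
normal = rng.normal(100, 15, n)
# Fat-tailed: log-normal whose log has SD 1.0 (also illustrative).
lognormal = rng.lognormal(mean=0.0, sigma=1.0, size=n)

print(f"top 1% share, normal(100, 15):  {top_share(normal):.1%}")     # ~1.4%
print(f"top 1% share, log-normal(0, 1): {top_share(lognormal):.1%}")  # ~9-10%
```

Under the thin-tailed assumption the top 1% only account for a percent or two of the total, so tasks where they demonstrably contribute 10-20% or more would be evidence against this crux.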
You are probably thinking of my mentions of Wechsler 1935 that if you compare the extremes (defined as best/worst out of 1000, ie. ±3 SD) of human capabilities (defined as broadly as possible, including eg running) where the capability has a cardinal scale, the absolute range is surprisingly often around 2-3x. There’s no obvious reason that it should be 2-3x rather than 10x or 100x or lots of other numbers*, so it certainly seems like the human range is quite narrow and we are, from a big picture view going from viruses to hypothetical galaxy-spanning superintelligences, stamped out from the same mold. (There is probably some sort of normality + evolution + mutation-load justification for this but I continue to wait for someone to propose any quantitative argument which can explain why it’s 2-3x.)
You could also look at parts of cognitive tests which do allow absolute, not merely relative, measures, like vocabulary or digit span. If you look at, say, backwards digit span and note that most people have a backwards digit span of only ~4.5 and the range is pretty narrow (±<1 digit SD?), obviously there’s “plenty of room at the top” and mnemonists can train to achieve digit spans of hundreds and computers go to digit spans of trillions (at least in the sense of storing on hard drives as an upper bound). Similarly, vocabularies or reaction time: English has millions of words, of which most people will know maybe 25k or closer to 1% than 100% while a neural net like GPT-3 probably knows several times that and has no real barrier to being trained to the point where it just memorizes the OED & other dictionaries; or reaction time tests like reacting to a bright light will take 20-100ms across all humans no matter how greased-lightning their reflexes while if (for some reason) you designed an electronic circuit optimized for that task it’d be more like 0.000000001ms (terahertz circuits on the order of picoseconds, and there’s also more exotic stuff like photonics).
* for example, in what you might call ‘compound’ capabilities like ‘number of papers published’, the range will probably be much larger than ‘2-3x’ (most people publish 0 papers and the most prolific author out of 1000 people probably publishes 100+), so it’s not like there’s any a priori physical limit on most of these. But these could just break down into atomic capabilities: if paper publishing is log-normal because it’s intelligence X ideas X work X … = publications, then a range of 2-3x in each one would quickly give you the observed skewed range. But the question is where that consistent 2-3x comes from; why couldn’t it be utterly dominated by one step where there’s a range of 1-10,000, say?
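As a rough, purely illustrative simulation of that compounding point (the number of factors and their spread are invented, chosen only so that each single factor spans roughly 2x between the best and worst of 1000):

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_factors = 1000, 8

def extreme_ratio(x):
    """Best / worst in the sample (the 'best vs. worst out of 1000' comparison)."""
    return x.max() / x.min()

# Each atomic ability: normal around 1.0 with a narrow spread, so the
# best/worst-of-1000 ratio is roughly 2x (illustrative parameters only).
factors = rng.normal(loc=1.0, scale=0.12, size=(n_people, n_factors))
factors = np.clip(factors, 0.01, None)  # keep abilities positive

per_factor_ratio = extreme_ratio(factors[:, 0])
compound = factors.prod(axis=1)  # e.g. intelligence x ideas x work x ...
compound_ratio = extreme_ratio(compound)

print(f"single-factor best/worst ratio:        {per_factor_ratio:.1f}x")  # ~2x
print(f"compound (product of {n_factors}) best/worst ratio: {compound_ratio:.1f}x")  # ~8-10x, right-skewed
```

The product ends up both much wider-ranged and right-skewed than any single factor, although this does not answer the question of why each atomic step would sit at 2-3x in the first place.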
That’s what I was thinking about. Do you still have it on gwern.net? And can you link it please?
Some important implications here:
Eliezer’s picture of the intelligence spectrum is far more right than Dragon god’s, and the claim of a broad spectrum needs to be reframed more narrowly.
This does suggest that AI intelligence could end up much better than that of real-life humans, even with limitations. That is, we should expect quite large AI-to-human capability differentials compared to human-on-human capability differentials.
Sorry, can you please walk me through these calculations (the 1.6x and 0.7x figures)?
Basically, the standard deviation here is 15 and the median is 100, so what I did was multiply the standard deviation by the number of SDs, then add it to or subtract it from 100 depending on whether the SD count is positive or negative.
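Spelled out as a minimal sketch (the mean of 100 and SD of 15 are just the conventional IQ scoring scale):

```python
def iq_from_sd(n_sd, median=100, sd=15):
    """Convert a number of standard deviations into an IQ-scale score."""
    return median + n_sd * sd

for n_sd in (4, -2):
    score = iq_from_sd(n_sd)
    print(f"{n_sd:+d} SD -> score {score}, i.e. {score / 100:.1f}x the median "
          f"({score - 100:+d}%)")
# +4 SD -> score 160, i.e. 1.6x the median (+60%)
# -2 SD -> score 70,  i.e. 0.7x the median (-30%)
```

That is where the 1.6x and 0.7x figures come from; whether those ratios say anything about underlying ability is the normalization worry raised next.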
But 15 isn’t the raw difference in IQ test scores. The underlying raw test scores are (re)normalised to a distribution with a mean of 100 and a standard deviation of 15.
We don’t know what percentage difference in underlying cognitive ability (the g factor) those 15 points represent.
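A small sketch of why that is: IQ scoring only uses rank order, so two hypothetical latent-ability scales with very different raw spreads (both invented for illustration) get mapped to identical IQ scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000

# Two hypothetical latent-ability scales with the same rank order but very
# different raw spreads (both entirely made up for illustration).
latent_narrow = rng.normal(100, 5, n)     # here +4 SD is only ~20% above the median
latent_wide = np.exp(latent_narrow / 20)  # monotone transform: much larger raw ratios

def to_iq(raw):
    """Rank-normalise raw scores to the conventional IQ scale (mean 100, SD 15)."""
    ranks = stats.rankdata(raw) / (len(raw) + 1)  # percentiles in (0, 1)
    return 100 + 15 * stats.norm.ppf(ranks)       # map through the normal quantile

iq_a, iq_b = to_iq(latent_narrow), to_iq(latent_wide)
print(np.allclose(iq_a, iq_b))  # True: identical IQ scores, very different raw ratios
```

So the 1.6x/0.7x figures describe the scoring convention, not measured ratios of underlying ability.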
Do you remember which post it was?
I wish I did, but I don’t right now.
Yeah, this is probably a big question mark here, and an important area to study.