Gaussian Tails and Exceptional Performers
West African athletes dominate sprinting events, East Africans excel in endurance running, and, despite their tiny population, Icelanders have shown remarkable prowess in weightlifting competitions. We examine the Gaussian approximation of a simple additive genetic model as an explanation for these observations.
The Simple Additive Genetic Model
Let’s begin by considering a simple additive genetic model. In this model, a trait T is influenced by n independent genes, each contributing a small effect, along with environmental factors. We can represent this mathematically as:
T = G₁ + G₂ + … + Gₙ + E
Where Gᵢ represents the effect of the i-th gene, and E represents environmental factors.
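The Central Limit Theorem (CLT) states that the suitably normalized sum of many independent random variables, each with finite mean and variance, approaches a normal (Gaussian) distribution, largely regardless of the underlying distribution of the individual variables. (When the variables are not identically distributed, a mild extra condition is needed, roughly that no single term dominates the total variance.) Mathematically, if we have n independent random variables X₁, X₂, …, Xₙ, each with mean μᵢ and variance σᵢ², then their sum S = X₁ + X₂ + … + Xₙ approaches a normal distribution as n increases:
(S - μ) / σ → N(0,1) as n → ∞
Where μ = Σμᵢ and σ² = Σσᵢ² are the mean and variance of S. (With these definitions σ already grows like √n, so no additional √n factor appears.)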
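To make the CLT statement concrete, here is a minimal simulation sketch of the additive model. The specifics are assumptions made only for illustration and are not part of the model above: each gene contributes 0 or 1 with equal probability, the environmental term E is Gaussian with unit variance, and the function names are hypothetical.

```python
import math
import random
import statistics


def simulate_trait(n_genes, rng):
    """One individual's trait T = G_1 + ... + G_n + E.

    Assumption (illustrative only): each G_i is 0 or 1 with probability 1/2,
    and E is a standard normal environmental term.
    """
    # getrandbits(n) draws n fair coin flips at once; counting 1-bits sums them.
    genetic = bin(rng.getrandbits(n_genes)).count("1")
    environment = rng.gauss(0.0, 1.0)
    return genetic + environment


def standardized_sample(n_genes, n_people, seed=0):
    """Simulate a population and standardize with the exact mean and sd of T."""
    rng = random.Random(seed)
    mu = n_genes * 0.5                       # sum of the gene means (E has mean 0)
    sigma = math.sqrt(n_genes * 0.25 + 1.0)  # sqrt of summed variances, genes plus E
    return [(simulate_trait(n_genes, rng) - mu) / sigma for _ in range(n_people)]


if __name__ == "__main__":
    z = standardized_sample(n_genes=200, n_people=20_000)
    within_one_sd = sum(abs(v) <= 1.0 for v in z) / len(z)
    print(f"mean  = {statistics.fmean(z):+.3f}  (N(0,1) gives 0)")
    print(f"stdev = {statistics.pstdev(z):.3f}  (N(0,1) gives 1)")
    print(f"share within 1 sd = {within_one_sd:.3f}  (N(0,1) gives about 0.683)")
```

Note that the sample is standardized by dividing by σ, the standard deviation of the whole sum, which is the normalization the CLT calls for.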
This model seems particularly applicable to sports like running and weightlifting, which are widely practiced around the world and rely on fundamental physiological traits. The global nature of these sports suggests that differences in performance are less likely to be solely due to cultural or environmental factors.
Caution (can be skipped)
However, we must exercise caution in interpreting these models. While genetic factors likely play a role in patterns of exceptional performance, we must be wary of inferring purely genetic origins. Environmental and cultural factors can have significant impacts. For example:
The success of Ethiopian long-distance runners may be partly attributed to high-altitude training in the Ethiopian highlands, a practice that has been adopted with success by athletes from other regions.
Cultural emphasis on certain sports in specific regions can lead to more robust talent identification and training programs.
Socioeconomic factors can influence access to training facilities, nutrition, and coaching.
Moreover, the simple additive genetic model assumes that genetic effects are purely additive and that the resulting trait is normally distributed, which may not always be the case, especially at the extremes. Gene-environment interactions and epistasis (gene-gene interactions) may become more significant at these extremes, leading to deviations from the expected Gaussian distribution.
Minority Overrepresentation in Extreme Performance
The phenomenon of minority overrepresentation in certain fields of extreme performance provides intriguing insights into the nature of trait distributions. A striking example is the dominance of East African runners in major marathons. Since the 1968 Olympics, men and women from Kenya and Ethiopia have dominated the 26.2-mile event. Since 1991, a Kenyan or Ethiopian has won the men’s race at the Boston Marathon 26 of the last 29 times, and East African women have worn the laurel wreath 21 times in the last 24 years at Boston. Upsets do happen: in 2018 a Japanese amateur won the men’s race and an American woman won the women’s race.
To understand why even small differences in mean genetic potential can lead to substantial overrepresentation at the extreme tails of performance, let’s examine the mathematics of Gaussian distributions.
Consider two populations with normally distributed traits, G and H, where H has a higher mean (m_H) than G (m_G), but they share the same standard deviation s. The ratio of the two probability densities (H relative to G) at a trait value k standard deviations above the mean of G is:
R = e^(kd − d²/2)
Where:
k is the number of standard deviations from the mean of G
d = (m_H − m_G) / s, the difference between means in units of standard deviation
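The formula for R follows from dividing the Gaussian density of H by that of G at the same trait value. A short derivation, where the symbols x (the trait value) and f_G, f_H (the two densities) are notation introduced here for convenience:

```latex
\begin{align*}
R &= \frac{f_H(x)}{f_G(x)}
   = \frac{\exp\!\left(-\frac{(x - m_H)^2}{2s^2}\right)}
          {\exp\!\left(-\frac{(x - m_G)^2}{2s^2}\right)},
   \qquad x = m_G + ks, \quad m_H = m_G + ds \\
  &= \frac{\exp\!\left(-\frac{(k - d)^2}{2}\right)}
          {\exp\!\left(-\frac{k^2}{2}\right)}
   = \exp\!\left(\frac{k^2 - (k - d)^2}{2}\right)
   = \exp\!\left(kd - \frac{d^2}{2}\right).
\end{align*}
```

This is a ratio of densities at exactly k standard deviations; the ratio of tail probabilities (at or beyond k) is somewhat different, though it grows in a similar way.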
Let’s consider four cases: d = 2⁄3, d = 4⁄5, d = 1, and d = 5⁄4. Evaluating R = e^(kd − d²/2) shows how the overrepresentation ratio R changes at different standard deviations:

| k | d = 2⁄3 | d = 4⁄5 | d = 1 | d = 5⁄4 |
|---|---------|---------|-------|---------|
| 1 | 1.56 | 1.62 | 1.65 | 1.60 |
| 2 | 3.04 | 3.60 | 4.48 | 5.58 |
| 3 | 5.92 | 8.00 | 12.18 | 19.47 |
| 4 | 11.52 | 17.81 | 33.12 | 67.95 |
| 5 | 22.45 | 39.65 | 90.02 | 237.16 |
| 6 | 43.72 | 88.23 | 244.69 | 827.78 |
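For readers who want to recompute the ratios for other values of k and d, here is a minimal sketch that evaluates R = e^(kd − d²/2) directly. The function names are hypothetical and the grid of k and d values is simply the one used in the text.

```python
import math


def overrepresentation_ratio(k, d):
    """Density ratio R = exp(k*d - d**2 / 2) at k sd above the mean of G."""
    return math.exp(k * d - d * d / 2.0)


def ratio_table(ks, ds):
    """Rows of (k, [R(k, d) for each d]); the caller chooses the grid."""
    return [(k, [overrepresentation_ratio(k, d) for d in ds]) for k in ks]


if __name__ == "__main__":
    ds = [2 / 3, 4 / 5, 1.0, 5 / 4]
    header = "k".ljust(3) + "".join(f"d={d:.3f}".rjust(12) for d in ds)
    print(header)
    for k, row in ratio_table(range(1, 7), ds):
        print(str(k).ljust(3) + "".join(f"{r:.2f}".rjust(12) for r in row))
```

Because R depends on k only through the term kd, each additional standard deviation multiplies the ratio by a constant factor e^d.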
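To put these numbers in perspective, let’s consider the rarity of individuals at least k standard deviations above the mean of a normal distribution:

| k | Percentage | 1 in X (approx.) |
|---|------------|------------------|
| 1 | 15.87% | 6.3 |
| 2 | 2.28% | 44 |
| 3 | 0.135% | 741 |
| 4 | 0.0032% | 31,600 |
| 5 | 0.000029% | 3,490,000 |
| 6 | 0.0000001% | 1,010,000,000 |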
In other words, a ‘6 Sigma’ event is a 1 in a billion event. A very naive inference would predict that there are 8 people in the world at this level. [1]
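The rarity figures are upper-tail probabilities of the standard normal distribution, which can be computed with the complementary error function. A minimal sketch, where the function name is hypothetical and the 8 billion world population is the rough figure used in the text:

```python
import math


def upper_tail(k):
    """P(Z >= k) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


if __name__ == "__main__":
    world_population = 8e9  # rough figure from the text
    for k in range(1, 7):
        p = upper_tail(k)
        print(
            f"k={k}: {100 * p:.7f}%  about 1 in {1 / p:,.0f}  "
            f"naive expected count worldwide: {world_population * p:,.0f}"
        )
```

The "expected count" column is the same very naive inference as in the text: it treats all of humanity as one Gaussian population.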
Now, let’s consider the case of East African marathon runners. The population of East Africa is approximately 250 million, while the global population is about 8 billion. This means East Africans represent about 3.125% of the world population.
For world-record level running performance, we’re likely looking at something around k=5 or k=6. Evaluating R = e^(kd − d²/2) at these two levels gives:
For d = 2/3: East Africans would be overrepresented by a factor of about 22-44
For d = 4/5: East Africans would be overrepresented by a factor of about 40-88
For d = 1: East Africans would be overrepresented by a factor of about 90-245
For d = 5/4: East Africans would be overrepresented by a factor of about 237-828
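Given that East Africans represent about 3.125% of the world population, winning close to 100% of major marathons requires a very large per-capita ratio. Taking the Boston men’s figure (26 of 29 wins, about 90%), the odds that a winner is East African are about 8.7 to 1, against population odds of about 1 to 31, which implies R ≈ 8.7 × 31 ≈ 270. At k = 5 to 6 this suggests a d value roughly between 1 and 5⁄4. However, as we’ll discuss in the next section, we should be cautious about drawing definitive conclusions from these mathematical models alone.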
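The inference from win share to d can be sketched in two steps: convert the share of winners and the population share into a per-capita ratio R using odds, then solve kd − d²/2 = ln R for d. This is only a sketch under the Gaussian model's assumptions; the function names are hypothetical, and the inputs (26 of 29 Boston wins, 250 million out of 8 billion people) are the figures quoted in the text.

```python
import math


def implied_ratio(share_of_winners, share_of_population):
    """Per-capita ratio R implied by a group's share of winners.

    Uses odds rather than raw shares, so a win share close to 100% maps to a
    large but finite R.
    """
    winner_odds = share_of_winners / (1.0 - share_of_winners)
    population_odds = share_of_population / (1.0 - share_of_population)
    return winner_odds / population_odds


def implied_d(ratio, k):
    """Solve k*d - d**2/2 = ln(ratio) for the smaller root d."""
    discriminant = k * k - 2.0 * math.log(ratio)
    if discriminant < 0:
        raise ValueError("no real solution: ratio too large for this k")
    return k - math.sqrt(discriminant)


if __name__ == "__main__":
    r = implied_ratio(26 / 29, 250e6 / 8e9)
    print(f"implied R = {r:.1f}")
    for k in (5, 6):
        print(f"k={k}: implied d = {implied_d(r, k):.2f}")
```

The smaller root is taken because the larger one would place the mean of H above the threshold itself, which is not the regime discussed here.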
The “Tails Come Apart” Phenomenon
(see also Tails Come Apart)
While the Gaussian model explains many observations, extreme outliers often deviate from this model.
Different Causes: Extreme outliers may result from fundamentally different mechanisms than those governing the main distribution. For example, the tallest man in recorded history, Robert Wadlow, reached an extraordinary height of 8′11″ (2.72m) due to a rare condition causing excessive growth hormone production.
Breakdown of the Additive Model: At extreme values, the simple additive genetic model may break down. Interactions between genes (epistasis) or between genes and environment may become more significant, leading to deviations from the expected Gaussian distribution.
Gaussian approximation fails at the tails
As an example, for a sum of symmetric Bernoulli trials the CLT approximation overestimates the fatness of the tails compared to an exact calculation or large deviation theory, see here for a GPT calculation.
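A small check of this claim compares the exact binomial tail with its Gaussian approximation. This is a sketch under stated assumptions, not the calculation referred to above: it uses n = 100 fair Bernoulli trials (p = 1/2) and a continuity correction of 0.5 so that the discrete and continuous tails are comparable, and the function names are hypothetical.

```python
import math


def binomial_upper_tail(n, threshold, p=0.5):
    """Exact P(X >= threshold) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(threshold, n + 1)
    )


def gaussian_upper_tail(n, threshold, p=0.5):
    """CLT approximation to P(X >= threshold), with a 0.5 continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1.0 - p))
    z = (threshold - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    n = 100  # mean 50, standard deviation 5 when p = 1/2
    for k in (2, 4, 6):
        threshold = 50 + 5 * k
        exact = binomial_upper_tail(n, threshold)
        approx = gaussian_upper_tail(n, threshold)
        print(f"k={k}: exact={exact:.3e}  gaussian={approx:.3e}  "
              f"exact/gaussian={exact / approx:.2f}")
```

For skewed trials (p far from 1/2) the comparison can go the other way in one of the tails, so the direction of the error should not be assumed in general.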
Final Thoughts
We posited a simple additive genetic model for long-distance running and used the Central Limit Theorem approximation to estimate the likelihood of extreme outliers for populations with different means. Using observed frequencies of extreme outliers, and making the possibly questionable assumption that the long tails are in fact well approximated by the bell curve, we obtained an estimate for the difference in mean traits.
Note that if you knew both the ratio of extreme performers and the difference in means for two populations, you could use them to test whether the distribution is in fact well approximated by a bell curve at the tails.
More interestingly, if one is measuring a proxy trait and wonders to what degree it explains performance, observing that a subpopulation with a higher mean is overrepresented at the tails can indicate how relevant this trait is for predicting extreme performance. We leave further inferences to the reader.
[1] Surprisingly to me, this naive extrapolation is just about compatible with some seemingly outrageous claims of high IQs: 188 for Marilyn vos Savant (Guinness World Record) and 195 for Christopher Langan. These are roughly 6 standard deviations above the mean (about 5.9 and 6.3, respectively) for a bell curve with mean 100 and standard deviation 15. One would think these are meaningless numbers, more a result of test ceilings, test error, test inaccuracy, and simple lies, but perhaps there is something to it.