I ran some simulations in Python, and (if I did this correctly), it seems that if r > 0.95, you should expect the most extreme data-point on one variable to also be the most extreme on the other variable over 50% of the time (even more so if sample size n ≤ 100)
http://nbviewer.jupyter.org/github/ricardoV94/stats/blob/master/correlation_simulations.ipynb
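For anyone who wants to reproduce this without opening the notebook, here is a minimal sketch of that kind of simulation (my own code, not the notebook's; it assumes standard bivariate normal samples, and the helper name `p_double_max` is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_double_max(r, n, trials=20_000):
    """Monte Carlo estimate of P(the same observation is the maximum
    of both variables) for n bivariate-normal pairs with correlation r."""
    cov = np.array([[1.0, r], [r, 1.0]])
    # shape (trials, n, 2): `trials` independent samples of size n
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=(trials, n))
    x, y = xy[..., 0], xy[..., 1]
    # fraction of trials where the argmax index coincides
    return np.mean(x.argmax(axis=1) == y.argmax(axis=1))

print(p_double_max(0.95, 100))  # should land in the vicinity of the >50% claim
print(p_double_max(0.0, 100))   # independent case: baseline of 1/n = 0.01
```

With r = 0 the estimate should sit right at the 1/n baseline, which is a handy sanity check on the sampler.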
You can simulate it out easily, yeah, but the exact answer seems more elusive. I asked on CrossValidated whether anyone knew the formula for ‘the probability of the maximum being the same observation on both variables, given an r and n’, since it seems like the sort of thing order-statistics researchers would’ve solved long ago (it’s interesting and relevant to contests/competitions/searches/screening), but no one’s given an answer yet.
I have found something interesting in the ‘asymptotic independence’ order-statistics literature: apparently it’s been proven since 1960 that the extremes of two correlated distributions are asymptotically independent (provided r ≠ ±1). So as you increase n, the probability of a double maximum decreases toward the lower bound of 1/n.
The intuition here seems to be that any fixed r acts only as a constant-factor boost, while the competition from n grows without bound; so by making n arbitrarily large, you can erode away the constant-factor boost of any r, and thus drive the double-max probability down toward 1/n.
I suspected as much from my Monte Carlo simulations (Figure 2), but nice to have it proven for the maxima and minima. (I didn’t understand the more general papers, so I’m not sure what other order statistics are asymptotically independent: it seems like it should be all of them? But some papers need to deal with multiple classes of order statistics, so I dunno—are there order statistics, like maybe the median, where the probability of being the same order in both samples doesn’t converge on 1/n?)
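The median question can at least be poked at numerically. Here's a rough sketch (my own, with an illustrative helper name `p_same_rank`; it assumes bivariate normals and just checks whether the same observation holds rank k in both samples):

```python
import numpy as np

rng = np.random.default_rng(1)

def p_same_rank(r, n, k, trials=20_000):
    """Monte Carlo estimate of P(the same observation holds rank k
    in both variables) for n bivariate-normal pairs with correlation r."""
    cov = [[1.0, r], [r, 1.0]]
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=(trials, n))
    ix = np.argsort(xy[..., 0], axis=1)[:, k]  # which observation has rank k in X
    iy = np.argsort(xy[..., 1], axis=1)[:, k]  # which observation has rank k in Y
    return np.mean(ix == iy)

# median (middle rank, odd n) vs. the 1/n independence baseline:
for n in (5, 25, 125):
    print(n, round(p_same_rank(0.5, n, k=n // 2), 3), "vs", round(1 / n, 3))
```

Whether the median's ratio to 1/n shrinks toward 1 like the maximum's does, I can't say from a sketch like this; it only shows how you'd watch the convergence empirically.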
I can do n=1 (the probability is 1, obviously) and n=2 (the probability is 1/2 + (1/π)·sin⁻¹(r), not so obviously). n=3 and up seem harder, and my pattern-spotting skills are not sufficient to intuit the general case from those two :-).
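The n=2 closed form is easy to check against simulation; a quick sketch (assuming standard bivariate normals, function names mine):

```python
import numpy as np

rng = np.random.default_rng(2)

def p_double_max_n2_exact(r):
    # the closed form for n=2: 1/2 + (1/pi) * arcsin(r)
    return 0.5 + np.arcsin(r) / np.pi

def p_double_max_n2_mc(r, trials=200_000):
    # Monte Carlo: fraction of correlated-normal pairs where the
    # same observation is the larger of both variables
    cov = [[1.0, r], [r, 1.0]]
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=(trials, 2))
    return np.mean(xy[..., 0].argmax(axis=1) == xy[..., 1].argmax(axis=1))

for r in (0.0, 0.5, 0.95):
    print(r, p_double_max_n2_exact(r), p_double_max_n2_mc(r))
```

At r = 0.5 the exact value is exactly 2/3 (since sin⁻¹(1/2) = π/6), which makes a convenient spot check.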
Heh. I’ve sometimes thought it’d be nice to have a copy of Eureqa or the other symbolic tools, to feed the Monte Carlo results into and see if I could deduce any exact formula given their hints. I don’t need exact formulas often but it’s nice to have them. I’ve noticed people can do apparently magical things with Mathematica in this vein. All proprietary AFAIK, though.
Writeup: https://www.gwern.net/Order-statistics#probability-of-bivariate-maximum