Hmm. You’re right that test-retest reliability typically refers to a correlation coefficient, and I was using the standard error of measurement. I’ll edit the grandparent to use the correct terms.
I’m not sure I agree with your method because it seems odd to me that the standard deviation doesn’t impact the magnitude of the regression-to-the-mean effect. It seems like you could calculate the test-retest reliability coefficient from the population mean, the population std, and the standard error of measurement, and there might be different reliability coefficients for male and female test-takers; that’d probably be the simpler way to calculate it.
Well, it delivers reasonable numbers, it seems to me that one ought to employ reliability somehow, it’s supported by the two links I gave, and it makes sense to me: standard deviation doesn’t come into it because we’ve already singled out a specific datapoint; we’re not asking how many test-takers will hit 800 (where standard deviation would be very important) but, given that a test-taker has hit 800, how far will they fall back?
Now that I’ve run through the math, I agree with your method. Supposing the measurement error is independent of score (which can’t be exactly true because of the bounds, and in general probably isn’t true), we can calculate the reliability coefficient as (pop var) / (pop var + measurement var) ≈ .93 for women and .94 for men. The resulting formulas are exactly the same, and the difference between the numbers I calculated and the numbers you calculated comes from our differing estimates of the reliability coefficient.
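To make that concrete, here’s a minimal sketch of the calculation in Python. The numbers are placeholders I picked so the reliability lands near the .93 figure above, not the actual population stats:

```python
# Placeholder figures -- substitute the real population stats and the
# published standard error of measurement.
pop_mean = 500.0   # population mean score (assumed)
pop_std = 110.0    # population std of true scores (assumed)
err_std = 30.0     # standard error of measurement (assumed)

# Reliability coefficient: (pop var) / (pop var + measurement var).
reliability = pop_std**2 / (pop_std**2 + err_std**2)

# Linear regression-to-the-mean correction for one observed score:
# expected true score = mean + reliability * (observed - mean).
observed = 800.0
expected_true = pop_mean + reliability * (observed - pop_mean)

print(f"reliability = {reliability:.3f}")
print(f"expected true score given an observed {observed:.0f}: {expected_true:.1f}")
```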
In general, the reliability coefficient doesn’t take extra distributional knowledge into account. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population distribution as your prior and the measurement distribution as your likelihood; the posterior is the renormalized product of the two. I don’t think that using a linear correction based on the reliability coefficient would get that right, but I haven’t worked it out to show the difference.
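For what it’s worth, a rough numerical sketch of that comparison, with all distributions and parameters made up purely for illustration:

```python
import numpy as np

# Toy setup: power-law-distributed true scores, normally distributed
# measurement error, posterior computed on a grid.
grid = np.linspace(1.0, 100.0, 100_001)   # candidate true scores (uniform grid)
prior = grid ** -2.5                      # power-law prior, unnormalized
prior /= prior.sum()                      # normalize on the grid

err_std = 5.0       # std of the normal measurement error (assumed)
observed = 40.0     # the score we actually saw

# Likelihood of the observation for each candidate true score.
likelihood = np.exp(-0.5 * ((observed - grid) / err_std) ** 2)

# Posterior: renormalized product of prior and likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()
posterior_mean = (grid * posterior).sum()

# Linear correction using the reliability coefficient from the prior's moments.
prior_mean = (grid * prior).sum()
prior_var = ((grid - prior_mean) ** 2 * prior).sum()
reliability = prior_var / (prior_var + err_std**2)
linear_estimate = prior_mean + reliability * (observed - prior_mean)

print(f"posterior mean (exact):      {posterior_mean:.1f}")
print(f"linear reliability estimate: {linear_estimate:.1f}")
# These disagree badly here: with a heavy-tailed prior, the grid posterior
# barely shrinks a high observed score, while the linear formula drags it
# most of the way back toward the prior mean.
```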
That makes sense, but I think the SAT is constructed, like IQ tests, to be normally rather than power-law distributed, so in this case we get away with a linear correction based on reliability.