gwern comments on What is an appropriate sample size when surveying billions of data points?

gwern 23 Aug 2024 22:39 UTC
8 points
You mention ‘billions of data points’, but your stated goal is to learn ‘how accessible the Internet is to people with disabilities’, a question for which a sample size in the hundreds to thousands would be more like it. That mismatch suggests you need to think seriously about what the purpose of your survey is and how it will be used. Planning sample size is the least of your problems.
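To make the ‘hundreds to thousands’ figure concrete, here is a minimal sketch of the textbook sample-size formula for estimating a proportion from a simple random sample, n = z² p(1 − p) / e². The function name and the chosen margins are my own illustrative assumptions, and it only holds if the sample really is random, which is the hard part.

```python
import math

def sample_size_for_proportion(margin, p=0.5, z=1.96):
    """Smallest n such that a normal-approximation 95% interval for a
    proportion has half-width <= margin. p=0.5 is the worst case.
    Assumes a simple random sample (an assumption, not a given)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Illustrative margins of error (hypothetical choices).
for margin in (0.10, 0.05, 0.03):
    print(f"margin ±{margin:.2f}: n = {sample_size_for_proportion(margin)}")
```

A ±5% margin needs about 385 sites and ±3% about 1,068, regardless of whether the population is a million domains or a billion pages.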

It sounds like you think you can just take some dataset like Common Crawl, crunch numbers about ‘the top million domains’, and come up with a conclusion like ‘X% of the Internet is unusable’, so that all you need to know is how many domains to analyze before you turn the crank and see what pops out with p < 0.05. But that’s not the case. With datasets like this, you will find many parameters to be “statistically significant” because you are doing near-population-level analysis: your sampling error is tiny, and nearly all of your error will be the (unknown and usually impossible to measure) systematic error & bias, which doesn’t go away with more data (although Meng 2014 is an interesting discussion of how much systematic error goes away when you are sampling a large fraction of the entire population). At scale, all your results may tell you is something about the many serious flaws and biases in these sorts of Internet datasets. They may be all we have, but one shouldn’t fool oneself into thinking that they are any good. (As Tukey put it, a burning desire for an answer doesn’t mean that a given dataset or survey methodology will be able to provide it.)
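A rough sketch of why more data doesn’t rescue you: total error is roughly sqrt(bias² + sampling error²), and only the second term shrinks with n. The numbers here are made up for illustration (a hypothetical crawl in which 35% of sampled sites fail some check while the true rate among sites people actually use is 30%), not anything measured.

```python
import math

def standard_error(p, n):
    """Sampling standard error of a proportion from n independent draws."""
    return math.sqrt(p * (1 - p) / n)

def rmse(p_sampled, bias, n):
    """Root-mean-square error when the sampling frame is off by a fixed bias."""
    return math.sqrt(bias**2 + standard_error(p_sampled, n)**2)

# Hypothetical values: the frame's rate and its offset from the truth.
P_SAMPLED = 0.35
BIAS = 0.05

for n in (100, 10_000, 1_000_000):
    se = standard_error(P_SAMPLED, n)
    print(f"n={n:>9,}  sampling SE={se:.5f}  total RMSE={rmse(P_SAMPLED, BIAS, n):.5f}")
```

Past a few thousand samples the RMSE is essentially just the bias, which is why a tiny p-value at n = 1,000,000 mostly tells you about the dataset and not about the Internet.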