> The probability of having as unbalanced a string as #121 in a sample of 62 random strings is 12%…
>
> Given that balance between zeros and ones is one of the most basic things to look at, Round 2 participants got unlucky…
>
> Contestants justifiably gave probabilities near 0 for both strings.
Very cool! One nitpick:
This feels like a failure to apply Bayes. Let E = “#121 is 3.1 or more stddevs away from evenly split” and H = “#121 is ‘real’”. Then it is true that real randomness doesn’t do 3.1 stddevs very often, so P(E|H) is roughly 1:400 odds. But “given that balance between zeros and ones is one of the most basic things to look at”, human participants probably also aren’t going to do 3.1 stddevs very often, so P(E|!H) is also very low! How low? Based on the best dataset I have for how humans actually behave on this task (i.e. your dataset, so admittedly this method would not have actually been usable by contest participants), the odds a participant would create a string as unbalanced as #121 (i.e. 56 or fewer of whichever digit was rarer) are 0:62!! But there was one participant at each rare-digit count from 57 through 60, so I’m going to irresponsibly eyeball that curve and say that P(E|!H) is probably somewhere between 1:50 and 1:200 odds.

With our 1:400 from earlier, this gives a likelihood ratio of between 2:1 and 8:1 in favor of #121 being human-generated; without more evidence to go on (i.e. starting from even prior odds), the correct answer would thus have been between 11% and 33%. “Probabilities near 0” are not justified, unless you’re incorporating other evidence as well, or you think both that the correct answer is nearer the 11% end of my range and that 11% counts as “near 0”. (Could one guess P(E|!H) without seeing your dataset? Obviously I’m spoiled on the answers now, so I don’t know, but I think I might have guessed that “one or two” of 62 participants would do something that extreme, which would correspond to an answer of ~10%.)
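(If you want to check the arithmetic, here is a quick Python sketch of the calculation above: the exact binomial tail for P(E|H), then the posterior for each end of my eyeballed P(E|!H) range, assuming even prior odds. The 1:200 and 1:50 values are just my guesses from the previous paragraph, not anything measured.)

```python
from math import comb

n = 150  # bits per string

# P(E|H): a truly random string has 56 or fewer of its rarer digit,
# i.e. its count of 1s is <= 56 or >= 94 (about 3.1 stddevs from 75).
one_tail = sum(comb(n, k) for k in range(57)) / 2**n
p_E_given_H = 2 * one_tail  # both tails, by symmetry
print(f"P(E|H) = {p_E_given_H:.4f}")  # ~0.0024, i.e. roughly 1:400

# P(E|!H): how often a human-written string is that unbalanced. The
# dataset gives 0 of 62, so these bounds are my eyeballed guesses,
# not measured values.
for p_E_given_not_H in (1 / 200, 1 / 50):
    likelihood_ratio = p_E_given_not_H / p_E_given_H  # >1 favors "human"
    p_real = 1 / (1 + likelihood_ratio)  # posterior P(real | E), even prior
    print(f"P(E|!H) = {p_E_given_not_H:.4f} -> P(real | E) = {p_real:.0%}")
# Comes out near the ends of the 11%-33% range above.
```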
Unrelatedly, your post was extra fun for me because I happened to try something along the same lines (but much less well executed) for my middle school science fair project: I had people do Round 1 only (and only ~30 bits, not 150), and then just did some analysis myself to show ways their strings differed on average from random (primarily just looking at run lengths). Unfortunately I did not put any effort into recruiting motivated participants; I gathered data in part by getting teachers to force a bunch of uninterested kids to play along, and unsurprisingly ended up with some garbage responses like all 0s or strictly alternating 0s and 1s (and presumably most of the others, while not total garbage, did not represent best effort).