It may be better to use a larger hash space to avoid internal collisions (within the data set), but then you also lower the number of external collisions.
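To put rough numbers on that trade-off, here is a minimal sketch, assuming hashes are uniform and independent. The function names and the example sizes (1000 entries, a hash space of one million values) are hypothetical illustrations, not figures from the actual data set.

```python
import math


def internal_collision_prob(n_entries: int, hash_space: int) -> float:
    """Birthday approximation: chance that at least two of the
    n_entries addresses in the data set share a hash value."""
    return 1.0 - math.exp(-n_entries * (n_entries - 1) / (2.0 * hash_space))


def external_match_prob(n_entries: int, hash_space: int) -> float:
    """Chance that one outside address (not in the data set) hashes
    to a value used by at least one entry, assuming uniform hashes."""
    return 1.0 - (1.0 - 1.0 / hash_space) ** n_entries


# Hypothetical sizes, chosen only so the external rate is about 1/1000.
n = 1000
for space in (10**6, 10**9):
    print(space, internal_collision_prob(n, space), external_match_prob(n, space))
```

Growing the hash space shrinks both probabilities together, so fewer internal collisions also means fewer chance matches from outside addresses.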
To make sure I understand this concern:
Are you thinking someone may want plausible deniability? “Yes, my email hashes to this entry with a terrible Brier score but that could’ve been anyone!”
Plausible deniability, yes, and reason-agnostic. It’s hard to know why someone might not want to be known to have their address here, but with my numbers above, they would have the statistical backing that 1/1000 addresses will appear in the set by chance. Someone who wants to deny it could then say “for every address actually in the set, 1000 will appear to be, so that’s only a 1/1000 chance I actually took the survey!” (Naively, of course; rest in peace rationalist@lesswrong.com)
I guess in practice it’d be the tiniest shred of plausible deniability. If your prior is that alice@example.com almost surely didn’t enter the contest (p=1%) but her hash is in the table (which happens by chance with p=1/1000), then you Bayesian-update to a 91% chance that she did in fact enter the contest: 0.01 / (0.01 + 0.99 × 0.001) ≈ 0.91, assuming an entrant’s hash always shows up in the table. If you think she had even a 10% chance on priors, then her hash being in the table makes you 99% sure it’s her.
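For anyone who wants to try other priors, here is a minimal sketch of that update. The function name is made up for illustration, and it assumes that anyone who actually entered always has their hash in the table (a likelihood of 1).

```python
def posterior_entered(prior: float, chance_match: float) -> float:
    """Bayes' rule for P(entered | hash is in the table).

    prior: P(entered) before looking at the table.
    chance_match: P(hash in table | did not enter), e.g. 1/1000.
    Assumes P(hash in table | entered) = 1.
    """
    evidence = prior * 1.0 + (1.0 - prior) * chance_match
    return prior / evidence


for prior in (0.01, 0.10):
    print(f"prior {prior:.0%} -> posterior {posterior_entered(prior, 1 / 1000):.0%}")
```

With the 1% and 10% priors above, this reproduces the 91% and 99% figures.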