I think this would provide security against people casually accessing each other’s scores but wouldn’t provide much protection against a determined attacker.
Some problems:
There’s no protection at all for someone’s scores if the attacker knows their email address (and email addresses aren’t secret)
It’s probably not that hard to build or acquire a list of LessWrong users’ email addresses
Even if an attacker just brute-forces guesses, there are probably patterns in LessWrong users’ email addresses that make them distinguishable from random addresses (more likely to be @somerationalistgroup.com, @gmail, recognizably American, nerdy, etc.).
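To make that concrete: a dictionary attack against hashed emails is a few lines of code. Here’s a minimal sketch, assuming (hypothetically — this may not match the actual scheme) that the published data maps a plain SHA-256 of each email to a score; the addresses, names, and score are made up:

```python
import hashlib

def h(email: str) -> str:
    return hashlib.sha256(email.encode()).hexdigest()

# Hypothetical published table: hashed email -> score.
published = {h("alice.smith@gmail.com"): 42}

# The attacker hashes every plausible address and checks for a match.
guesses = [
    f"{first}.{last}@{domain}"
    for first in ("alice", "bob", "carol")
    for last in ("smith", "jones", "lee")
    for domain in ("gmail.com", "somerationalistgroup.com")
]

for email in guesses:
    if h(email) in published:
        print(f"recovered: {email} -> score {published[h(email)]}")
```

Scale the guess lists up to real name and domain lists and the attacker recovers every account whose address fits a common pattern.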
A better solution (sketched in code after this list):
Generate a random ID for each user and add it to your data
Email users their random ID
Publish the data with emails removed
(And remove anything else that could be used to reconstruct users, like jobs/locations/etc. if relevant)
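Here’s a minimal sketch of that scheme in Python (the file layout and field names are my assumptions):

```python
import csv
import secrets

# Hypothetical input data; the "email" and "score" fields are assumptions.
rows = [
    {"email": "alice@example.com", "score": 42},
    {"email": "bob@example.com", "score": 17},
]

# 1. Generate a random ID for each user.
for row in rows:
    row["id"] = secrets.token_hex(16)  # 128 random bits: unguessable

# 2. Use this mapping to email each user their ID, then discard it.
id_by_email = {row["email"]: row["id"] for row in rows}

# 3. Publish the data with emails removed.
with open("scores_public.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "score"])
    writer.writeheader()
    for row in rows:
        writer.writerow({"id": row["id"], "score": row["score"]})
```

Since the IDs carry no information about the users, the published file is exactly as safe as whatever you do with the email-to-ID mapping.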
I realized after writing this that you meant that people’s email addresses are private but their scores are public to anyone who knows their email. I’d default to not exposing people’s participation and scores unless they expected that to happen, but maybe that’s less of an issue than I was thinking. The predictability of LessWrong emails would still expose a lot of addresses.
I’d still recommend the random ID solution, though, since it’s trivial to reason about (it’s basically a one-time pad).
Thanks for your input. Ideally we wouldn’t have to go through an email server, though that may just be required at some level of security.
As for the patterns, the nice thing is that with a small output space in the millions, there are tons of plausible overlapping addresses even if you pin the guess down to a single domain. The set of English first-and-last-name combinations alone, without any numbers, is already much larger than 10 million, so even targeted domains should have plenty of collisions.
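A quick back-of-the-envelope (the list sizes here are my rough assumptions):

```python
# Rough counts; real name lists vary by source.
first_names = 20_000      # common English given names
surnames = 100_000        # common English surnames
candidates = first_names * surnames   # first.last@domain combos
output_space = 10_000_000             # "output space in the millions"

# Hashing candidates uniformly into output_space values leaves about
# candidates / output_space plausible preimages per output value.
print(f"{candidates:,} candidates into {output_space:,} outputs")
print(f"~{candidates // output_space} plausible addresses per output value")
```

With those numbers, each output value has on the order of hundreds of plausible preimages even within a single domain.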
There’s an idea in security that you should avoid weak security because it lets you trick yourself into thinking you’re protected. For example, if you’re not going to protect passwords, in some sense it’s better to leave them in plaintext than to hash them with MD5. At least in the plaintext case you know you’re not protecting them (and won’t accidentally do something unsafe with them on the assumption that hashing already protects them).
I feel like this is a case like that:
If you don’t care whether these scores become public, consider just making them public.
If you don’t think they should be public, use something that guarantees they’re not (like the random ID solution).
The solution you proposed is better than nothing and might protect some email addresses in some cases, but it raises the questions: if you need to protect these sometimes, why not all the time? And if not protecting them sometimes is OK, why bother at all?
(I should say, though, that there are benefits to making data annoying to access: your scheme will protect the data from casual snoopers and keep it out of search engines unless someone goes to the trouble of de-anonymizing and reposting it. My point is mostly just that you should ask whether you’re OK with it becoming entirely public.)