They say you shouldn’t roll your own encryption, which is why I’m posting this here, so it can be unrolled if it’s too unsafe.
Problem: Astral Codex Ten finished scoring the 2023 prediction results, but the identifier attached to most people’s scores is their email address. Since people wouldn’t want those published, what’s an easy way to get people their scores?
You could email everyone, but then you have to interact with an email server, and nobody else can do cool analysis of the scores and whatever other data is in the document.
My proposal:
There are ~10,000 email addresses. Hash the email addresses using a hash that only maps to ~10 million values.
Replace the emails in the document with the hashes. A Python script to do this could be written in a few minutes.
Give everyone access to that file.
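The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names `email` and `score`, the sample rows, the case/whitespace normalization, and the choice of SHA-256 reduced modulo 10 million are all placeholders, not anything the actual results file is known to use.

```python
import csv
import hashlib
import io

HASH_SPACE = 10_000_000  # ~10 million possible values, as proposed


def email_bucket(email: str, space: int = HASH_SPACE) -> int:
    """Map an email address to one of `space` values via SHA-256."""
    # Assumption: normalize case and surrounding whitespace so that
    # "Alice@Example.com " and "alice@example.com" land on the same value.
    normalized = email.strip().lower().encode("utf-8")
    digest = hashlib.sha256(normalized).digest()
    return int.from_bytes(digest, "big") % space


def pseudonymize(rows, email_field="email"):
    """Yield a copy of each row with the email replaced by its hash value."""
    for row in rows:
        out = dict(row)
        out[email_field] = f"{email_bucket(row[email_field]):07d}"
        yield out


# Hypothetical sample data standing in for the real results file.
sample = io.StringIO(
    "email,score\n"
    "alice@example.com,0.12\n"
    "bob@example.com,-0.03\n"
)
published = list(pseudonymize(csv.DictReader(sample)))

# A participant looks up their own row by hashing their address.
my_key = f"{email_bucket('Alice@Example.com'):07d}"
my_rows = [row for row in published if row["email"] == my_key]
```

Someone who forgot which address they used would simply repeat the last two lines for each of their addresses.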
If you know the email address of a participant, it’s trivial to check their score. And if you forgot which email address you used, just try each one! Odds are you will not have had a collision.
But at the same time, with 8 billion email addresses worldwide, any given hash in the document should collide with ~1000 other email addresses (8 billion addresses spread over 10 million hash values is ~800 per value), meaning you can’t just brute force the hashes and figure out each person’s address. The 10,000 real addresses occupy only ~0.1% of the hash output space, so out of the 8 billion addresses you try, ~8 million will hash to a value that appears in the document, but only 10,000 of those (~0.1%) will be the originals. So an address whose hash appears in the document is highly unlikely to be the actual address of a participant.
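The arithmetic behind this paragraph can be checked in a few lines. The sketch assumes the hash spreads addresses uniformly over its output space and that whoever is guessing enumerates all 8 billion addresses with no prior idea of who participated; the variable names are just for illustration.

```python
WORLD_ADDRESSES = 8_000_000_000  # figure used in the post
HASH_SPACE = 10_000_000          # proposed hash output space
PARTICIPANTS = 10_000            # approximate number of real addresses

# Average number of addresses worldwide that share any one hash value.
addresses_per_hash = WORLD_ADDRESSES // HASH_SPACE

# Fraction of hash values that appear in the document
# (an upper bound: ignores collisions among the participants themselves).
occupied_fraction = PARTICIPANTS / HASH_SPACE

# Addresses worldwide whose hash appears somewhere in the document.
matching_addresses = WORLD_ADDRESSES * PARTICIPANTS // HASH_SPACE

# Chance that a matching address found by blind enumeration is a participant.
hit_rate = PARTICIPANTS / matching_addresses

print(addresses_per_hash, occupied_fraction, matching_addresses, hit_rate)
```

Under these assumptions each hash is shared by about 800 addresses, about 8 million addresses match something in the document, and roughly 1 in 800 of those matches is a real participant.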
If there are a few victims of the birthday paradox, they could probably just email a request for their line number in the document. It may be better to use a larger hash space to avoid internal collisions (within the data set), but then you lower the number of external collisions. My back-of-the-envelope estimate expects at least several internal collisions with a 10 million output space. 100 million makes it 0 or 1.
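The back-of-the-envelope numbers can be reproduced with the standard birthday approximation: with n items hashed uniformly into N values, the expected number of colliding pairs is n(n-1)/(2N). The function names below are made up for this sketch, and the probability uses a Poisson approximation rather than an exact count.

```python
import math


def expected_colliding_pairs(n: int, space: int) -> float:
    """Expected number of pairs of items that share a hash value."""
    return n * (n - 1) / (2 * space)


def prob_any_collision(n: int, space: int) -> float:
    """Approximate chance of at least one collision (Poisson approximation)."""
    return 1 - math.exp(-expected_colliding_pairs(n, space))


for space in (10_000_000, 100_000_000):
    pairs = expected_colliding_pairs(10_000, space)
    chance = prob_any_collision(10_000, space)
    print(f"space={space:>11,}  expected pairs={pairs:.2f}  P(any)={chance:.2f}")
```

With 10,000 addresses this gives about 5 expected colliding pairs for a 10 million space and about 0.5 for a 100 million space, which matches "several" and "0 or 1" above.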
Which hash? Not sure. Maybe SHA-256, then just delete characters off the end of the digest until the space is ~10,000,000?
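One wrinkle with trimming characters: a hex digest shrinks by a factor of 16 per character removed, so no length lands exactly on 10 million. The sketch below compares the truncation idea with an alternative, taking the full digest modulo the desired space. Both helper names and the lowercase/strip normalization are assumptions made for illustration.

```python
import hashlib


def sha256_hex(email: str) -> str:
    # Assumption: normalize case and surrounding whitespace before hashing.
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def truncated_hex(email: str, keep: int) -> str:
    """Keep only the first `keep` hex characters of the 64-character digest."""
    return sha256_hex(email)[:keep]


def modulo_bucket(email: str, space: int = 10_000_000) -> int:
    """Reduce the full digest to an integer in the range [0, space)."""
    return int(sha256_hex(email), 16) % space


# Size of the output space for a few truncation lengths.
for keep in (5, 6, 7):
    print(keep, "hex chars ->", f"{16 ** keep:,}", "values")
```

Keeping 6 hex characters gives 16,777,216 values and keeping 5 gives 1,048,576, so truncation only brackets the target, while the modulo version hits 10,000,000 exactly at the cost of a very slight non-uniformity that is negligible for a 256-bit digest.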
Please discuss how safe/unsafe this is. Thanks for your time.
[Question] Making 2023 ACX Prediction Results Public