We tentatively postulated it would be fine to do this as long as seeing a name on your match page gave you no more than roughly a 5:1 update on those people having checked you.
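(To unpack that figure: a “5:1 update” caps the likelihood ratio of the evidence, i.e. seeing someone’s name on the page is at most five times as likely if they checked you as if they didn’t. By Bayes’ rule,

$$\text{posterior odds} = \text{likelihood ratio} \times \text{prior odds} \le 5 \times \text{prior odds},$$

so a 10% prior that a given person checked you (1:9 odds) can move to at most 5:9 odds, i.e. roughly 36%.)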
I would strongly advocate against this kind of reasoning; any such decision-making procedure relies on the assumption that you have correctly figured out all the ways the information can be used, and that there isn’t some clever way for an adversary to extract more than you anticipated. That assumption is bound to fail: people come up with cleverer-than-expected ways to extract private information all the time. For example (a toy sketch of the first of these follows the excerpts):
Timing Attacks on Web Privacy

We describe a class of attacks that can compromise the privacy of users’ Web-browsing histories. The attacks allow a malicious Web site to determine whether or not the user has recently visited some other, unrelated Web page. The malicious page can determine this information by measuring the time the user’s browser requires to perform certain operations. Since browsers perform various forms of caching, the time required for operations depends on the user’s browsing history; this paper shows that the resulting time variations convey enough information to compromise users’ privacy.
Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)

We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
De-anonymizing Social Networks

We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.
On the Anonymity of Home/Work Location Pairs

Many applications benefit from user location data, but location data raises privacy concerns. Anonymization can protect privacy, but identities can sometimes be inferred from supposedly anonymous data. This paper studies a new attack on the anonymity of location data. We show that if the approximate locations of an individual’s home and workplace can both be deduced from a location trace, then the median size of the individual’s anonymity set in the U.S. working population is 1, 21 and 34,980, for locations known at the granularity of a census block, census tract and county respectively. The location data of people who live and work in different regions can be re-identified even more easily. Our results show that the threat of re-identification for location data is much greater when the individual’s home and work locations can both be deduced from the data.
Bubble Trouble: Off-Line De-Anonymization of Bubble Forms

Fill-in-the-bubble forms are widely used for surveys, election ballots, and standardized tests. In these and other scenarios, use of the forms comes with an implicit assumption that individuals’ bubble markings themselves are not identifying. This work challenges this assumption, demonstrating that fill-in-the-bubble forms could convey a respondent’s identity even in the absence of explicit identifying information. We develop methods to capture the unique features of a marked bubble and use machine learning to isolate characteristics indicative of its creator. Using surveys from more than ninety individuals, we apply these techniques and successfully reidentify individuals from markings alone with over 50% accuracy.
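(As a toy illustration of the first attack above, the browser-side probe can be as small as the following sketch; the URL and threshold are made-up placeholders. Modern browsers now partition their HTTP caches by top-level site, which blunts this particular variant, but the broader point about unanticipated side channels stands.)

```typescript
// Time how long a cross-site resource takes to load: a near-instant
// response suggests it was already in the browser cache, i.e. the user
// has visited the site that serves it. URL and threshold are illustrative.
async function probablyVisited(resourceUrl: string): Promise<boolean> {
  const start = performance.now();
  await fetch(resourceUrl, { mode: "no-cors" }); // opaque response, but the timing is still observable
  const elapsed = performance.now() - start;
  return elapsed < 20; // ms; calibrate against a known-uncached baseline
}

// probablyVisited("https://example.com/logo.png").then(console.log);
```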
Those are some interesting papers, thanks for linking.
In the case at hand, I do disagree with your conclusion though.
In this situation, the most a user could find out is who checked them in dialogues. They wouldn’t be able to find any data about checks that don’t concern them.
If they happened to be a capable enough dev and were willing to go through the schleps to obtain that information, then, well… we’re a small team and the world is on fire, and I don’t think we should really be prioritising making Dialogue Matching robust to this kind of adversarial cyber threat for information of comparable scope and sensitivity! Folks with those resources could probably uncover all kinds of private vote data already, if they wanted to.
we’re a small team and the world is on fire, and I don’t think we should really be prioritising making Dialogue Matching robust to this kind of adversarial cyber threat for information of comparable scope and sensitivity!
I agree that it wouldn’t be a very good use of your resources. But there’s a simple solution for that—only use data that’s already public and users have consented to you using. (Or offer an explicit opt-in where that isn’t the case.)
I do agree that in this specific instance, there’s probably little harm in the information being revealed. But I generally also don’t think that that’s the site admin’s call to make, even if I happen to agree with that call in some particular instances. A user may have all kinds of reasons to want to keep some information about themselves private, some of those reasons/kinds of information being very idiosyncratic and hard to know in advance. The only way to respect every user’s preferences for privacy, even the unusual ones, is by letting them control what information is used and not make any of those calls on their behalf.
Hmm, most of these don’t really apply here? Like, it’s not like we are proposing to do anything complicated. We are just saying “throw in some kind of sample function with a 5:1 ratio, ensure you can’t get redraws”. I feel like I can just audit the whole trail of implications myself here (and like, substantially more than I am e.g. capable of auditing all of our libraries for security vulnerabilities, which are sadly reasonably frequent in the JS land we are occupying).
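(For concreteness, here is a minimal sketch of what that kind of sample function could look like; the names, weights, and hashing scheme are illustrative assumptions rather than the actual Dialogue Matching implementation.)

```typescript
import { createHash } from "crypto";

interface Candidate {
  userId: string;
  checkedViewer: boolean; // did this person check the viewer?
}

// Deterministic pseudo-random number in [0, 1) derived from the
// (viewer, candidate) pair, so reloading the page cannot re-roll the draw.
function stableRand(viewerId: string, candidateId: string): number {
  const digest = createHash("sha256")
    .update(`${viewerId}:${candidateId}`)
    .digest();
  return digest.readUIntBE(0, 6) / 2 ** 48;
}

// Show non-checkers at a base rate and checkers at five times that rate,
// so appearing on the match page is at most a 5:1 likelihood ratio in
// favour of "this person checked you".
function sampleMatches(viewerId: string, candidates: Candidate[]): Candidate[] {
  const BASE_RATE = 0.1; // assumed inclusion rate for non-checkers
  return candidates.filter((c) => {
    const p = c.checkedViewer ? Math.min(1, 5 * BASE_RATE) : BASE_RATE;
    return stableRand(viewerId, c.userId) < p;
  });
}
```

Keying the randomness to the pair of users rather than to the request is what implements “ensure you can’t get redraws”: refreshing the page a hundred times yields the same sample and leaks no additional information.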
My point is less about the individual example than the overall decision algorithm. Even if you’re correct that in this specific instance, you can verify the whole trail of implications and be certain that nothing bad happens, a general policy of “figure it out on a case-by-case basis and only do it when it feels safe” means that you’re probably going to make a mistake eventually, given how easy it is to make a mistake in this domain.
I am not sure what the alternative is. What decision-making algorithm do you suggest for adopting new server-side libraries that might have security vulnerabilities, or for switching existing ones? My only algorithm is “figure it out on a case-by-case basis and only do it when it feels safe”. What’s the alternative?
For site libraries, there is indeed no alternative since you have to use some libraries to get anything done, so there you do have to do it on a case-by-case basis. In the case of exposing user data, there is an alternative—limiting yourself to only public data. (See also my reply to jacobjacob.)