> Every attempt at manned flight failed until the first success, etc.
This is true. And to continue with your example: just as some failed attempts at manned flight ended in serious injury or death for the pilot, some failed attempts to anonymize a large data set released to the public will cause an irreversible loss of privacy for thousands or millions of people.
I never claimed that the article had proven effective anonymization impossible, just as (I don’t think) you are claiming it has been proven possible. My claim is that we need to balance the benefits of such data releases against the risk to the privacy of the people whose data are released.
You mentioned a potentially effective way to strengthen the anonymization:
> in general you can achieve arbitrary anonymization success by averaging samples—so you cluster individuals into groups of N based on similar demographics, and average all data across the sample cluster. This actually doesn’t necessarily reduce the data’s effectiveness that much
In my experience, when you cluster and average data, you have to make some assumptions about the sorts of questions researchers will be trying to answer. Even if machine learning tends to average across batches, the decision about how to cluster the data is usually a function of the kinds of questions you are trying to answer with the data. It seems to me raw data is more useful than clustered, averaged data, because it has not presupposed the types of questions that will be asked. That said, averaging may be necessary if we are to release anonymized data safely, but it is a tradeoff. This tradeoff between safety and usefulness is one point that the Ars Technica article was making.
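To make that concrete, here is a minimal sketch of the cluster-and-average approach; the DataFrame and its column names (age, sex, region, income, spend) are hypothetical, purely for illustration. Note that the choice of grouping keys already presupposes which questions the released data can answer:

```python
# Minimal sketch of cluster-and-average anonymization (hypothetical
# columns; not anyone's actual pipeline).
import pandas as pd

N = 10  # minimum cluster size; larger N gives stronger anonymity

def cluster_and_average(df: pd.DataFrame) -> pd.DataFrame:
    # Coarsen the quasi-identifier first, since exact ages would make
    # most clusters too small to keep.
    df = df.assign(age_band=(df["age"] // 10) * 10)
    # This choice of keys bakes in the research questions up front.
    keys = ["age_band", "sex", "region"]
    # Drop any cluster with fewer than N members, then release only
    # per-cluster averages of the numeric fields.
    big_enough = df.groupby(keys).filter(lambda g: len(g) >= N)
    return big_enough.groupby(keys)[["income", "spend"]].mean().reset_index()
```

Whether a researcher can still answer their question from that output depends entirely on whether the question varies along the chosen keys; anything else has been averaged away.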
> Even if machine learning tends to average across batches, the decision about how to cluster the data is usually a function of the kinds of questions you are trying to answer with the data. It seems to me raw data is more useful than clustered, averaged data, because it has not presupposed the types of questions that will be asked.
Yes, agreed with just about all of that. There is probably a fundamental information tradeoff between anonymization and data effectiveness, but it isn’t clear that this will be much of a limiter in practice.
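As a toy illustration (entirely synthetic numbers, not from any real release): averaging disjoint groups of N records preserves aggregate statistics, while the record-level variation, which is both the re-identification surface and much of the fine-grained research signal, shrinks roughly as 1/sqrt(N):

```python
# Toy demonstration of the anonymity/utility tradeoff on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=0.5, size=10_000)

for N in (1, 10, 100, 1000):
    # Release only the means of disjoint groups of N individuals.
    usable = incomes[: len(incomes) // N * N]
    released = usable.reshape(-1, N).mean(axis=1)
    print(f"N={N:>4}: rows={len(released):>5}  "
          f"mean={released.mean():9.0f}  std={released.std():8.0f}")
```

The population mean survives at every N; what disappears is exactly the per-record detail an attacker (or a researcher asking an unanticipated question) would need.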
Secondly, people should be able to opt in to various levels of anonymization risk, and perhaps that could be tied to financial incentives, so that you could effectively sell your data to some degree.