That article illustrates a common failure mode: trying to prove that something is impossible or really hard simply by listing a few examples of previous failed attempts to solve the problem.
Every attempt at manned flight failed until the first success, etc.
The examples they list make trivial errors. The health data obviously doesn’t need to contain any bits of someone’s SSN, phone number, or their exact address. Those are just obvious failures. Also, in general you can achieve arbitrary anonymization success by averaging samples: you cluster individuals into groups of N based on similar demographics, and average all data across each cluster. This doesn’t necessarily reduce the data’s effectiveness that much, as it tends to average out noise. It’s related to batch methods in machine learning, which average across a number of samples before doing any inference steps. All the big ANN systems are trained with batch averaging over hundreds of samples or more.
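To make the cluster-and-average idea concrete, here is a minimal sketch in Python. The function name `microaggregate`, its parameters, and the record layout are all hypothetical, chosen only for illustration. It sorts records by one demographic key as a crude stand-in for "similar demographics", and it illustrates the mechanics only; it is not a guarantee of anonymity.

```python
from statistics import mean


def microaggregate(records, key, value_fields, n):
    """Group records of similar demographics and release only group averages.

    records      -- list of dicts, one per individual (hypothetical layout)
    key          -- function mapping a record to a sortable demographic value
    value_fields -- names of the numeric fields to average
    n            -- minimum number of individuals per released group
    """
    if n < 1 or len(records) < n:
        raise ValueError("need at least n records to form one group")

    # Sorting by the demographic key puts similar individuals next to each other.
    ordered = sorted(records, key=key)
    groups = [ordered[i:i + n] for i in range(0, len(ordered), n)]

    # A short trailing group is merged into its neighbour so that
    # no released group contains fewer than n individuals.
    if len(groups) > 1 and len(groups[-1]) < n:
        tail = groups.pop()
        groups[-1].extend(tail)

    # Each group is released as one averaged record plus its size.
    released = []
    for group in groups:
        row = {"group_size": len(group)}
        for field in value_fields:
            row[field] = mean(r[field] for r in group)
        released.append(row)
    return released
```

Calling it with `key=lambda r: r["age"]` and `n=2` on five records would release two rows, one averaging two people and one averaging three, with no individual values left in the output.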
The examples they list make trivial errors. The health data obviously doesn’t need to contain any bits of someone’s SSN, phone number, or their exact address.
Those trivial errors weren’t made by them.
In general, anonymisation is hard. Hard enough that lawmakers have an interest in regulating it, and that difficulty is also what makes it hard to regulate.
At the moment it seems like Google, Microsoft and Apple are all building their own health data stores, and then putting their own machine learning people on the problem.
Every attempt at manned flight failed until the first success, etc.
This is true. And to follow your example: just as some failed attempts at manned flight resulted in serious injury or death of the pilot, some failed attempts to anonymize a large data set that is then released to the public will cause an irreversible loss of privacy to thousands or millions of people.
I never claimed that the article had proven effective anonymization impossible, just as (I don’t think) you are claiming that it has been proven possible. My claim is that we need to balance the benefits of such data releases against the risk to the privacy of the people whose data are released.
You mentioned a potentially effective way to strengthen the anonymization:
in general you can achieve arbitrary anonymization success by averaging samples—so you cluster individuals into groups of N based on similar demographics, and average all data across the sample cluster. This actually doesn’t necessarily reduce the data’s effectiveness that much
In my experience, when you cluster and average data, you have to make some assumptions about the sorts of questions researchers will be trying to answer. Even if machine learning tends to average across batches, the decision about how to cluster the data is usually a function of the kinds of questions you are trying to answer with the data. It seems to me raw data is more useful than clustered, averaged data, because it has not presupposed the types of questions that will be asked. That said, averaging may be necessary if we are to release anonymized data safely, but it is a tradeoff. This tradeoff between safety and usefulness is one point that the Ars Technica article was making.
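One way to see what is given up is the standard split of total variance into a between-group part and a within-group part. Released group averages can reproduce only the between-group part; everything that varies inside a group is gone. The sketch below assumes a hypothetical layout (a dict mapping a group label to the raw values in that group), and the function name `variance_split` is made up for illustration.

```python
from statistics import mean, pvariance


def variance_split(values_by_group):
    """Return (total, between, within) population variances.

    values_by_group -- dict mapping a group label to the list of raw values
                       in that group (hypothetical layout for illustration)
    """
    all_values = [x for values in values_by_group.values() for x in values]
    total = pvariance(all_values)
    grand_mean = mean(all_values)

    # Variance that survives averaging: how far each group mean sits from
    # the overall mean, weighted by group size.
    between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in values_by_group.values()
    ) / len(all_values)

    # Variance that averaging discards: spread of individuals inside groups.
    within = total - between
    return total, between, within
```

If the question a researcher cares about lives mostly in the within-group part (for example, an effect that differs between people of the same age band), a release clustered by age band cannot answer it, which is the sense in which the choice of clustering presupposes the questions.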
Even if machine learning tends to average across batches, the decision about how to cluster the data is usually a function of the kinds of questions you are trying to answer with the data. It seems to me raw data is more useful than clustered, averaged data, because it has not presupposed the types of questions that will be asked.
Yes, there is probably a fundamental information tradeoff between anonymization and data effectiveness, but it isn’t clear that this will be much of a limiter in practice.
Secondly, people should be able to opt in to various levels of anonymization risk, and perhaps that could be tied to financial incentives, so that you can effectively sell your data to some degree.
Yes, agreed with just about all of that.