Standard statistics is just primitive machine learning. Any question you have about data is best answered by a well-structured machine learning system/market, not old-school statistics. This is especially important for big, complex health questions.
To make this more concrete:
First, a huge amount of health data is collected. Really, all of the health data we have should be put into public repositories (anonymized and protected by the government). The government should then fund large-scale prediction contests based on this data (Kaggle is one implementation; prediction markets are a more advanced form). For example: predicting how many people in demographic cluster X receiving intervention Y will be diagnosed with breast cancer in the next year, or the next month, etc.
The models that make the best predictions can then be used to predict farther into the future, to predict the outcomes of new interventions, and so on.
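To make the contest mechanics a bit more concrete, here is a minimal sketch of how entries in such a contest might be scored, assuming each entry submits a probability of diagnosis per person; the scoring rule, names, and toy numbers are my own illustrative assumptions, not part of the proposal itself:

```python
# Minimal sketch of a prediction-contest scorer (illustrative only).
import math

def log_loss(predicted_probs, outcomes):
    """Mean negative log-likelihood of binary outcomes under the
    submitted probabilities. Log loss is a proper scoring rule, so
    entrants maximize their expected score by reporting honest,
    well-calibrated probabilities."""
    eps = 1e-15  # clamp to avoid log(0) on overconfident entries
    total = 0.0
    for p, y in zip(predicted_probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(outcomes)

# Toy usage: two hypothetical entries scored against observed outcomes.
outcomes = [1, 0, 0, 1, 0]              # 1 = diagnosis occurred that year
calibrated = [0.9, 0.2, 0.1, 0.8, 0.3]  # close to the truth
overconfident = [0.99, 0.01, 0.6, 0.4, 0.5]

print(log_loss(calibrated, outcomes))     # lower (better) score
print(log_loss(overconfident, outcomes))  # penalized for miscalibration
```

The same proper-scoring-rule property is what makes prediction markets attractive here: participants get paid for calibration, not confidence.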
First, a huge amount of health data is collected. Really, all of the health data we have should be put into public repositories (anonymized and protected by the government).
One problem with that idea is that it is really hard to effectively anonymize a large data set.

That article illustrates a common failure mode in arguments that something is impossible or really hard: simply listing a few examples of previous failed attempts to solve the problem.
Every attempt at manned flight failed until the first success, etc.
The examples they list make trivial errors. The health data obviously doesn’t need to contain any bits of someone’s SSN, phone number, or their exact address. Those are just obvious failures. Also, in general you can achieve arbitrarily strong anonymization by averaging samples: cluster individuals into groups of N based on similar demographics, and average all data across each cluster. This doesn’t necessarily reduce the data’s effectiveness much, as it tends to average out noise. It’s related to batch methods in machine learning, which average across a number of samples before each gradient update step. All the big ANN systems are trained with batch averaging over hundreds of samples or more.
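A minimal sketch of that cluster-and-average scheme, assuming simple tabular records; the column names and the minimum cluster size N are illustrative assumptions. This is essentially the aggregation behind k-anonymity-style releases: only group statistics leave the repository, never individual rows.

```python
# Sketch of cluster-and-average anonymization (illustrative only).
import pandas as pd

N = 3  # hypothetical minimum cluster size for release

# Toy individual-level records; a real repository would hold far more.
df = pd.DataFrame({
    "age_band":  ["40-49", "40-49", "40-49", "50-59", "50-59"],
    "region":    ["north", "north", "north", "south", "south"],
    "biomarker": [1.2, 1.4, 1.1, 2.0, 2.2],
    "diagnosed": [0, 0, 1, 1, 0],
})

# Group individuals by shared demographics, then compute per-cluster
# averages and counts; individual rows are never published.
clusters = df.groupby(["age_band", "region"]).agg(
    n=("diagnosed", "size"),
    biomarker_mean=("biomarker", "mean"),
    diagnosis_rate=("diagnosed", "mean"),
)

# Suppress clusters smaller than N, since small groups are the
# easiest to re-identify.
released = clusters[clusters["n"] >= N]
print(released)  # here the 2-person "south" cluster is withheld
```

The suppression step is this sketch's version of "groups of N": every released number describes at least N people.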
The examples they list make trivial errors. The health data obviously doesn’t need to contain any bits of someone’s SSN, phone number, or their exact address.
Those trivial errors weren’t the errors actually made in those examples, though.
In general, anonymisation is hard. Hard enough that lawmakers have an interest in regulating it, and hard enough that it is difficult to regulate well.
At the moment it seems like Google, Microsoft, and Apple are all building their own health data stores and putting their own machine learning people on the problem.
Every attempt at manned flight failed until the first success, etc.
This is true. And to follow your example: just as some failed attempts at manned flight resulted in serious injury or death for the pilot, some failed attempts to anonymize a large data set that is then released to the public will cause an irreversible loss of privacy for thousands or millions of people.
I never claimed that the article proved effective anonymization impossible, just as (I assume) you are not claiming that it has been proven possible. My claim is that we need to balance the benefits of such data releases against the risk to the privacy of the people whose data are released.
You mentioned a potentially effective way to strengthen the anonymization:
in general you can achieve arbitrarily strong anonymization by averaging samples: cluster individuals into groups of N based on similar demographics, and average all data across each cluster. This doesn’t necessarily reduce the data’s effectiveness much
In my experience, when you cluster and average data, you have to make some assumptions about the sorts of questions researchers will be trying to answer. Even if machine learning tends to average across batches, the decision about how to cluster the data is usually a function of the kinds of questions you are trying to answer with the data. It seems to me raw data is more useful than clustered, averaged data, because it has not presupposed the types of questions that will be asked. That said, averaging may be necessary if we are to release anonymized data safely, but it is a tradeoff. This tradeoff between safety and usefulness is one point that the Ars Technica article was making.
Yes, agreed with just about all of that.

Even if machine learning tends to average across batches, the decision about how to cluster the data is usually a function of the kinds of questions you are trying to answer with the data. It seems to me raw data is more useful than clustered, averaged data, because it has not presupposed the types of questions that will be asked.
Yes, there is probably a fundamental information tradeoff between anonymization and data effectiveness, but it isn’t clear that this will be much of a limiter in practice (see the toy sketch below).
Secondly, people should be able to opt in to various levels of anonymization risk, and perhaps that could be tied to financial incentives, so that you can effectively sell your data to some degree.
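To put a toy number on that tradeoff (synthetic data, my own illustration, not a claim about real health records): the sketch below averages individuals into clusters of size n before "release" and then re-estimates a known exposure-outcome effect from the cluster means alone. The effect stays recoverable at moderate n, which is the "not much of a limiter" intuition, but at very large n so few rows remain that the estimate gets noisy.

```python
# Toy simulation of the anonymization/effectiveness tradeoff
# (synthetic data; illustrative only).
import random
import statistics

random.seed(0)

# Synthetic individuals: outcome = 0.5 * exposure + noise.
exposures = [random.gauss(0, 1) for _ in range(10_000)]
outcomes = [0.5 * x + random.gauss(0, 1) for x in exposures]

def slope_from_cluster_means(n):
    """Estimate the exposure->outcome slope using only per-cluster
    means of size n, as a cluster-averaged release would provide."""
    xs = [statistics.mean(exposures[i:i + n]) for i in range(0, len(exposures), n)]
    ys = [statistics.mean(outcomes[i:i + n]) for i in range(0, len(outcomes), n)]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var  # ordinary least-squares slope on cluster means

for n in (1, 10, 100, 1000):
    # True slope is 0.5; estimates stay close for small and moderate n
    # but wobble once only a handful of cluster rows remain.
    print(n, round(slope_from_cluster_means(n), 3))
```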