Research access to large amounts of anonymized patient data.
Take all the data you have, come up with some theory to describe it, build that theory into a lossless data compressor, and invoke it on the data set. Write down the compression rate you achieve, and then try to do better. And better. And better. This goal will force you to systematically improve your understanding of the data.
(Note that transforming a sufficiently well specified statistical model into a lossless data compressor is a solved problem, and the solution is called arithmetic encoding—I can give you my implementation, or you can find one on the web. So what I’m really suggesting is just that you build statistical models of the raw data, and try systematically to improve those models).
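To make the bookkeeping concrete, here is a minimal sketch (not the implementation mentioned above; the patient field and the Laplace-smoothed categorical model are invented for illustration) of how a statistical model already determines a compression rate before you wire it into an actual coder: the ideal code length of the data under the model is just its negative log-probability, so improving the model and shrinking the bit count are the same activity.

```python
import math
from collections import Counter

def code_length_bits(values, probs):
    """Ideal compressed size of a sequence under a model: -sum(log2 p(x))."""
    return -sum(math.log2(probs[v]) for v in values)

def fit_categorical(values, alpha=1.0):
    """Laplace-smoothed categorical model for one discrete column."""
    counts = Counter(values)
    total = len(values) + alpha * len(counts)
    return {v: (c + alpha) / total for v, c in counts.items()}

# Hypothetical toy records standing in for one field of the patient data.
diagnoses = ["flu", "flu", "cold", "flu", "asthma", "cold", "flu", "flu"]

model = fit_categorical(diagnoses)
bits = code_length_bits(diagnoses, model)
print(f"{bits:.1f} bits total, {bits / len(diagnoses):.2f} bits per record")
# A richer model (e.g. conditioning on age or season) should drive this number down.
```

An arithmetic coder would turn those same probabilities into an actual bit stream within a couple of bits of this total, so the negative log-probability is a fair stand-in for the compressed size while you iterate on models.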
One problem is that some data may be fundamentally a distraction, so modeling it is just a waste of time.
But it is very hard to tell ahead of time whether a piece of data is going to be relevant to a downstream analysis. As an example, in my work on text analysis, capitalization takes a lot of effort, out of proportion to how interesting it seems. It is tempting to just throw away the caps information by lowercasing everything. But capitalization actually carries clues that are relevant to parsing and other analysis; in particular, it allows you to identify acronyms, which usually stand for proper nouns.
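A tiny illustration of the capitalization point (an invented example, not the poster's code): even a crude acronym heuristic works on the raw text, and finds nothing once everything has been lowercased.

```python
import re

def find_acronyms(text):
    """Crude heuristic: tokens made of two or more consecutive capital letters."""
    return re.findall(r"\b[A-Z]{2,}\b", text)

text = "The WHO and the CDC published new guidance on MRI scans."
print(find_acronyms(text))          # ['WHO', 'CDC', 'MRI']
print(find_acronyms(text.lower()))  # [] -- the signal is gone after lowercasing
```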
This sounds very much like advice that was given in response to a question in the open thread about how to go about a master's thesis. I can't find it, but I endorse the recommendation. Immerse yourself in the data. Attack it from different angles and try to compress it down as much as possible. The idea behind the advice is that if you understand the mechanics of the process that generated the data, you can regenerate the data from a compact description of that process (imagine an image of a circle encoded as an SVG instead of a bitmap or PNG; there is a quick sketch of this below).
There are two caveats: 1) You can't eliminate noise, of course. 2) You are limited by your data set(s). For the former, you know enough tools to separate the noise from the signal and quantify it. For the latter, you should join in external data sets; your modelling might suggest which ones would improve your compression. E.g. try to link in SNP databases.
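To make the circle example concrete, here is a quick sketch (toy numbers, nothing from the thread): the raw raster of a circle takes tens of kilobytes, a generic compressor that knows nothing about circles gets partway there, and the handful of parameters that actually generated the image, which is roughly what an SVG stores, is a few bytes.

```python
import zlib

# Rasterize a filled circle of radius r centred at (cx, cy) on a size x size grid.
cx, cy, r, size = 100, 100, 60, 200
pixels = bytes(
    1 if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2 else 0
    for y in range(size)
    for x in range(size)
)

raw = len(pixels)                                   # 40000 bytes of bitmap
generic = len(zlib.compress(pixels, 9))             # generic compressor, no model of circles
parametric = len(f"{cx},{cy},{r},{size}".encode())  # the generating process itself

print(raw, generic, parametric)
```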
(Note that transforming a sufficiently well specified statistical model into a lossless data compressor is a solved problem, and the solution is called arithmetic encoding—I can give you my implementation, or you can find one on the web.
The unsolved problems are the ones hiding behind the token “sufficiently well specified statistical model”.
That said, thanks for the pointer to arithmetic encoding, that may be useful in the future.
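For anyone who wants to see what the pointer amounts to, here is a toy arithmetic coder using exact fractions (a sketch only, deliberately simple and far too slow for real use, and not the implementation offered above): the model's probabilities carve [0, 1) into nested intervals, and the width of the final interval determines how many bits the message costs.

```python
from fractions import Fraction

def intervals(model):
    """Assign each symbol its slice of [0, 1) according to the model."""
    out, lo = {}, Fraction(0)
    for sym, p in model.items():
        out[sym] = (lo, lo + Fraction(p))
        lo += Fraction(p)
    return out

def encode(symbols, model):
    """Narrow [0, 1) once per symbol, then emit the shortest binary fraction inside the interval."""
    iv = intervals(model)
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        a, b = iv[s]
        width = high - low
        low, high = low + width * a, low + width * b
    bits, value, step = [], Fraction(0), Fraction(1, 2)
    while value < low:              # value stays below high by construction
        if value + step < high:
            value += step
            bits.append(1)
        else:
            bits.append(0)
        step /= 2
    return bits

def decode(bits, model, n):
    """Replay the same interval narrowing to recover n symbols."""
    iv = intervals(model)
    value = sum(Fraction(1, 2 ** (i + 1)) for i, b in enumerate(bits) if b)
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):
        width = high - low
        for sym, (a, b) in iv.items():
            if low + width * a <= value < low + width * b:
                out.append(sym)
                low, high = low + width * a, low + width * b
                break
    return out

# A skewed model makes frequent symbols cost well under one bit each.
model = {"a": Fraction(7, 10), "b": Fraction(2, 10), "c": Fraction(1, 10)}
message = list("aababacaaaab")
bits = encode(message, model)
assert decode(bits, model, len(message)) == message
print(f"{len(message)} symbols -> {len(bits)} bits")
```

A production coder does the same interval arithmetic with fixed-width integers and streams bits out as it goes; the hard part, as noted above, is supplying the probabilities.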
Would anyone want to literally do this compression exercise on something as complex as patient data?
If not, why not just say: try to come up with the best models you can?
Pick a couple of quantities of interest and try to model them as accurately as you can.
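The two framings meet in the middle, as a sketch with invented fields and base rates can show: the log-loss of a model for a single quantity of interest is exactly the number of bits that model would need to encode that column, so "model a couple of quantities as accurately as you can" is the compression exercise restricted to the columns you care about.

```python
import math
from collections import Counter, defaultdict

def bits(values, prob):
    """Ideal code length in bits: -sum(log2 p(value))."""
    return -sum(math.log2(prob(*v)) for v in values)

# Hypothetical records: (age_band, readmitted) pairs.
records = [("65+", 1), ("65+", 1), ("65+", 0), ("65+", 1),
           ("<65", 0), ("<65", 0), ("<65", 0), ("<65", 1)]

# Marginal model of the quantity of interest, ignoring age.
marginal = Counter(y for _, y in records)
p_marginal = lambda a, y: marginal[y] / len(records)

# Conditional model: readmission rate within each age band.
by_age = defaultdict(Counter)
for a, y in records:
    by_age[a][y] += 1
p_conditional = lambda a, y: by_age[a][y] / sum(by_age[a].values())

print(bits(records, p_marginal))     # 8.0 bits
print(bits(records, p_conditional))  # ~6.5 bits: the better model is the better compressor
```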