I read a bit of what you previously wrote about your approach, but I didn’t read your full book.
I think a bunch of Quantified Self applications would profit from good compression. It’s relatively interesting, for example, to sample galvanic skin response at very short intervals of 5 ms. The same goes for accelerometer data. It would also be interesting to see what you can extract from the noisy heart-rate data on smartwatches at shorter sampling intervals.
Smartwatches could easily gather that data at finer time resolution than they currently do, but they have relatively limited storage.
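As a rough illustration of the storage pressure (my own back-of-the-envelope numbers, not anything stated in the thread), here is what a single channel sampled every 5 ms adds up to per day, assuming 2-byte samples:

```python
# Back-of-the-envelope storage estimate for one high-rate sensor channel.
# The figures (5 ms sampling, 2 bytes per sample) are illustrative assumptions.

SAMPLE_INTERVAL_S = 0.005        # one galvanic-skin-response reading every 5 ms (200 Hz)
BYTES_PER_SAMPLE = 2             # e.g. a 16-bit ADC value
SECONDS_PER_DAY = 24 * 60 * 60

samples_per_day = SECONDS_PER_DAY / SAMPLE_INTERVAL_S
megabytes_per_day = samples_per_day * BYTES_PER_SAMPLE / 1e6

print(f"{samples_per_day:,.0f} samples/day")            # ~17,280,000
print(f"{megabytes_per_day:.1f} MB/day uncompressed")   # ~34.6 MB per channel
# A 3-axis accelerometer at the same rate would be roughly three times that,
# which is why watches ship summaries like "steps" instead of raw streams.
```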
In practice, I think it will depend a lot on how easy your software is to use.
Maybe you could also have a gamified version: you have a website, and every week a dataset gets published. Only half of the data is released. Every participant can enter their own model via the website, and the person whose model compresses the unreleased part of the data best wins.
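To make the tournament idea concrete, one possible scoring rule (my sketch, not something specified above) is to treat each submitted model as a probability distribution over the next data point and charge it the codelength, in bits, of the unreleased half; the smallest total codelength corresponds to the best compression and wins:

```python
import math
from typing import Callable, Sequence

# Hypothetical scoring rule for the tournament sketched above: a "model" is any
# function that, given the history so far, returns a probability for the next
# symbol. Its score is the total codelength (in bits) of the held-out data,
# i.e. roughly the size an arithmetic coder driven by that model would achieve.

def codelength_bits(model: Callable[[Sequence[int], int], float],
                    holdout: Sequence[int]) -> float:
    total = 0.0
    for i, symbol in enumerate(holdout):
        p = model(holdout[:i], symbol)          # P(next symbol | history)
        total += -math.log2(max(p, 1e-12))      # guard against zero probabilities
    return total

# Toy example: a uniform model over bytes vs. a model that knows the data is
# mostly zeros. Lower codelength = better compression = tournament winner.
data = [0, 0, 0, 1, 0, 0, 2, 0]
uniform = lambda hist, s: 1 / 256
sparse = lambda hist, s: 0.9 if s == 0 else 0.1 / 255

print(codelength_bits(uniform, data))   # 64.0 bits
print(codelength_bits(sparse, data))    # ~23.5 bits
```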
Thanks for the feedback.
A couple years ago (wow, is LessWrong really that old?) I challenged people to Compress the GSS, but nobody accepted the offer…
The minimum time investment to participate in the GSS challenge is probably hours. For most people it’s not even clear what steps are involved in building a model for compressing a dataset, and it’s not really gamified. I think it would be possible to have a website that lets people put together a model in a minute and take part in the tournament.
A one-minute model might be bad, but it might get people into the mood to engage with the game.
I also think that a QS dataset might be more interesting than compressing the GSS. As for promotion, it could be promoted via the QS website (I might still have posting privileges, or could simply ask; I doubt people would have a problem with it).
Of course, it might be that I misunderstand the issue and it’s not possible to build the website in a way that lets people provide one-minute models.
I dunno if it would be all that interesting. If someone wants to work on predictive modeling of datasets every week or month in a tournament format, they can just use Kaggle (and win with XGBoost or a residual network, likely). I have fat/muscle/weight data on myself from an Omron scale going back 2 years with multiple measurements on most days; this is a reasonably interesting dataset because one can quantify measurement error, the variables are interrelated with one or two latent variables, there are definite nontrivial time trends, and it’s easy to generate holdout data (if the tournament runs 1 month, then there’s an additional 1 month of data which no one, including the organizer, had access to, with which to score contributions at the end) - but I doubt anyone would bother participating. I have an even bigger QS dataset incorporating all my recorded data of all kinds at a daily granularity, somewhere around 100+ summary variables, but the missingness is so high that it would be unpleasant to work with (I’ve been having a great deal of difficulty just getting lavaan/blavaan to run on it), and likewise I doubt there would be much interest in a competition. There needs to be some sort of incentive: either prizes, inherently interesting data, or some important intellectual/scientific point to it. Kaggles with a lot of participation have big prizes or sexy datasets like the Higgs boson or whales.
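The time-based holdout described above is straightforward to operationalize. A minimal sketch, assuming a date-indexed CSV (the column name and cutoff date are made up for illustration): rows recorded after the tournament deadline did not exist when entries were submitted, so not even the organizer can overfit to them.

```python
import csv
from datetime import date

DEADLINE = date(2016, 9, 1)   # hypothetical end of the one-month tournament

def split_by_date(path: str):
    """Split rows into the public training set and the post-deadline scoring set."""
    public, holdout = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            d = date.fromisoformat(row["date"])   # "date" column is assumed
            (public if d <= DEADLINE else holdout).append(row)
    return public, holdout

# public rows go to participants; holdout rows are scored only after the
# tournament closes.
```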
I think there is a scientific point for those QS datasets that can be measured automatically at a high level of granularity. Very frequently people measure less data because they don’t want to store all the data that a single sensor can produce.
Currently, accelerometer data gets compressed into the variable “steps”. That variable has the advantage of an intuitive meaning, but it’s likely not the best possible variable to gather when doing scientific work on, say, how Pokémon Go leads people to do more exercise.
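To make the “steps as lossy compression” point concrete, a deliberately naive step counter might look like the sketch below (my illustration only; the threshold and method are not how any particular device works): the 3-axis signal is collapsed to its magnitude and reduced to a single count, and everything else about the raw signal is thrown away.

```python
import math

# Naive step counter: collapse a 3-axis accelerometer stream into one integer.
# This is an extreme lossy compression - gait, intensity, tremor, and timing
# detail are all discarded. The threshold and units are illustrative.

def count_steps(samples, threshold=11.0):
    """samples: iterable of (ax, ay, az) in m/s^2; counts upward threshold crossings."""
    steps = 0
    above = False
    for ax, ay, az in samples:
        magnitude = math.sqrt(ax * ax + ay * ay + az * az)
        if magnitude > threshold and not above:
            steps += 1                    # rising edge -> count one step
        above = magnitude > threshold
    return steps

# A day of 3-axis samples at 50 Hz is ~13 million numbers; the output is one int.
```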
Doesn’t that have as much to do with battery life and software engineering effort as anything? Those sensors could already log data in much more detail by streaming into an off-the-shelf compressor like xz, but they don’t, because good compression inherently requires a lot of computation/battery-life and adds complexity compared to naive methods. There don’t seem to be many use-cases where people have already plugged in zpaq but that just isn’t enough and they need even better compression.
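For what it’s worth, the “stream into an off-the-shelf compressor” option is only a few lines; Python’s standard-library lzma module implements the same LZMA algorithm that xz uses. The read_sample function below is a hypothetical placeholder for whatever a sensor driver would provide:

```python
import lzma
import struct

# Sketch of streaming raw sensor samples through xz-style LZMA compression,
# using the standard-library incremental compressor API.

def log_compressed(read_sample, out_path: str, n_samples: int) -> None:
    compressor = lzma.LZMACompressor(preset=6)
    with open(out_path, "wb") as out:
        for _ in range(n_samples):
            value = read_sample()                    # e.g. a 16-bit ADC reading
            chunk = struct.pack("<h", value)         # little-endian int16
            out.write(compressor.compress(chunk))    # may buffer internally
        out.write(compressor.flush())                # emit any remaining data

# The catch, as noted above, is that this costs CPU and therefore battery;
# the compression itself is the easy part.
```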
I think translating accelerometer data into steps is effectively a form of data compression. But it’s a form of compression that isn’t optimized for leaving the important features of the data intact; it’s about giving users a variable they think they understand.