I have some software I am thinking about packaging up and releasing as open-source, but I’d like to gauge how interesting it is to people other than me.
The software is a highly usable implementation of arithmetic encoding. AE completely handles the problem of encoding, so in order to build a custom compressor for some data set, all you have to do is supply a probability model for the data type(s) you are compressing (I call this “BYOM”—Bring Your Own Model).
One of the key technical difficulties of data compression is that you need to keep the encoder and decoder in exact sync, or the whole procedure goes entirely off the rails. This problem is especially acute for the use case of AE, where you are potentially changing the model in response to every event. My software makes it very easy to guarantee that the sender/receiver are in sync, and at the same time it reduces the amount of code you have to write (basically you don’t write a separate encoder and decoder, you just write one class that is used for both, depending on the configuration).
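To give a rough feel for the BYOM idea (this is just an illustrative sketch, not the library’s actual API; all names here are made up), the user-supplied part could be as small as an interface plus an adaptive counting class:

    // Hypothetical sketch of a "Bring Your Own Model" interface for an
    // adaptive arithmetic coder; names are illustrative, not the real API.
    public interface SymbolModel {
        int alphabetSize();              // number of distinct symbols
        int frequencyOf(int symbol);     // current (unnormalized) count for a symbol
        int totalFrequency();            // sum of all counts
        void observe(int symbol);        // adapt the model after each coded event
    }

    // A simple adaptive order-0 byte model: start every count at 1 (so no
    // symbol has probability zero) and bump each symbol's count as it is seen.
    public class AdaptiveByteModel implements SymbolModel {
        private final int[] counts = new int[256];
        private int total = 256;

        public AdaptiveByteModel() {
            java.util.Arrays.fill(counts, 1);
        }

        @Override public int alphabetSize()     { return 256; }
        @Override public int frequencyOf(int s) { return counts[s]; }
        @Override public int totalFrequency()   { return total; }
        @Override public void observe(int s)    { counts[s]++; total++; }
    }

The coder itself would derive each symbol’s interval from frequencyOf/totalFrequency; the only thing the user writes is the model.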
Summary after your explanations: Yes, I’m really interested in an open-source packaging of your library. Preferably on github.
I read a bit of what you previously wrote about your approach but I didn’t read your full book.
I think a bunch of Quantified Self applications would profit from good compression. For example, it’s relatively interesting to sample galvanic skin response at very short intervals of 5 ms. The same goes for accelerometer data. It would also be interesting to see what you can get out of the noisy heart-rate data on smartwatches at shorter sampling intervals.
Smartwatches could easily gather that data at finer time resolution than they currently do, but they have relatively limited storage.
In practice I think it will depend a lot on how easy it is to use your software.
Maybe you could also have a gamified version. You have a website, and every week a dataset gets published. Only half of the data is released. Every participant can enter their own model via the website, and the person whose model compresses the unreleased part of the data the best wins.
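To make the scoring concrete: under an ideal entropy coder, a model’s compressed size on the held-out data is just the sum of -log2 of the probabilities it assigns, so a tournament backend would only need something like the following sketch (all names hypothetical) to rank submissions:

    // Hypothetical scoring sketch for a compression tournament: the entry
    // that assigns higher probability to the held-out data gets a shorter
    // (ideal) code length, measured in bits.
    public final class TournamentScoring {

        /** A submitted model: the probability it assigns to the next value,
         *  given everything seen so far. */
        public interface Entry {
            double probabilityOfNext(int[] history, int next);
        }

        /** Ideal code length in bits of the held-out data under this entry. */
        public static double codeLengthBits(Entry entry, int[] heldOut) {
            double bits = 0.0;
            for (int i = 0; i < heldOut.length; i++) {
                int[] history = java.util.Arrays.copyOfRange(heldOut, 0, i);
                double p = entry.probabilityOfNext(history, heldOut[i]);
                bits += -Math.log(p) / Math.log(2.0);   // -log2(p)
            }
            return bits;   // smaller is better
        }
    }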
Thanks for the feedback.
A couple years ago (wow, is LessWrong really that old?) I challenged people to Compress the GSS, but nobody accepted the offer…
The minimum time investment to participate in the GSS challenge is probably hours. For most people it’s not even clear what steps are involved in building a model for compressing a dataset. It’s not really gamified. I think it would be possible to have a website that lets people put together a model in a minute to take part in the tournament.
A one-minute model might be bad, but it might get people in the mood to engage with the game.
I also think that a QS dataset might be more interesting than compressing the GSS. Promotion-wise, it could be promoted via the QS website (I might still have posting privileges, or could simply ask; I doubt people would have a problem with it).
Of course it might be that I misunderstand the issue and it’s not possible to build the website in a way that allows people to provide one-minute models.
I dunno if it would be all that interesting. If someone wants to work on predictive modeling of datasets every week or month in a tournament format, they can just use Kaggle (and win with XGBoost or a residual network, likely). I have fat/muscle/weight data on myself from an Omron scale going back 2 years with multiple measurements on most days; this is a reasonably interesting dataset because one can quantify measurement error, the variables are interrelated with one or two latent variables, there are definite nontrivial time trends, and it’s easy to generate held-out data (if the tournament runs 1 month, then there’s an additional 1 month of data which no one, including the organizer, had access to, for scoring contributions at the end) - but I doubt anyone would bother participating. I have an even bigger QS dataset incorporating all my recorded data of all kinds at a daily granularity, somewhere around 100+ summary variables, but the missingness is so high that it would be unpleasant to work with (I’ve been having a great deal of difficulty just getting lavaan/blavaan to run on it), and likewise I doubt there would be much interest in a competition.

There needs to be some sort of incentive: either prizes, inherently interesting data, or some important intellectual/scientific point to it. Kaggles with a lot of participation have big prizes or sexy datasets like the Higgs boson or whales.
I think there is a scientific point for those QS datasets that can be measured automatically at high granularity. Very frequently people record less data because they don’t want to store everything a single sensor can produce.
Currently, accelerometer data gets compressed into the variable “steps”. That variable has the advantage of an intuitive meaning, but it’s likely not the best possible variable to gather when doing scientific work on how Pokémon Go leads people to exercise more.
Doesn’t that have as much to do with battery life and software engineering effort as anything? Those sensors could already log data in much more detail by streaming into an off-the-shelf compressor like xz, but they don’t, because good compression inherently requires a lot of computation/battery life and adds complexity compared to naive methods. There don’t seem to be many use cases where people have already plugged in zpaq but that just isn’t enough and they need even better compression.

I think translating accelerometer data into steps is effectively a form of data compression. But it’s a form of compression that isn’t optimized for leaving important features of the data intact; it’s about giving users a variable they think they understand.
Is it something like this? http://www.cipr.rpi.edu/research/SPIHT/EW_Code/FastAC_Readme.pdf
Thanks for posting this link, it contains a good illustration of the problem of using separate encoder/decoder implementations.
See how they have separate encoder/decoder implementations on pages 8-9 of the document? That strategy is very, very error-prone. It is very hard for the programmer to ensure that the encoder and decoder are performing exactly the same updates, and even the slightest off-by-one error will cause the process to fail completely (I spent many hours trying to debug sync problems like this). This problem becomes more painful as you attempt to build more and more sophisticated compressors.
With my library, there is no separation of encoder and decoder logic; it is effectively the same code. That basically guarantees there will be no sync problems. Since I developed this technique I haven’t had any sync problems.
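As a rough illustration of that single-codepath idea (my own sketch, not the library’s actual design; the Coder interface and method names are hypothetical): the per-symbol step and the model update live in one place, and only the innermost call switches on whether we are encoding or decoding, so the model is guaranteed to see the identical sequence of updates on both sides.

    // Hypothetical sketch: one class drives both encoding and decoding, so
    // the model receives exactly the same updates on both sides.
    public class SymbolCodec {
        public enum Mode { ENCODE, DECODE }

        /** Stand-in for the underlying arithmetic coder. */
        public interface Coder {
            void encode(int symbol, SymbolModel model);
            int decode(SymbolModel model);
        }

        private final SymbolModel model;
        private final Mode mode;

        public SymbolCodec(SymbolModel model, Mode mode) {
            this.model = model;
            this.mode = mode;
        }

        /** Processes one symbol. When encoding, 'symbol' is the input value;
         *  when decoding, it is ignored and the decoded symbol is returned. */
        public int step(Coder coder, int symbol) {
            int s;
            if (mode == Mode.ENCODE) {
                coder.encode(symbol, model);
                s = symbol;
            } else {
                s = coder.decode(model);
            }
            model.observe(s);   // identical model update on both sides
            return s;
        }
    }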
Which language?
Java.
Great, I’m interested. Performance-wise it may not be the best option, but for reusability it’s good. I wonder about the overhead of your abstraction.
Thanks for the feedback!
Re: performance, my implementation is not performance-optimized, but in my experience Java is very fast. According to this benchmark, Java is only about 2x slower than pure C (also known as “portable assembly”).
Yeah, the benchmark game. But arithmetic coding and the bit twiddling it implies aren’t exactly Java’s strength. On the other hand, in this case the overhead of your in-sync de/encoding abstraction may be decisive.
Just post it on github with no effort. If you start getting pull requests or issues logged, you’ll have your answer.
Two basic questions:
(1) What are the immediate practical applications?
(2) How qualified must the user be? (The “all you have to do is supply a probability model” part is worrying :-/)
Basically, if you have your own dataset that you want to compress with a special-purpose model, you could try doing that. You could try out compression-based tricks for computer vision, like in this paper. You could use it as part of an information theory course if you wanted to show students a real example of compression in practice.
In my view it is quite easy to use, but you still need to be a programmer with some knowledge of stats and information theory.
So it’s more of a library and less of an application?
Yes.