Daniel_Burfoot comments on Open thread, Jul. 11 - Jul. 17, 2016

Daniel_Burfoot 14 Jul 2016 21:17 UTC
0 points
Thanks for the feedback.

You have a website, and every week a dataset gets published.

A couple years ago (wow, is LessWrong really that old?) I challenged people to Compress the GSS, but nobody accepted the offer…
- ChristianKl 15 Jul 2016 20:47 UTC
  0 points
  The minimum time investment to participate in the GSS challenge is probably hours. For most people it’s not even clear what steps are involved in building a model for compressing a dataset. It’s not really gamified. I think it would be possible to have a website that lets people put together a model in a minute and take part in the tournament.
  
  A one-minute model might be bad, but it might get people in the mood to engage with the game.
  
  I also think that a QS dataset might be more interesting than compressing the GSS. Promotion-wise, I think it could be promoted via the QS website (I might still have posting privileges, or I could simply ask; I doubt people would have a problem).
  
  Of course it might be that I misunderstand the issue and it’s not possible to build the website in a way that allows people to provide one-minute models.
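For readers unsure what "building a model for compressing a dataset" involves, here is a minimal sketch of what a one-minute model could look like. It assumes the score is the ideal code length, the sum of -log2 p(value) under the model; the column of answers is made up for illustration and is not real GSS data, and the function names are hypothetical.

```python
import math
from collections import Counter

def code_length_bits(column, probs):
    """Ideal code length, in bits, of `column` under the model `probs`."""
    return sum(-math.log2(probs[value]) for value in column)

def uniform_model(column):
    """Baseline: every observed symbol is equally likely."""
    symbols = set(column)
    return {s: 1.0 / len(symbols) for s in symbols}

def frequency_model(column):
    """One-minute model: use the empirical frequency of each symbol."""
    counts = Counter(column)
    total = len(column)
    return {s: c / total for s, c in counts.items()}

# Hypothetical survey-style column (an assumption, not real GSS data).
answers = ["yes"] * 6 + ["no"] * 1 + ["unsure"] * 1

baseline_bits = code_length_bits(answers, uniform_model(answers))
model_bits = code_length_bits(answers, frequency_model(answers))

print(f"uniform baseline: {baseline_bits:.2f} bits")
print(f"frequency model:  {model_bits:.2f} bits")
```

The frequency model needs fewer bits than the uniform baseline because it assigns short codes to common answers. A fair tournament would presumably also have to charge for describing the model itself, which this sketch ignores.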
  - gwern 16 Jul 2016 22:35 UTC
    0 points
    
    I also think that a QS dataset might be more interesting than compressing the GSS. Promotion-wise, I think it could be promoted via the QS website (I might still have posting privileges, or I could simply ask; I doubt people would have a problem).
    
    I dunno if it would be all that interesting. If someone wants to work on predictive modeling of datasets every week or month in a tournament format, they can just use Kaggle (and likely win with XGBoost or a residual network). I have fat/muscle/weight data on myself from an Omron scale going back 2 years with multiple measurements on most days; this is a reasonably interesting dataset because one can quantify measurement error, the variables are interrelated with one or two latent variables, there are definite nontrivial time trends, and it’s easy to generate hold-out data (if the tournament runs 1 month, then there’s an additional 1 month of data which no one, including the organizer, had access to, and which can be used to score contributions at the end) - but I doubt anyone would bother participating. I have an even bigger QS dataset incorporating all my recorded data of all kinds at daily granularity, somewhere around 100+ summary variables, but the missingness is so high that it would be unpleasant to work with (I’ve been having a great deal of difficulty just getting lavaan/blavaan to run on it), and likewise I doubt there would be much interest in a competition. There needs to be some sort of incentive: either prizes, inherently interesting data, or some important intellectual/scientific point to it. Kaggle competitions with a lot of participants have big prizes or sexy datasets like the Higgs boson or whales.
    - ChristianKl 17 Jul 2016 11:05 UTC
      0 points
      
      There needs to be some sort of incentive: either prizes, inherently interesting data, or some important intellectual/scientific point to it
      
      I think there is a scientific point for those QS datasets that can be measured automatically at a high degree of granularity. Very frequently people record less data because they don’t want to store all the data that a single sensor can produce.
      
      Currently accelerometer data gets compressed into the variable of “steps”. That variable has the advantage of an intuitive meaning, but it’s likely not the best possible variable to gather when doing scientific work on how Pokemon Go leads people to exercise more.
      - gwern 17 Jul 2016 16:51 UTC
        0 points
        Doesn’t that have as much to do with battery life and software engineering effort as anything? Those sensors could already log data in much more detail by streaming into an off-the-shelf compressor like xz, but they don’t, because good compression inherently requires a lot of computation/battery life and adds complexity compared to naive methods. There don’t seem to be many use cases where people have already plugged in zpaq, found that it just isn’t enough, and need even better compression.
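To make the "streaming into an off-the-shelf compressor like xz" idea concrete, here is a rough sketch using Python's standard `lzma` module, which writes the xz format. The accelerometer readings are synthetic and the function names are hypothetical; this shows only the shape of the approach, not what any actual device does.

```python
import lzma
import math
import struct

def fake_accelerometer(n_samples):
    """Yield hypothetical (x, y, z) readings that fit in 16-bit integers."""
    for i in range(n_samples):
        x = int(1000 * math.sin(i / 10))
        y = int(500 * math.cos(i / 7))
        z = 980 + (i % 5)
        yield x, y, z

def stream_compress(samples, preset=6):
    """Compress readings incrementally, one record at a time."""
    compressor = lzma.LZMACompressor(format=lzma.FORMAT_XZ, preset=preset)
    raw_size = 0
    chunks = []
    for x, y, z in samples:
        record = struct.pack("<hhh", x, y, z)  # three little-endian int16s
        raw_size += len(record)
        chunks.append(compressor.compress(record))
    chunks.append(compressor.flush())  # emit whatever is still buffered
    return raw_size, b"".join(chunks)

raw_size, blob = stream_compress(fake_accelerometer(5000))
print(f"raw: {raw_size} bytes, xz: {len(blob)} bytes")
```

The `preset` argument (0 to 9) trades CPU time and memory for compression ratio, which is where the battery-life cost mentioned above comes in.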
        - ChristianKl 17 Jul 2016 17:53 UTC
          0 points
          I think translating accelerometer data into steps is effectively a form of data compression. But it’s a form of compression that’s optimized not for leaving important features of the data intact but for giving users a variable they think they understand.
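As a toy illustration of "steps" as lossy compression, the sketch below counts upward threshold crossings in an acceleration-magnitude signal. Both the step detector and the synthetic signals are assumptions made up for this example; real pedometer algorithms are more involved.

```python
import math

def count_steps(magnitudes, threshold=1.2):
    """Count upward crossings of `threshold` (in g) as steps."""
    steps = 0
    above = False
    for m in magnitudes:
        if m > threshold and not above:
            steps += 1
        above = m > threshold
    return steps

def synthetic_walk(n_steps, samples_per_step, amplitude):
    """Hypothetical signal: 1 g baseline plus one sine cycle per step."""
    return [
        1.0 + amplitude * math.sin(2 * math.pi * i / samples_per_step)
        for i in range(n_steps * samples_per_step)
    ]

# Two made-up walks with different intensity and cadence.
brisk = synthetic_walk(n_steps=10, samples_per_step=20, amplitude=0.8)
gentle = synthetic_walk(n_steps=10, samples_per_step=40, amplitude=0.3)

print(count_steps(brisk), count_steps(gentle))
```

Both signals reduce to the same step count even though they differ in amplitude and duration, so the intensity and cadence information is gone once only "steps" is stored.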