The Mnemosyne data has been lying around for years without anyone analysing it. Going through that data and doing a bit of modeling with it should be easy for anyone who’s looking for a bachelor’s thesis in computer science or is otherwise seeking a project.
It’s a real pain to do, though, because it’s so big. A month after I started, I’m still only halfway through the logs->SQL step.
That sounds like you’re doing one insert per transaction, which is the default way SQL operates. It’s possible to batch multiple inserts together into one transaction.
If I remember right, the data was something on the order of 10GB. I think a computer should be able to do the logs->SQL step in less than a day, provided one doesn’t do one insert per transaction.
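A minimal sketch of what that batching looks like with Python’s sqlite3 module (the log table, column names, and rows here are made up for illustration; the real schema is whatever parse_logs.py creates):

    import sqlite3

    con = sqlite3.connect("mnemosyne.db")  # hypothetical database file
    con.execute("CREATE TABLE IF NOT EXISTS log (user_id TEXT, timestamp INTEGER, grade INTEGER)")

    rows = [("u1", 1357000000, 4), ("u2", 1357000050, 2)]  # stand-ins for parsed log entries

    # Slow: each INSERT gets its own transaction, i.e. a commit (and disk sync) per row.
    # for r in rows:
    #     con.execute("INSERT INTO log VALUES (?, ?, ?)", r)
    #     con.commit()

    # Faster: many INSERTs inside one transaction, committed once.
    with con:  # opens a transaction; commits on success, rolls back on exception
        con.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)

    con.close()

On a spinning disk the per-row sync is usually where most of the import time goes, so committing once per batch rather than once per row is the main lever.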
I believe so, yeah. You can see an old copy of the script at http://github.com/bartosh/pomni/blob/master/mnemosyne/science_server/parse_logs.py (or download the Mnemosyne repo with bzr). My version is slightly different in that I made it a little more efficient by shifting the self.con.commit() call up into the exception handler, which is about as far as my current Python & SQL knowledge goes. I don’t see anything in http://docs.python.org/2/library/sqlite3.html mentioning ‘union’, so I don’t know how to improve the script.

The .bz2 logs are ~4GB; the half-done SQL database is ~18GB, so I infer the final database will be ~36GB.
EDIT: my ultimate solution was to just spend $540 on an SSD, which finished the import process in a day; the final uploaded dataset was 2.8GB compressed and 18GB uncompressed (I’m not sure why it was half the size I expected).
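For anyone redoing an import like this on a spinning disk rather than buying an SSD, SQLite also has pragmas that trade crash-safety for bulk-load speed; this is a standard complement to batching commits, not something discussed above (a generic sketch, not taken from parse_logs.py):

    import sqlite3

    con = sqlite3.connect("mnemosyne.db")  # hypothetical database file
    # Only sensible for a rebuildable bulk load: a crash mid-import can corrupt the
    # database, but it can always be regenerated from the original logs.
    con.execute("PRAGMA synchronous = OFF")      # don't fsync after each commit
    con.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM
    # ... run the batched inserts as above ...
    con.commit()
    con.close()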