The Mnemosyne data has been lying around for years without anyone analysing it. Going through that data and doing a bit of modeling with it should be easy for anyone who’s looking for a bachelor’s thesis in computer science or is otherwise seeking a project.
It’s a real pain to do, though, because it’s so big. A month after I started, I’m still only halfway through the logs->SQL step.
That sounds like you’re doing one insert per transaction, which is the default way SQL operates. It’s possible to batch multiple inserts together into one transaction.
If I remember right, the data was something on the order of 10GB. I think a computer should be able to do the logs->SQL step in less than a day, provided one doesn’t do one insert per transaction.
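A minimal sketch of what that batching looks like with Python’s sqlite3 module (the log table, column names, and rows here are made up for illustration; the real schema is whatever parse_logs.py creates):

    import sqlite3

    con = sqlite3.connect("mnemosyne.db")  # hypothetical database file
    con.execute("CREATE TABLE IF NOT EXISTS log (user_id TEXT, timestamp INTEGER, grade INTEGER)")

    rows = [("u1", 1357000000, 4), ("u2", 1357000050, 2)]  # stand-ins for parsed log entries

    # Slow: each INSERT gets its own transaction, i.e. a commit (and disk sync) per row.
    # for r in rows:
    #     con.execute("INSERT INTO log VALUES (?, ?, ?)", r)
    #     con.commit()

    # Faster: many INSERTs inside one transaction, committed once.
    with con:  # opens a transaction; commits on success, rolls back on exception
        con.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)

    con.close()

On a spinning disk the per-row sync is usually where most of the import time goes, so committing once per batch rather than once per row is the main lever.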
I believe so, yeah. You can see an old copy of the script at http://github.com/bartosh/pomni/blob/master/mnemosyne/science_server/parse_logs.py (or download the Mnemosyne repo with bzr). My version is slightly different in that I made it a little more efficient by shifting the self.con.commit() call up into the exception handler, which is about as far as my current Python & SQL knowledge goes. I don’t see anything in http://docs.python.org/2/library/sqlite3.html mentioning ‘union’, so I don’t know how to improve the script.

The .bz2 logs are ~4GB; the half-done SQL database is ~18GB, so I infer the final database will be ~36GB.
EDIT: my ultimate solution was to just spend $540 on an SSD, which finished the import process in a day; the final uploaded dataset was 2.8GB compressed and 18GB uncompressed (I’m not sure why it was half the size I expected).
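For anyone redoing an import like this on a spinning disk rather than buying an SSD, SQLite also has pragmas that trade crash-safety for bulk-load speed; this is a standard complement to batching commits, not something discussed above (a generic sketch, not taken from parse_logs.py):

    import sqlite3

    con = sqlite3.connect("mnemosyne.db")  # hypothetical database file
    # Only sensible for a rebuildable bulk load: a crash mid-import can corrupt the
    # database, but it can always be regenerated from the original logs.
    con.execute("PRAGMA synchronous = OFF")      # don't fsync after each commit
    con.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM
    # ... run the batched inserts as above ...
    con.commit()
    con.close()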