A proposal: make public an anonymised dataset of all Karma activity over an undisclosed period of roughly three months, falling somewhere in the past 18 months.
What I would like is a list of anonymised users, a list of posts and comments in the given three-month period (stripped of content and ancestry, but keeping a record of authorship), and all incidents of upvotes and downvotes between them that took place in the given period. This is for purposes of observing trends in Karma behaviour, and also sating my curiosity about how some sort of graph-theoretic-informed equivalent of Karma (kind of like Google PageRank) might work. I would also be curious to see what other data-types might make of it.
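To make the PageRank-style idea a bit more concrete, here is a minimal sketch of what I have in mind. It assumes the dataset has been reduced to (voter, author) pairs for upvotes only; the function name `vote_rank`, the damping factor of 0.85 and the fixed iteration count are my own illustrative choices, not anything LW actually uses, and downvotes are ignored entirely.

```python
from collections import defaultdict


def vote_rank(upvotes, damping=0.85, iterations=50):
    """PageRank-style score by power iteration over an upvote graph.

    upvotes: list of (voter, author) pairs; repeated pairs add weight.
    Returns a dict mapping each user to a score; scores sum to 1.
    """
    users = sorted({user for pair in upvotes for user in pair})
    out_edges = defaultdict(list)
    for voter, author in upvotes:
        out_edges[voter].append(author)

    n = len(users)
    rank = {user: 1.0 / n for user in users}
    for _ in range(iterations):
        # Every user gets a small baseline share regardless of votes.
        new_rank = {user: (1.0 - damping) / n for user in users}
        for user in users:
            targets = out_edges.get(user)
            if targets:
                # A voter splits their own score among the authors they upvoted.
                share = damping * rank[user] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Users who never voted spread their score evenly over everyone.
                share = damping * rank[user] / n
                for other in users:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    votes = [("a", "c"), ("b", "c"), ("c", "a")]
    for user, score in sorted(vote_rank(votes).items()):
        print(user, round(score, 4))
```

The point of the exercise is that an upvote from a highly-rated user counts for more than one from a user nobody upvotes, which plain Karma totals do not capture.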
What good reasons are there for not making this data available?
Someone has to go to the trouble of pulling it from the database
I would personally be prepared to pay up to $13.50 for your time and effort. I would also be surprised if someone hasn’t at least sneaked a peek at this data already, because it’s kind of interesting.
Violation of LW user privacy
The biggie, really. It’s possible that a tenacious individual could use this data to deduce the voting habits of specific users. I’ve been thinking about how I might go about doing this if given the data in question, which informed the “approximate three months at some point in the past eighteen months” time frame. Without timestamps or details of comment ancestry, and without knowing the exact length of the snapshot period, I suspect anyone trying to extract this information would struggle enormously.
I am fascinated by how people would try to accomplish this, though, so please tell me how you’d go about it. My personal method would be to scrape the site to build up a record of post and comment authorship over time. Any given period would then have a “fingerprint” of how many posts each author made, which you could compare against the dataset. This becomes harder, but not impossible, with a time period of unspecified length. This could be mitigated by the data being deliberately sabotaged prior to publication, in such a way that confounds this method while still keeping the broader trends available for analysis.
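A rough sketch of the fingerprint attack, to show how little it needs. It assumes the attacker has scraped (date, username) pairs from the public site and that the released data lists one anonymised author ID per post; `fingerprint`, `distance` and `best_window` are hypothetical names, and the distance measure (sum of absolute differences between sorted per-author counts) is just one simple choice among many.

```python
from collections import Counter
from datetime import date, timedelta


def fingerprint(author_ids):
    """Per-author post counts, sorted descending; identities are discarded."""
    return sorted(Counter(author_ids).values(), reverse=True)


def distance(fp_a, fp_b):
    """Sum of absolute differences after zero-padding the shorter fingerprint."""
    length = max(len(fp_a), len(fp_b))
    a = fp_a + [0] * (length - len(fp_a))
    b = fp_b + [0] * (length - len(fp_b))
    return sum(abs(x - y) for x, y in zip(a, b))


def best_window(scraped, released_authors, window_lengths, step_days=1):
    """Slide windows of each candidate length over the scraped timeline.

    scraped: list of (date, username) pairs from the public site.
    released_authors: one anonymised author ID per post in the dataset.
    Returns (distance, start_date, length_in_days) for the closest window.
    """
    target = fingerprint(released_authors)
    first = min(day for day, _ in scraped)
    last = max(day for day, _ in scraped)
    best = None
    for length in window_lengths:
        start = first
        while start + timedelta(days=length - 1) <= last:
            end = start + timedelta(days=length)
            names = [name for day, name in scraped if start <= day < end]
            score = distance(fingerprint(names), target)
            if best is None or score < best[0]:
                best = (score, start, length)
            start += timedelta(days=step_days)
    return best
```

Once the window is pinned down, matching the sorted counts against scraped usernames de-anonymises the most prolific authors first, which is why deliberately perturbing the counts before release would help.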
Any other concerns people would have with this? Alternatively, any awesome things they’d like to do with the data?
Is the LW database structure available? If yes, you could prepare some SELECT queries and ask admins to run them for you and send you the result.
Anonymization: Replace user ids with “f(id+c)” where “f” is a hash function and “c” is a constant that will be modified by the admin before running your script. Replace times of karma clicks with “ym(time+r)” where “r” is a random value between 0 and 30 days, and “ym” is a function that returns only month and year. Select only data from the most recent year and only from users who were active during the whole year (made at least one vote in the first and last months of the time period). Would such data still be useful to you?
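In code, the two transformations might look something like the sketch below. This is only an illustration under my own assumptions: SHA-256 standing in for “f”, integer ids, output truncated to 16 hex characters, and the hypothetical names `anonymise_id` and `blur_time`. Note that if “c” is a small integer, someone who knows a few real ids could try guessing it, so a long secret value would be the safer choice.

```python
import hashlib
import random
from datetime import datetime, timedelta


def anonymise_id(user_id, c):
    """f(id + c): hash the shifted integer id; c is the admin's secret constant."""
    digest = hashlib.sha256(str(user_id + c).encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for readability (an illustrative choice)


def blur_time(timestamp, rng):
    """ym(time + r): add a random 0-30 day offset, keep only year and month."""
    shifted = timestamp + timedelta(days=rng.uniform(0, 30))
    return shifted.strftime("%Y-%m")


if __name__ == "__main__":
    rng = random.SystemRandom()  # unseeded, so the offsets cannot be replayed
    secret_c = 123456789         # placeholder; the admin would pick their own
    print(anonymise_id(42, secret_c))
    print(blur_time(datetime(2013, 1, 15, 9, 30), rng))
```

The same user always maps to the same hashed id, so the vote relationships survive, while each vote's month is only right to within about a month.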
My day job is DB admin and development. In the unlikely event of LW back-end admin-types being comfortable running a query sent in by some dude off the site, I wouldn’t be comfortable giving it to them. The effort of due diligence on a foreign script is probably greater than that required to put it together.
The data I want correspond to:
the IDs (i.e. primary key, not the username) of all the users
the IDs (PK) and authorship (user ID) of all posts and comments in a contiguous ~3 month period
the adjacency of users and posts as upvotes and downvotes over this period (I assume this is a single junction table)
If I were providing this data, I would also scramble the IDs in some fashion while maintaining the underlying relationships, as consecutive IDs could provide some small clue as to the identity and chronology of users or posts. While this is pretty straightforward, the mechanism for such scrambling should not be known to recipients of the data.
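For illustration only, one way to do the scrambling is a secret random permutation per table, applied consistently to every foreign key. The sketch assumes users are a list of integer IDs, posts are (post_id, author_id) pairs and votes are (voter_id, post_id, direction) triples; those shapes, and the names `build_mapping` and `scramble`, are my guesses rather than the real LW schema.

```python
import random


def build_mapping(ids, rng):
    """Map each original ID to a fresh ID from 1..n in random order."""
    ids = list(ids)
    new_ids = list(range(1, len(ids) + 1))
    rng.shuffle(new_ids)
    return dict(zip(ids, new_ids))


def scramble(users, posts, votes, rng=None):
    """Replace user and post IDs while preserving who wrote and voted on what.

    users: list of user IDs
    posts: list of (post_id, author_id)
    votes: list of (voter_id, post_id, direction) with direction +1 or -1
    """
    rng = rng or random.SystemRandom()  # OS randomness, nothing to reverse-engineer
    user_map = build_mapping(users, rng)
    post_map = build_mapping([post_id for post_id, _ in posts], rng)

    # Sort the output so row order does not leak the original chronology.
    new_users = sorted(user_map.values())
    new_posts = sorted((post_map[pid], user_map[author]) for pid, author in posts)
    new_votes = sorted(
        (user_map[voter], post_map[pid], direction)
        for voter, pid, direction in votes
    )
    return new_users, new_posts, new_votes
```

The mappings are thrown away after the export, so recipients get a graph with the same shape but no way back to the original keys.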