We study techniques for identifying an anonymous
author via linguistic stylometry, i.e., comparing the writing
style against a corpus of texts of known authorship. We experimentally
demonstrate the effectiveness of our techniques with
as many as 100,000 candidate authors.
[...]
In experiments where
we match a sample of just 3 blog posts against the rest of the
posts from that blog (mixed in with 100,000 other blogs), the
nearest-neighbor/RLSC combination is able to identify the
correct blog in about 20% of cases; in about 35% of cases,
the correct blog is one of the top 20 guesses. Via confidence
estimation, we can increase precision from 20% to over 80%
with a recall of 50%, which means that we identify 50% of
the blogs overall compared to what we would have if we
always made a guess.
The efficacy of the attack varies based on the number
of labeled and anonymous posts available. Even with just
a single post in the anonymous sample, we can identify
the correct author about 7.5% of the time (without any
confidence estimation). When the number of available posts
in the sample increases to 10, we are able to achieve a 25%
accuracy. Authors with relatively large amounts of content
online (about 40 blog posts) fare worse: they are identified
in over 30% of cases (with only 3 posts in the anonymous
sample).
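The precision/recall trade-off quoted above can be made concrete with a small sketch. Everything here is an assumption for illustration: the excerpt does not say how the confidence score is computed, so `precision_recall_at` simply takes a hypothetical per-guess confidence, and `toy_guesses` is synthetic data chosen to mirror the quoted 20% / 80% / 50% figures, not results from the paper. Recall follows the excerpt's definition: correct identifications kept, relative to what we would get by always guessing.

```python
from typing import List, Tuple

def precision_recall_at(threshold: float,
                        guesses: List[Tuple[float, bool]]) -> Tuple[float, float]:
    """guesses holds one (confidence, was_correct) pair per anonymous sample.

    Only guesses with confidence >= threshold are reported.
    Precision: fraction of reported guesses that are correct.
    Recall: correct reported guesses divided by the correct guesses we
    would have had if we always made a guess (threshold ignored).
    """
    kept = [ok for conf, ok in guesses if conf >= threshold]
    correct_kept = sum(kept)
    correct_all = sum(ok for _, ok in guesses)
    precision = correct_kept / len(kept) if kept else 0.0
    recall = correct_kept / correct_all if correct_all else 0.0
    return precision, recall

# Synthetic toy data (NOT from the paper): 40 guesses, 8 of them correct.
toy_guesses = (
    [(0.90, True)] * 4      # confident and correct
    + [(0.85, False)]       # confident but wrong
    + [(0.30, True)] * 4    # correct, but low confidence
    + [(0.20, False)] * 31  # wrong and low confidence
)

print(precision_recall_at(0.0, toy_guesses))  # always guess
print(precision_recall_at(0.8, toy_guesses))  # only report confident guesses
```

With this toy data, always guessing is right 20% of the time, while reporting only guesses at confidence 0.8 or above lifts precision to 80% at the cost of dropping half of the correct identifications.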
[...]
Further, we
confirmed that our techniques work in a cross-context setting:
in experiments where we match an anonymous blog
against a set of 100,000 blogs, one of which is a different
blog by the same author, the nearest neighbor classifier can
correctly identify the blog by the same author in about 12%
of cases. Finally, we also manually verified that in cross-context
matching we find pairs of blogs that are hard for
humans to match based on topic or writing style; we describe
three such pairs in Appendix A.
The strength of the deanonymization attack we have
presented is only likely to improve over time as better techniques
are developed. Our results thus call into question the
viability of anonymous online speech. Even if the adversary
is unable to identify the author using our methods in a fully
automated fashion, he might be able to identify a few tens
of candidates for manual inspection as we detail in Section
III.
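To illustrate the "few tens of candidates for manual inspection" idea, here is a minimal nearest-neighbor sketch. It is not the paper's method: the feature set (relative frequencies of a short, hypothetical `FUNCTION_WORDS` list), the cosine similarity measure, and the names `features` and `top_candidates` are all assumptions made for this example, since the excerpt does not describe the actual features or normalization.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical, deliberately tiny feature set: common function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]

def features(text: str) -> List[float]:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_candidates(anonymous: str, known: Dict[str, str], k: int = 20) -> List[str]:
    """Rank known authors by similarity to the anonymous text, keep the top k."""
    target = features(anonymous)
    ranked = sorted(known,
                    key=lambda name: cosine(target, features(known[name])),
                    reverse=True)
    return ranked[:k]

# Toy corpus, invented for this example.
known = {
    "blog_a": "the cat and the dog and the bird",
    "blog_b": "it is that it is but it is",
}
print(top_candidates("the fox and the hen", known, k=1))
```

Returning a ranked shortlist instead of a single answer is the point: even when the top guess is wrong, the correct author may still sit among the top k, which is the situation the excerpt describes for the top 20 guesses.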
The difference was one of scale. It is much easier to take roughly three dozen pieces of classical Latin literature, some of them different parts of the same magnum opus, and watch them cluster by author and next to the other parts of the same work. That is more of a “put the pieces into the box” exercise than a 100,000-piece puzzle. In the latter case, you just know that most of the puzzle pieces will show either blue sky or blue sea, both a similar shade of blue.
I thought this was pretty impressive:
[...]
[...]