I propose a new version of the Turing test: An AI is as smart as a human when it can figure out which new poster on a message board is actually the same person as an old poster who got banned. :)
This is an old problem, see e.g.:
https://www.cs.berkeley.edu/~russell/papers/ifsa01-identity.pdf
for special cases. Both Google and Facebook are very interested in versions of this problem.
I don’t think this is actually a difficult problem. Some simple machine learning on word frequencies, bigram frequencies, etc. will probably be enough to solve it.
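A minimal sketch of such a baseline, assuming nothing fancier than word/bigram counts and cosine matching (the names and the corpus in the usage note are hypothetical):

    from collections import Counter
    from math import sqrt

    def features(text):
        """Bag of words plus word bigrams: a crude stylometric fingerprint."""
        words = text.lower().split()
        bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
        return Counter(words) + Counter(bigrams)

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norms = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norms if norms else 0.0

    def most_similar_author(anonymous_text, corpus):
        """corpus maps author -> concatenation of their known texts."""
        anon = features(anonymous_text)
        return max(corpus, key=lambda author: cosine(anon, features(corpus[author])))

    # Hypothetical usage:
    #   corpus = {"banned_user": old_posts, "regular_a": posts_a, "regular_b": posts_b}
    #   print(most_similar_author(new_suspicious_post, corpus))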
A friend of mine did it via computational complexity: using gzip (as an approximation of KC) to attribute classical Latin literature to its respective authors, by checking which strings add the least additional complexity (due to shared writing style, word choice, etc.) when compressed together, and then clustering. Worked like a charm.
ETA: These were large bodies of text, however. Probably not gonna work for a bundle of comments, except for me, due to my overuse of “obviously”, obviously.
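One standard way to formalize “adds the least additional complexity when compressed together” is the normalized compression distance of Cilibrasi and Vitányi; below is a minimal gzip sketch of the idea (a reconstruction, not the friend’s actual code; the author names in the usage note are placeholders):

    import gzip

    def C(data: bytes) -> int:
        """Compressed size, standing in for Kolmogorov complexity."""
        return len(gzip.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        """Normalized compression distance: small when x and y share
        structure (style, vocabulary), so compressing them together
        is nearly free."""
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def attribute(anonymous: bytes, candidates: dict) -> str:
        """Pick the known author whose corpus is closest in NCD."""
        return min(candidates, key=lambda name: ncd(anonymous, candidates[name]))

    # Placeholder usage; real inputs would be whole works, not snippets:
    #   authors = {"Caesar": caesar_text, "Cicero": cicero_text}
    #   print(attribute(unknown_text, authors))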
I thought this was pretty impressive:
We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors.

[...]

In experiments where we match a sample of just 3 blog posts against the rest of the posts from that blog (mixed in with 100,000 other blogs), the nearest-neighbor/RLSC combination is able to identify the correct blog in about 20% of cases; in about 35% of cases, the correct blog is one of the top 20 guesses. Via confidence estimation, we can increase precision from 20% to over 80% with a recall of 50%, which means that we identify 50% of the blogs overall compared to what we would have if we always made a guess.

The efficacy of the attack varies based on the number of labeled and anonymous posts available. Even with just a single post in the anonymous sample, we can identify the correct author about 7.5% of the time (without any confidence estimation). When the number of available posts in the sample increases to 10, we are able to achieve a 25% accuracy. Authors with relatively large amounts of content online (about 40 blog posts) fare worse: they are identified in over 30% of cases (with only 3 posts in the anonymous sample).

[...]

Further, we confirmed that our techniques work in a cross-context setting: in experiments where we match an anonymous blog against a set of 100,000 blogs, one of which is a different blog by the same author, the nearest-neighbor classifier can correctly identify the blog by the same author in about 12% of cases. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in Appendix A.

The strength of the deanonymization attack we have presented is only likely to improve over time as better techniques are developed. Our results thus call into question the viability of anonymous online speech. Even if the adversary is unable to identify the author using our methods in a fully automated fashion, he might be able to identify a few tens of candidates for manual inspection as we detail in Section III.
The difference was one of scale. It’s much easier when taking just three dozen(?) pieces of classical Latin literature, some of which were different parts of the same magnum opus, and then seeing them cluster with their respective authors and with the other parts of the same piece. More of a “put the pieces into the box” job as opposed to a 100,000-piece puzzle. In the latter case, you just know most of the puzzle pieces will show either the blue sky or the blue sea, both a similar shade of blue.
Commenting to ‘save’ this comment. That’s a really clever way to handle that.
KC = Kolmogorov Complexity?
Yeap, the paper linked in my other comment explains how it works.
gzip is a “crude upper bound” on KC, and afaik there is no known bound on the error.
EDIT: I’m pretty sure the following result isn’t even true: If KC(x) < KC(y), then gzip(x) < gzip(y). Or even quantitatively weaker variants of the same.
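A concrete way to see why the implication in the EDIT fails: a seeded PRNG stream has tiny KC (the generating program is a few lines) yet is incompressible noise to gzip, while a shorter truly random string almost surely has near-maximal KC:

    import gzip, os, random

    # x: 10,000 bytes from a seeded PRNG. KC(x) is tiny -- a few lines of
    # generator code -- but the bytes are statistically random, so gzip
    # cannot compress them.
    rng = random.Random(0)
    x = bytes(rng.randrange(256) for _ in range(10_000))

    # y: 2,000 bytes from the OS entropy pool. Almost surely KC(y) is
    # close to 2,000, far above KC(x).
    y = os.urandom(2_000)

    print(len(gzip.compress(x)))  # roughly 10,000+: gzip sees only noise
    print(len(gzip.compress(y)))  # roughly 2,000+: smaller, despite larger KC

So KC(x) < KC(y) while gzip(x) > gzip(y), and no amount of tuning gzip removes such cases.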
There can’t be, since KC isn’t computable (though I could be mistaken about that in itself precluding a lower error bound). It’s still kinda nice and works on a similar principle (and also, somewhat, in practice). Let’s plug the Hutter Prize while we’re at it, for the 5 people reading this. Also, just saw a cool paper which kinda describes the same principle, here.
KC may be uncomputable in general, but I’m pretty sure that doesn’t preclude all possible proofs or constructions*, and it seems odd to say that there are no upper bounds when we have what sure look like such things.
* I have just invented a Scheme-like programming language in which all tokens except numbers are 2 alphanumeric characters long; what is the Kolmogorov complexity of the bitstring ‘1’? ‘Who knows, KC is uncomputable -’ Bzzt! The answer is that the KC is exactly 1, since the shortest program which emits the bitstring ‘1’ is a program consisting of the constant ‘1’, which evaluates to ‘1’ and which, as you can see, is indeed of length 1; all other programs emitting the bitstring ‘1’ are longer by definition.
Or if you don’t like that example, consider taking your favorite language and enumerating all possible programs in length order; consider the outputs of the programs you’ve just enumerated when each is evaluated for up to the highest available Busy Beaver number of steps (to avoid nontermination issues), and the respective lengths of those outputs—have you not found the KC for those outputs? If you run gzip over the outputs to get estimated KCs, are those not upper bounds on the actual KCs you proved?
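As a toy illustration of that enumeration (my construction, not anything from the thread): a Brainfuck-style language, programs enumerated in length order, and a fixed step budget standing in for the Busy Beaver bound we cannot actually obtain. Whenever a program halts within budget, the recorded length is a valid upper bound on KC in this language; it is exact only if the budget dominates the relevant halting times.

    import itertools

    OPS = "+-<>.[]"  # a Brainfuck-style toy language

    def run(prog: str, max_steps: int):
        """Run prog for at most max_steps; return its output string,
        or None if it is invalid or was cut off (possible non-termination)."""
        stack, jump = [], {}
        for i, c in enumerate(prog):          # precompute bracket matching
            if c == "[":
                stack.append(i)
            elif c == "]":
                if not stack:
                    return None               # unbalanced: invalid program
                j = stack.pop()
                jump[i], jump[j] = j, i
        if stack:
            return None
        tape, ptr, pc, out = [0] * 256, 0, 0, []   # small circular tape
        for _ in range(max_steps):
            if pc >= len(prog):
                return "".join(out)           # halted normally
            c = prog[pc]
            if c == "+": tape[ptr] = (tape[ptr] + 1) % 256
            elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
            elif c == ">": ptr = (ptr + 1) % 256
            elif c == "<": ptr = (ptr - 1) % 256
            elif c == ".": out.append(chr(tape[ptr]))
            elif c == "[" and tape[ptr] == 0: pc = jump[pc]
            elif c == "]" and tape[ptr] != 0: pc = jump[pc]
            pc += 1
        return None                           # budget exhausted: no verdict

    def kc_table(max_len: int, max_steps: int) -> dict:
        """Shortest program length found per output: an upper bound on KC
        in this language, exact only if max_steps dominates the halting
        times of all programs up to max_len (a Busy-Beaver-sized unknown)."""
        kc = {}
        for n in range(1, max_len + 1):
            for prog in map("".join, itertools.product(OPS, repeat=n)):
                out = run(prog, max_steps)
                if out is not None and out not in kc:
                    kc[out] = n               # first hit is shortest
        return kc

    # e.g. kc_table(3, 1000) maps "\x01" -> 2, via the program "+."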
Computing upper bounds on Kolmogorov complexity is not very difficult: gzip and all the other compression algorithms do it. The difficulty is computing non-trivial lower bounds:
For all programming languages (with a self-terminating encoding), there is a trivial lower bound that doesn’t depend on the string. This bound is at least one token.
But there is also a language-dependent constant L_max, which is the maximum KC (and the maximum lower bound on KC) that you can compute for any string: L_max is the length of the shortest program for which the halting property is uncomputable ( * ) (which makes L_max uncomputable as well).
This implies that you can compute the KC complexity only for a finite number of strings.
( * And doesn’t provably emit any token)
There is no such thing as “the shortest program for which the halting property is uncomputable”. That property is computable trivially for every possible program. What’s uncomputable is always the halting problem for an infinity of programs using one common algorithm.
It is also easy to make up artificial languages in which Kolmogorov complexity is computable for an infinite subset of all possible strings.
You were probably thinking of something else: that there exists a constant L, which depends on the language and a proof system T, such that it’s not possible to prove in T that any string has Kolmogorov complexity larger than L. That is true. In particular, this means that there’s a limit to lower bounds we can establish, although we don’t know what that limit is.
The halting property is semi-decidable: if a program halts, then you can always trivially prove that it halts; you just need to run it. If a program does not halt, then sometimes you can prove that it doesn’t halt, and sometimes you can’t prove anything.
For any programming language, there exists a length such that you can compute the halting property for all programs shorter than that. At length L_max, there will be a program Bad_0, which:
Does not halt.
Can’t be proven not to halt.
Doesn’t emit any output symbol.
It can’t be proven that there exists any string of output symbols that it doesn’t emit.
You can never prove that any string S has Kolmogorov complexity K if K > L_max, as it would imply that you proved that Bad_0 doesn’t halt, or at least doesn’t emit any symbol which is not a prefix of S.
Since there are only finitely many strings with complexity up to L_max, we can only compute Kolmogorov complexity for finitely many strings.
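The one-sided nature of this is easy to demonstrate: running a program proves halting whenever it halts, and proves nothing otherwise. A sketch, using a Python generator as a stand-in for single-stepping a program (my framing, not anything from the thread):

    import itertools

    def semi_decide_halts(program, max_steps=None):
        """program is a zero-argument generator function; each yield is one
        step. If it halts we see StopIteration and return True -- a proof
        by execution. If it never halts and max_steps is None, we loop
        forever: that one-sidedness is semi-decidability. A step budget
        only buys us a "don't know" answer (None), never a proof of
        non-halting."""
        stepper = program()
        budget = itertools.count() if max_steps is None else range(max_steps)
        for _ in budget:
            try:
                next(stepper)
            except StopIteration:
                return True
        return None

    def halts_quickly():
        for _ in range(10):
            yield

    def loops_forever():
        while True:
            yield

    print(semi_decide_halts(halts_quickly))                   # True
    print(semi_decide_halts(loops_forever, max_steps=10**6))  # None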
If the language is Turing-complete I don’t think this is possible. If you think that an arbitrary string S has complexity K, how can you prove that there exists no program shorter than K that computes S?
[retracted]
Yea, only talking about the general case. Even the Halting Problem is usually computable in daily life, just by running the program (ETA: and seeing it terminate).
Watch the program run
See it die before your eyes
Halting Problem? Solved!
Exactly. So why bother saying gzip is an approximation of KC? (I assume: because KC is a well-known theoretical object with good properties, and one shouldn’t let the fact that none of these properties carry over to gzip ruin one’s chance of getting published cheaply.)
Because gzip is being used to measure complexity. That’s literally the reason they used gzip and not, I don’t know, rot13. It’s an explanation of the causal role that gzip is playing in the whole process.
No.
“gzip is being used to measure complexity” is an explanation of the causal role that gzip is playing in this study.
“gzip is an approximation of KC” is either 1) flatly not true (see the edit to the grandparent) or 2) not relevant to the study at all.
And while we’re at it, why do we even care about Turing Machines, it’s not like we could ever build one anyways. ;-)
goes back to tending his potato garden