I propose a new version of the Turing test: An AI is as smart as a human when it can figure out which new poster on a message board is actually the same person as an old poster who got banned. :)
This is an old problem, see e.g.:
https://www.cs.berkeley.edu/~russell/papers/ifsa01-identity.pdf
for special cases. Both Google and Facebook are very interested in versions of this problem.
I don’t think this is actually a difficult problem. Some simple machine learning on word frequencies, bigram frequencies, etc. will probably be enough to solve it.
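A minimal sketch of such a baseline, assuming nothing fancier than word/bigram counts and cosine matching (the names and the corpus in the usage note are hypothetical):

    from collections import Counter
    from math import sqrt

    def features(text):
        """Bag of words plus word bigrams: a crude stylometric fingerprint."""
        words = text.lower().split()
        bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
        return Counter(words) + Counter(bigrams)

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norms = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norms if norms else 0.0

    def most_similar_author(anonymous_text, corpus):
        """corpus maps author -> concatenation of their known texts."""
        anon = features(anonymous_text)
        return max(corpus, key=lambda author: cosine(anon, features(corpus[author])))

    # Hypothetical usage:
    #   corpus = {"banned_user": old_posts, "regular_a": posts_a, "regular_b": posts_b}
    #   print(most_similar_author(new_suspicious_post, corpus))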
A friend of mine did it via computational complexity: using gzip (as an approximation of KC) to attribute classical Latin literature to its respective authors, by checking which strings add the least additional complexity (due to shared writing style, word choice, etc.) when compressed together, and then clustering. Worked like a charm.
ETA: These were large bodies of text, however. Probably not gonna work for a bundle of comments, except for me, due to my overuse of “obviously”, obviously.
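One standard way to formalize “adds the least additional complexity when compressed together” is the normalized compression distance of Cilibrasi and Vitányi; below is a minimal gzip sketch of the idea (a reconstruction, not the friend’s actual code; the author names in the usage note are placeholders):

    import gzip

    def C(data: bytes) -> int:
        """Compressed size, standing in for Kolmogorov complexity."""
        return len(gzip.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        """Normalized compression distance: small when x and y share
        structure (style, vocabulary), so compressing them together
        is nearly free."""
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def attribute(anonymous: bytes, candidates: dict) -> str:
        """Pick the known author whose corpus is closest in NCD."""
        return min(candidates, key=lambda name: ncd(anonymous, candidates[name]))

    # Placeholder usage; real inputs would be whole works, not snippets:
    #   authors = {"Caesar": caesar_text, "Cicero": cicero_text}
    #   print(attribute(unknown_text, authors))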
I thought this was pretty impressive:
We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors.

[...]

In experiments where we match a sample of just 3 blog posts against the rest of the posts from that blog (mixed in with 100,000 other blogs), the nearest-neighbor/RLSC combination is able to identify the correct blog in about 20% of cases; in about 35% of cases, the correct blog is one of the top 20 guesses. Via confidence estimation, we can increase precision from 20% to over 80% with a recall of 50%, which means that we identify 50% of the blogs overall compared to what we would have if we always made a guess.

The efficacy of the attack varies based on the number of labeled and anonymous posts available. Even with just a single post in the anonymous sample, we can identify the correct author about 7.5% of the time (without any confidence estimation). When the number of available posts in the sample increases to 10, we are able to achieve a 25% accuracy. Authors with relatively large amounts of content online (about 40 blog posts) fare worse: they are identified in over 30% of cases (with only 3 posts in the anonymous sample).

[...]

Further, we confirmed that our techniques work in a cross-context setting: in experiments where we match an anonymous blog against a set of 100,000 blogs, one of which is a different blog by the same author, the nearest-neighbor classifier can correctly identify the blog by the same author in about 12% of cases. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in Appendix A.

The strength of the deanonymization attack we have presented is only likely to improve over time as better techniques are developed. Our results thus call into question the viability of anonymous online speech. Even if the adversary is unable to identify the author using our methods in a fully automated fashion, he might be able to identify a few tens of candidates for manual inspection as we detail in Section III.
The difference was one of scale. It’s much easier when taking just three dozen(?) pieces of classical Latin literature, some of which were different parts of the same magnum opus, and then seeing them cluster with their respective authors and with the other parts of the same piece. More of a “put the pieces into the box” job as opposed to a 100,000-piece puzzle. In the latter case, you just know most of the puzzle pieces will show either the blue sky or the blue sea, both a similar shade of blue.
Commenting to ‘save’ this comment. That’s a really clever way to handle that.
KC = Kolmogorov Complexity?
Yeap, the paper linked in my other comment explains how it works.
gzip is a “crude upper bound” on KC, and afaik there is no known bound on the error.
EDIT: I’m pretty sure the following result isn’t even true: If KC(x) < KC(y), then gzip(x) < gzip(y). Or even quantitatively weaker variants of the same.
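A concrete way to see why the implication in the EDIT fails: a seeded PRNG stream has tiny KC (the generating program is a few lines) yet is incompressible noise to gzip, while a shorter truly random string almost surely has near-maximal KC:

    import gzip, os, random

    # x: 10,000 bytes from a seeded PRNG. KC(x) is tiny -- a few lines of
    # generator code -- but the bytes are statistically random, so gzip
    # cannot compress them.
    rng = random.Random(0)
    x = bytes(rng.randrange(256) for _ in range(10_000))

    # y: 2,000 bytes from the OS entropy pool. Almost surely KC(y) is
    # close to 2,000, far above KC(x).
    y = os.urandom(2_000)

    print(len(gzip.compress(x)))  # roughly 10,000+: gzip sees only noise
    print(len(gzip.compress(y)))  # roughly 2,000+: smaller, despite larger KC

So KC(x) < KC(y) while gzip(x) > gzip(y), and no amount of tuning gzip removes such cases.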
There can’t be, since KC isn’t computable (though I could be mistaken about that in itself precluding a lower error bound). It’s still kinda nice and works on a similar principle (and also, somewhat, in practice). Let’s plug the Hutter Prize while we’re at it, for the 5 people reading this. Also, just saw a cool paper which kinda describes the same principle, here.
KC may be uncomputable in general, but I’m pretty sure that doesn’t preclude all possible proofs or constructions*, and it seems odd to say that there are no upper bounds when we have what sure look like such things.
* I have just invented a Scheme-like programming language in which all tokens except numbers are 2 alphanumeric characters long; what is the Kolmogorov complexity of the bitstring ‘1’? ‘Who knows, KC is uncomputable -’ Bzzt! The answer is that the KC is exactly 1, since the shortest program which emits the bitstring ‘1’ is a program consisting of the constant ‘1’, which evaluates to ‘1’ and which, as you can see, is indeed of length 1; all other programs emitting the bitstring ‘1’ are longer by definition.
Or if you don’t like that example, consider taking your favorite language and enumerating all possible programs in length order; consider the outputs of the programs you’ve just enumerated when each is evaluated for up to the highest available Busy Beaver number of steps (to avoid nontermination issues), and the respective lengths of those outputs—have you not found the KC for those outputs? If you run gzip over the outputs to get estimated KCs, are those not upper bounds on the actual KCs you proved?
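As a toy illustration of that enumeration (my construction, not anything from the thread): a Brainfuck-style language, programs enumerated in length order, and a fixed step budget standing in for the Busy Beaver bound we cannot actually obtain. Whenever a program halts within budget, the recorded length is a valid upper bound on KC in this language; it is exact only if the budget dominates the relevant halting times.

    import itertools

    OPS = "+-<>.[]"  # a Brainfuck-style toy language

    def run(prog: str, max_steps: int):
        """Run prog for at most max_steps; return its output string,
        or None if it is invalid or was cut off (possible non-termination)."""
        stack, jump = [], {}
        for i, c in enumerate(prog):          # precompute bracket matching
            if c == "[":
                stack.append(i)
            elif c == "]":
                if not stack:
                    return None               # unbalanced: invalid program
                j = stack.pop()
                jump[i], jump[j] = j, i
        if stack:
            return None
        tape, ptr, pc, out = [0] * 256, 0, 0, []   # small circular tape
        for _ in range(max_steps):
            if pc >= len(prog):
                return "".join(out)           # halted normally
            c = prog[pc]
            if c == "+": tape[ptr] = (tape[ptr] + 1) % 256
            elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
            elif c == ">": ptr = (ptr + 1) % 256
            elif c == "<": ptr = (ptr - 1) % 256
            elif c == ".": out.append(chr(tape[ptr]))
            elif c == "[" and tape[ptr] == 0: pc = jump[pc]
            elif c == "]" and tape[ptr] != 0: pc = jump[pc]
            pc += 1
        return None                           # budget exhausted: no verdict

    def kc_table(max_len: int, max_steps: int) -> dict:
        """Shortest program length found per output: an upper bound on KC
        in this language, exact only if max_steps dominates the halting
        times of all programs up to max_len (a Busy-Beaver-sized unknown)."""
        kc = {}
        for n in range(1, max_len + 1):
            for prog in map("".join, itertools.product(OPS, repeat=n)):
                out = run(prog, max_steps)
                if out is not None and out not in kc:
                    kc[out] = n               # first hit is shortest
        return kc

    # e.g. kc_table(3, 1000) maps "\x01" -> 2, via the program "+."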
Computing upper bounds on Kolmogorov complexity is not very difficult: gzip and all the other compression algorithms do it. The difficulty is computing non-trivial lower bounds:
For all programming languages (with a self-terminating encoding), there is a trivial lower bound that doesn’t depend on the string. This bound is at least one token.
But there is also a language-dependent constant L_max, which is the maximum KC (and the maximum lower bound on KC) that you can compute for any string: L_max is the length of the shortest program for which the halting property is uncomputable ( * ) (which makes L_max uncomputable as well).
This implies that you can compute the KC complexity only for a finite number of strings.
( * And doesn’t provably emit any token)
There is no such thing as “the shortest program for which the halting property is uncomputable”. That property is computable trivially for every possible program. What’s uncomputable is always the halting problem for an infinity of programs using one common algorithm.
It is also easy to make up artificial languages in which Kolmogorov complexity is computable for an infinite subset of all possible strings.
You were probably thinking of something else: that there exists a constant L, which depends on the language and a proof system T, such that it’s not possible to prove in T that any string has Kolmogorov complexity larger than L. That is true. In particular, this means that there’s a limit to lower bounds we can establish, although we don’t know what that limit is.
The halting property is semi-decidable: if a program halts, then you can always trivially prove that it halts; you just need to run it. If a program does not halt, then sometimes you can prove that it doesn’t halt, and sometimes you can’t prove anything.
For any programming language, there exists a length such that you can compute the halting property for all programs shorter than that. At length L_max, there will be a program Bad_0, which:
Does not halt.
Can’t be proven not to halt.
Doesn’t emit any output symbol.
It can’t be proven that there exists any string of output symbols that it doesn’t emit.
You can never prove that any string S has Kolmogorov complexity K if K > L_max, as it would imply that you proved that Bad_0 doesn’t halt, or at least doesn’t emit any symbol which is not a prefix of S.
Since there are only finitely many strings with complexity up to L_max, we can only compute Kolmogorov complexity for finitely many strings.
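The one-sided nature of this is easy to demonstrate: running a program proves halting whenever it halts, and proves nothing otherwise. A sketch, using a Python generator as a stand-in for single-stepping a program (my framing, not anything from the thread):

    import itertools

    def semi_decide_halts(program, max_steps=None):
        """program is a zero-argument generator function; each yield is one
        step. If it halts we see StopIteration and return True -- a proof
        by execution. If it never halts and max_steps is None, we loop
        forever: that one-sidedness is semi-decidability. A step budget
        only buys us a "don't know" answer (None), never a proof of
        non-halting."""
        stepper = program()
        budget = itertools.count() if max_steps is None else range(max_steps)
        for _ in budget:
            try:
                next(stepper)
            except StopIteration:
                return True
        return None

    def halts_quickly():
        for _ in range(10):
            yield

    def loops_forever():
        while True:
            yield

    print(semi_decide_halts(halts_quickly))                   # True
    print(semi_decide_halts(loops_forever, max_steps=10**6))  # None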
If the language is Turing-complete I don’t think this is possible. If you think that an arbitrary string S has complexity K, how can you prove that there exists no program shorter than K that computes S?
[retracted]
Yea, only talking about the general case. Even the Halting Problem is usually computable in daily life, just by running the program (ETA: and seeing it terminate).
Watch the program run
See it die before your eyes
Halting Problem? Solved!
Exactly. So why bother saying gzip is an approximation of KC? (I assume: because KC is a well-known theoretical object with good properties, and one shouldn’t let the fact that none of these properties carry over to gzip ruin one’s chance of getting published cheaply.)
Because gzip is being used to measure complexity. That’s literally the reason they used gzip and not, I don’t know, rot13. It’s an explanation of the causal role that gzip is playing in the whole process.
No.
“gzip is being used to measure complexity” is an explanation of the causal role that gzip is playing in this study.
“gzip is an approximation of KC” is either 1) flatly not true (see the edit to the grandparent) or 2) not relevant to the study at all.
And while we’re at it, why do we even care about Turing Machines, it’s not like we could ever build one anyways. ;-)
goes back to tending his potato garden