I think the way the more targeted, ‘working-with-finetuning’ version of evals would handle this kind of case is that you do your best to train the model to use its full capabilities, and to approach tasks in a model-idiomatic way, when given a particular target like scamming someone.
In the case of inferring author information, I think the souped-up skilled-attacker version would not involve prompts at all.
You would treat it as an embedding problem similar to facial recognition or stylometric identification, and use something like a triplet loss for contrastive learning. Then you would have an embedding you can decode sensitive personal information from.
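To make that concrete, here is a minimal sketch of what the contrastive training side might look like (assuming PyTorch; the toy character-trigram featurizer below is just a stand-in for a real pretrained LLM encoder, and all names and data are illustrative rather than any particular system):

```python
# Minimal sketch: contrastive (triplet-loss) training of a style-embedding model.
# Assumptions: PyTorch; the hashed character-trigram featurizer is a toy stand-in for a
# real pretrained LLM encoder; corpus and variable names are purely illustrative.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

N_FEATURES = 4096   # hashed character-trigram buckets
EMBED_DIM = 256     # dimensionality of the style embedding

def featurize(text):
    """Hash character trigrams into a fixed-size count vector (toy tokenizer stand-in)."""
    v = torch.zeros(N_FEATURES)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % N_FEATURES] += 1.0
    return v

class StyleEncoder(nn.Module):
    """Maps a text's feature vector to a unit-norm style embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 1024), nn.ReLU(),
            nn.Linear(1024, EMBED_DIM),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def triplet(corpus):
    """corpus: dict of author -> list of texts. Returns (anchor, positive, negative) features."""
    a_auth, n_auth = random.sample(list(corpus), 2)
    anchor, positive = random.sample(corpus[a_auth], 2)   # two non-overlapping texts, same author
    negative = random.choice(corpus[n_auth])              # text by a different author
    return (featurize(anchor).unsqueeze(0),
            featurize(positive).unsqueeze(0),
            featurize(negative).unsqueeze(0))

corpus = {
    "author_A": ["first text by author A ...", "second, non-overlapping text by author A ..."],
    "author_B": ["a comment author B wrote ...", "more text by author B ..."],
}

encoder = StyleEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
loss_fn = nn.TripletMarginLoss(margin=0.2)

for step in range(1000):
    a, p, n = triplet(corpus)
    # Pull same-author embeddings together, push different-author embeddings apart.
    loss = loss_fn(encoder(a), encoder(p), encoder(n))
    opt.zero_grad()
    loss.backward()
    opt.step()
```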
So for example, in stylometrics, to train an ML model, you would have a large text dataset of authors+texts, and you would train a model to take a text and spit out an embedding, forcing embeddings of random samples of non-overlapping text from the same author to be close together and far away from embeddings of random texts from other (possibly unlabeled) authors. You would then take a dataset of authors+author-metadata (possibly a different dataset, possibly the same dataset, if only for ‘author name’), and train another model (possibly the same model) to take the (frozen) embedding of all the texts and predict the author-metadata. This lets you take a piece of text, such as an anonymous comment on a LW post, embed it, and compare the similarity of that embedding to comments with labeled authors (or to unlabeled texts) to get a ranked list of candidate authors (or of other texts possibly by the same anonymous author). You can also extract estimated demographic and other information which can be inferred from the language (including the name, if the author is reasonably well-known), estimate the number of distinct authors and cluster texts (which may let you infer activity patterns & timings etc), or pass the embedding into still further ML systems for arbitrary use...
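Continuing the sketch above, the attack side would then just be embedding, nearest-neighbor lookup against labeled texts, and a small decoder head trained on the frozen embeddings (again assuming PyTorch and reusing the hypothetical `encoder`/`featurize` from before; every dataset and name here is an illustrative placeholder):

```python
# Minimal sketch of inference: candidate-author retrieval plus metadata decoding.
# Assumes the `encoder`, `featurize`, and EMBED_DIM defined in the previous sketch;
# `metadata_head` and all data are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1. Candidate-author retrieval: embed labeled comments once, then rank by cosine similarity
#    (the embeddings are unit-normalized, so a dot product is cosine similarity).
labeled = [("author_A", "a labeled comment by author A ..."),
           ("author_B", "a labeled comment by author B ...")]
with torch.no_grad():
    index = torch.cat([encoder(featurize(t).unsqueeze(0)) for _, t in labeled])  # (N, EMBED_DIM)

def candidate_authors(anon_text, k=5):
    with torch.no_grad():
        q = encoder(featurize(anon_text).unsqueeze(0))        # (1, EMBED_DIM)
    sims = (index @ q.T).squeeze(1)                           # similarity to every labeled text
    top = torch.topk(sims, min(k, len(labeled)))
    return [(labeled[i][0], sims[i].item()) for i in top.indices.tolist()]

# 2. Metadata decoding: a small head trained separately (e.g. with cross-entropy on author
#    metadata) on top of the *frozen* embedding to predict attributes such as age bracket
#    or native language.
metadata_head = nn.Linear(EMBED_DIM, 10)   # e.g. 10 attribute classes

def predict_metadata(anon_text):
    with torch.no_grad():
        emb = encoder(featurize(anon_text).unsqueeze(0))
        return F.softmax(metadata_head(emb), dim=-1)          # distribution over attribute classes

print(candidate_authors("an anonymous comment on a LW post ..."))
```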
Because it’s hard to change writing styles, even attempts to obfuscate writing will probably fail, and you can train on that as well: there are a number of private-sector companies which sell stylometric services to law enforcement etc., and I would assume that they have datasets of ‘trying to hide’ authors where the authors were later busted or the accounts/nyms were linked by other methods, which can be used as hard-positive cases to further finetune the LLM after the normal training phase.
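That hard-positive pass would just be a short continuation of the same training loop, drawing the ‘positive’ from an author’s obfuscated writing rather than their ordinary writing (reusing the pieces from the sketches above; the linked obfuscated/known-identity pairs are, of course, hypothetical):

```python
# Minimal sketch of hard-positive finetuning on 'trying to hide' authors.
# Assumes `encoder`, `featurize`, `loss_fn`, and `opt` from the earlier sketches;
# the obfuscated/known-identity pairs below are hypothetical placeholders.
import random

hard_positives = [  # (text written while trying to hide, known-identity text by the same author)
    ("obfuscated post, later attributed ...", "known-identity writing by the same author ..."),
    ("another disguised text ...",            "the same author's normal prose ..."),
]
other_texts = ["unrelated text by a different author ...", "more unrelated text ..."]

for step in range(200):                      # short finetuning pass after normal training
    disguised, known = random.choice(hard_positives)
    negative = random.choice(other_texts)
    # Treat the disguised text as the anchor and the author's known writing as the positive,
    # so the encoder learns to see through deliberate obfuscation.
    loss = loss_fn(encoder(featurize(disguised).unsqueeze(0)),
                   encoder(featurize(known).unsqueeze(0)),
                   encoder(featurize(negative).unsqueeze(0)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```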
Facial anonymity is dead and buried. Location anonymity is dead thanks to smartphones but we’re still pretending it’s real (see the Capitol riot). Voice anonymity is waiting for the doctor to arrive and pronounce it dead. And text anonymity is flatlining now.
Having seen what stylometrics could do even with the simplest ML techniques from the 2000s, I strongly advise everyone to start assuming right now that robust stylometric deanonymization will be achieved within the next decade: any nontrivial piece of writing (say, >50 words) will be attributable to you regardless of pseudonymity or anonymity, with at least enough confidence to be useful for a law-enforcement investigation, and possibly enough to get you cancelled on social media or fired, even if the LLM stylometrics do not rise to the level of ‘a smoking gun’.
Further, LLMs are so cheap to run that this may well be done en masse by a motivated hobbyist or activist. So, don’t count on “well, the NSA or FBI would never bother to dox my old comments”; it’ll look more mundane. One day you’ll wake up to find that an activist on Mastodon has announced they’ve finetuned FluffyLlama-11 with contrastive learning on Pushshift and released a giant database re-identifying fascists; then an old enemy or fan will look you up out of curiosity, and a distributed flesh search engine will kick into gear.