There is a field called forensic linguistics in which investigators use someone’s “linguistic fingerprint” to determine the author of a document (famously instrumental in catching Ted Kaczynski via analysis of his manifesto). It seems like text is also often used to predict things like gender, socioeconomic background, and education level.
If LLMs are superhuman at this kind of work, I wonder whether anyone is developing AI tools to automate it. Maybe the demand is not very strong, but I could imagine, for example, that an authoritarian regime might have a lot of incentive to de-anonymize people. While a company like OpenAI seems likely to have an incentive to hide how much the LLM actually knows about the user, I’m curious where, if anywhere, someone would have a strong incentive to make full use of superhuman linguistic analysis.
Thanks! I’ve been treating forensic linguistics as a subdiscipline of stylometry, which I mention in the related work section, although it’s hard to know from the outside where particular academic boundaries are drawn. My understanding of both is that they’re primarily concerned with identifying specific authors (as in the case of Kaczynski), but that both include forays into investigating author characteristics like gender. There definitely is overlap, although those fields tend to use specialized tools, whereas I’m more interested in the capabilities of general-purpose models, since that’s where more of the overall risk comes from.
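To give a sense of what I mean by “specialized tools,” here is a hypothetical toy sketch (not from the post, and far simpler than real forensic practice) of the kind of stylometric baseline those fields build on: comparing documents by character n-gram frequency profiles and attributing an unknown text to the most similar known author. The example texts and author names are made up for illustration.

```python
# Toy stylometric attribution sketch: character n-gram profiles + cosine similarity.
# Illustrative only; real forensic/stylometric pipelines use much richer features.
from collections import Counter
from math import sqrt

def char_ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical known writing samples and an unknown document to attribute.
known = {
    "author_A": char_ngram_profile("I could imagine, for example, that demand is weak."),
    "author_B": char_ngram_profile("Absolutely agreed, the majority of risk comes from misuse."),
}
unknown = char_ngram_profile("For example, I imagine the demand might be weak.")
best_match = max(known, key=lambda name: cosine_similarity(known[name], unknown))
print(best_match)  # attributes the unknown text to the closest stylistic profile
```

The contrast I care about is that a general-purpose LLM needs none of this feature engineering; you can simply ask it what it can infer about an author, which is exactly why its capabilities are harder to bound.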
If LLMs are superhuman at this kind of work
To be clear, I don’t think that’s been shown yet; I’m personally uncertain at this point. I would be surprised if they didn’t become clearly superhuman at it within another model generation or two, even in the absence of any overall capability breakthroughs.
I could imagine, for example, that an authoritarian regime might have a lot of incentive to de-anonymize people.
Absolutely agreed. The majority of nearish-term privacy risk, in my view, comes from a mix of state and corporate privacy invasion, with a healthy sprinkling of blackmail (though again, I’m personally less concerned about the misuse risk than about the deception/manipulation risk, both from misuse and from possibly misaligned models).