Cool work! Some questions:

Do you have any theory as to why the LLM did worse on guessing age/sexuality (relative to both other categories and the baseline)?
Thanks for including some writing samples in Appendix C! They all seem to be in lowercase; was that how they were shown to the LLM? I expect that may be helpful for tokenization reasons, but it also obscures some “real” information about how people write depending on age/gender/etc. So perhaps a person or language model could do even better at identity-guessing if the text had its original capitalization.
More of a comment than a question: I’d speculate that dating profiles, which are written to communicate things about the writer, make it easier to infer the writer’s identity than other text (professional writing, tweets, etc.). I appreciate the data availability problem (and thanks for explaining your choice of dataset), but do you have any ideas for other datasets you could test on?
Age is extremely compressed/skewed because it’s OKCupid. So I can think of a couple of issues there: there might be a problem of distribution mismatch, where a GPT is trained on a much more even distribution of text (I would assume tons of text IRL is written by people aged 50-100, rather than by users of a young techie dating website) and so is simply taking into account a very different base rate; another issue is that maybe the GPT is accurate but restriction of range creates misleading statistical artifacts. Binarization wouldn’t help, and might worsen matters given the actual binarization here at age 30: how many people tweak their age on a dating site to avoid the dreaded leading ‘3’ and turning into Christmas cake? You’ll remember OKCupid’s posts about people shading the truth a little about things like height… (A more continuous loss like median absolute error might be a better metric than a Brier score on a binary or categorical target.)
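To make the metric point concrete, here is a minimal sketch of the two losses; the arrays are made-up stand-ins, not numbers from the post or its actual predictions.

```python
# Sketch (not from the post): comparing a Brier score on the binarized
# age target against a median absolute error on continuous age guesses.
import numpy as np

true_ages = np.array([24, 27, 29, 31, 35, 42])        # hypothetical ground truth
p_over_30 = np.array([0.2, 0.4, 0.6, 0.7, 0.8, 0.9])  # hypothetical P(age > 30)
guessed_ages = np.array([25, 28, 31, 30, 33, 38])      # hypothetical point guesses

# Brier score: mean squared error of the predicted probability vs. the 0/1 outcome.
y = (true_ages > 30).astype(float)
brier = np.mean((p_over_30 - y) ** 2)

# Median absolute error on the raw ages: more robust to the compressed/skewed age range.
mdae = np.median(np.abs(guessed_ages - true_ages))

print(f"Brier (binarized at 30): {brier:.3f}")
print(f"Median absolute error (years): {mdae:.1f}")
```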
As far as sexuality goes, this is something the LLMs may be trained very heavily on, with unpredictable effects. But it’s also a much weirder category here:
Dating sites in general have more males than females, reflecting the mating behavior seen offline (more males being on the lookout). OKCupid features a very broad selection of possible genders. One must choose at least one category and up to 5 categories, of which the possible options are: Man, Woman, Agender, Androgynous, Bigender, Cis Man, Cis Woman, Genderfluid, Genderqueer, Gender Nonconforming, Hijra, Intersex, Non-binary, Other, Pangender, Transfeminine, Transgender, Transmasculine, Transsexual, Trans Man, Trans Women and Two Spirit. Nevertheless, almost everybody chooses one of the first two (39.1% Women, 60.6% Men, binary total = 99.7%). The full count by type can be found in the supplementary materials sheet “Genders”.
I’m not sure how OP handled that. So the predictive power here should be considered a loose lower bound, given all the potential sources of measurement error/noise.
Gwern’s theories make sense to me. The data was roughly 50/50 on ≤ 30 vs. > 30, so that’s where I split it (and I’m only asking the model to pick one of those two options). Sexuality in the dataset is just male/female; they must have added the other options later (35829 male, 24117 female, and 2 blanks which I ignored). Agreed that this is very much a lower bound, also because I applied zero optimization to the system prompt and user prompts. This is ‘if you do the simplest possible thing, how good is it?’
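For reference, a minimal sketch of that split, assuming a pandas DataFrame loaded from the OKCupid dump; the filename and column names (‘age’, ‘sex’) are my guesses, not the post’s actual code.

```python
import pandas as pd

df = pd.read_csv("okcupid_profiles.csv")  # hypothetical filename

# Drop the blank entries, mirroring the 2 ignored rows mentioned above.
df = df[df["sex"].notna() & (df["sex"] != "")]

# Binarize age at 30, which splits this dataset roughly 50/50.
df["age_bucket"] = (df["age"] <= 30).map({True: "<=30", False: ">30"})

print(df["age_bucket"].value_counts(normalize=True))  # should come out close to 0.5 / 0.5
print(df["sex"].value_counts())                       # roughly 35829 m / 24117 f
```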
No, unfortunately it’s all lowercased already in the dataset.
I agree! Dating-site data is somewhat easy mode. I compared gender accuracy on the Persuade 2.0 corpus of students writing essays on a fixed topic, which I consider very much hard mode, and it was still 80% accurate. So I do think it’s getting some advantage from being in easy mode, but not that much. I’ll note also that I’m removing a bunch of words that are giveaways for gender, and doing so only cost 2 percentage points of accuracy. So I do think it’s mostly working from implicit cues and distributional differences here rather than easy giveaways. Staab et al. (thanks @gwern for pointing that paper out to me) looks more at explicit cues and compares to human investigators looking for explicit cues, so you may find that interesting as well.
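For anyone curious what the giveaway-word removal step might look like, here is a rough sketch; the word list and function name are illustrative, not the ones actually used in the experiment.

```python
# Sketch of stripping explicit gendered giveaway words before showing text to the model.
# The word list below is a hypothetical example, not the experiment's actual list.
import re

GIVEAWAY_WORDS = {"girl", "guy", "boyfriend", "girlfriend", "husband", "wife"}

def strip_giveaways(text: str) -> str:
    """Replace whole-word matches of giveaway words with a neutral placeholder."""
    pattern = r"\b(" + "|".join(map(re.escape, GIVEAWAY_WORDS)) + r")\b"
    return re.sub(pattern, "[removed]", text, flags=re.IGNORECASE)

print(strip_giveaways("my husband says i'm a gamer girl at heart"))
# -> "my [removed] says i'm a gamer [removed] at heart"
```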