Age is extremely compressed/skewed because it’s OKCupid. So I can think of a couple of issues there: there might be a problem of distribution mismatch, where a GPT is trained on a much more even distribution of text (I would assume tons of text is written by people aged 50-100 IRL, rather than by users of a young techie dating website) and so is simply taking into account a very different base rate; another issue is that maybe the GPT is accurate but restriction of range creates misleading statistical artifacts. Binarization wouldn’t help, and might worsen matters given the actual binarization here at age 30 - how many people tweak their age on a dating site to avoid the dreaded leading ‘3’ and turning into Christmas cake? You’ll remember OKCupid’s posts about people shading the truth a little about things like height… (A more continuous loss like mean absolute error might be a better metric than Brier on a binary or categorical outcome.)
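To illustrate the restriction-of-range point, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the age ranges, the noise level, and the names `pearson_r` and `simulate`); none of it is fitted to the OKCupid data. The same hypothetical predictor, with the same error, is scored once over a wide age range and once over a narrow band.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def simulate(n=20000, noise_sd=8.0, seed=0):
    """Hypothetical predictor: true age plus Gaussian noise (assumed, not measured)."""
    rng = random.Random(seed)
    ages = [rng.uniform(18, 80) for _ in range(n)]
    preds = [a + rng.gauss(0, noise_sd) for a in ages]

    # Correlation over the full (wide) age range.
    r_full = pearson_r(ages, preds)

    # Correlation after keeping only a narrow band of true ages.
    narrow = [(a, p) for a, p in zip(ages, preds) if 25 <= a <= 35]
    r_narrow = pearson_r([a for a, _ in narrow], [p for _, p in narrow])
    return r_full, r_narrow

r_full, r_narrow = simulate()
print(f"wide range r = {r_full:.2f}, narrow band r = {r_narrow:.2f}")
```

The predictor's error is identical in both cases, yet the correlation is much lower in the narrow band, which is the kind of artifact a compressed age distribution could produce.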
As far as sexuality goes, this is something the LLMs may be trained very heavily on, with unpredictable effects. But it’s also a much weirder category here too:
Dating sites in general have more males than females, reflecting the mating behavior seen offline (more males being on the lookout). OKCupid features a very broad selection of possible genders. One must choose at least one category and up to 5 categories of which the possible options are: Man, Woman, Agender, Androgynous, Bigender, Cis Man, Cis Woman, Genderfluid, Genderqueer, Gender Nonconforming, Hijra, Intersex, Non-binary, Other, Pangender, Transfeminine, Transgender, Transmasculine, Transsexual, Trans Man, Trans Women and Two Spirit. Nevertheless, almost everybody chooses one of the first two (39.1 % Women, 60.6 % Men, binary total = 99.7 %). The full count by type can be found in the supplementary materials (sheet “Genders”).
I’m not sure how OP handled that. So the predictive power here should be considered as a loose lower bound, given all the potential sources of measurement error/noise.
Gwern’s theories make sense to me. The data was roughly 50/50 on <= 30 vs > 30, so that’s where I split it (and I’m only asking the model to pick one of those two options). Sexuality in the dataset is just male/female; they must have added the other options later (35829 male, 24117 female, and 2 blanks which I ignored). Agreed that this is very much a lower bound, also because I applied zero optimization to the system prompt and user prompts. This is ‘if you do the simplest possible thing, how good is it?’
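As a quick sanity check on those counts, here is a small sketch of the baselines they imply: the accuracy of always guessing the majority class, and the Brier score of always predicting the base rate. The function name `constant_baselines` is hypothetical, and this is just arithmetic on the numbers above, not part of the original analysis.

```python
def constant_baselines(n_pos, n_neg):
    """Baselines for a binary label from class counts alone.

    Returns (majority-class accuracy, Brier score of always predicting
    the base rate p). For a constant prediction p, the Brier score is
    p * (1 - p).
    """
    total = n_pos + n_neg
    p = n_pos / total
    majority_accuracy = max(p, 1 - p)
    brier_base_rate = p * (1 - p)
    return majority_accuracy, brier_base_rate

# Counts quoted above: 35829 male, 24117 female (the 2 blanks are excluded).
acc, brier = constant_baselines(35829, 24117)
print(f"majority-class accuracy: {acc:.3f}")
print(f"base-rate Brier score:   {brier:.3f}")
```

So a model has to beat roughly 60% accuracy on sex to show any signal at all, while for the roughly 50/50 age split the corresponding floor is about 50% accuracy and a Brier score of 0.25.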
No, unfortunately it’s all lowercased already in the dataset.
I agree! Dating site data is somewhat easy mode. I compared gender accuracy on the Persuade 2.0 corpus of students writing essays on a fixed topic, which I consider very much hard mode, and it was still 80% accurate. So I do think it’s getting some advantage from being in easy mode but not that much. I’ll note also that I’m removing a bunch of words that are giveaways for gender, and it only lost 2 percentage points of accuracy. So I do think it’s mostly working from implicit cues and distributional differences here rather than easy giveaways. Staab et al (thanks @gwern for pointing that paper out to me) looks more at explicit cues and compares to human investigators looking for explicit cues, so you may find that interesting as well.
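For concreteness, removing giveaway words could look something like the sketch below. The word list and the function name `strip_giveaways` are hypothetical, chosen only for illustration; the actual list used is not given in this thread. It assumes the text is already lowercased, as it is in the dataset.

```python
import re

# Hypothetical giveaway list, for illustration only.
GIVEAWAYS = {"man", "woman", "guy", "girl", "boyfriend", "girlfriend",
             "husband", "wife"}

def strip_giveaways(text, giveaways=GIVEAWAYS):
    """Remove whole-word matches of the giveaway terms from lowercased text."""
    # \b anchors keep e.g. "manual" intact while still removing "man".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in sorted(giveaways)) + r")\b"
    )
    cleaned = pattern.sub(" ", text)
    # Collapse the whitespace left behind by removed words.
    return " ".join(cleaned.split())

print(strip_giveaways("i am a guy who loves hiking and manual labor"))
```

A filter like this only catches explicit tokens, so whatever accuracy remains afterwards has to come from subtler cues such as topic and word-frequency differences.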