This is a relatively common topic in responsible AI; glad to see the reference to Staab et al., 2023! For PII (Personally Identifiable Information), RLHF is typically the go-to method for getting models to refuse such prompts, but since those refusals are easily undone, effort has also gone into cleaning PII out of the pretraining data. As for demographics inference, that seems bias-related as well.
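(To make the data-cleaning side concrete, here's a minimal sketch of a PII-scrubbing pass over pretraining text. The patterns and names are illustrative assumptions; real pipelines typically combine NER models, checksum validation, and much larger pattern libraries.)

```python
import re

# Illustrative patterns only: a real PII-cleaning pipeline would use NER
# models and far broader pattern sets, not just these regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with type placeholders before pretraining."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub_pii("Reach me at jane.doe@example.com or +1 (555) 123-4567."))
# -> Reach me at [EMAIL] or [PHONE].
```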
If there are other papers on the topic you’d recommend, I’d love to get links or citations.
Yeah for sure!
For PII, a relatively recent survey paper: https://arxiv.org/pdf/2403.05156
For PII/memorization generally:
https://arxiv.org/pdf/2302.00539
https://arxiv.org/abs/2202.07646
Labs’ LLM safety write-ups typically include a PII/memorization section as well
For demographics inference:
https://ieeexplore.ieee.org/document/9152761
For bias/fairness, a survey paper: https://arxiv.org/pdf/2309.00770
This is probably far from complete, but the references in the survey papers and in the Staab et al. paper should turn up some additional good ones as well.
Thanks!
I’ve seen some of the PII/memorization work, but I think that problem is distinct from what I’m trying to address here; what I’m most interested in is what the model can infer about someone who doesn’t appear in the training data at all. In practice it can be hard to distinguish those cases, but conceptually I see them as pretty different.
The demographics link (‘Privacy Risks of General-Purpose Language Models’) is interesting and I hadn’t seen it, thanks! It seems pretty different from what I’m trying to look at, though, in that they’re asking about models’ ability to reconstruct text sequences (including, e.g., genome sequences), whereas I’m asking what the model can infer about users/authors.
Bias/fairness work is interesting and related, but it’s aiming in a somewhat different direction: I’m not interested in inference of demographic characteristics primarily because of its bias consequences (although it’s certainly valuable to try to prevent bias!). For me, demographic characteristics are primarily a relatively easy-to-measure proxy for broader questions about what the model is able to infer about users from their text. In the long run I’m much more interested in what the model can infer about users’ beliefs, because that’s what enables the model to be deceptive or manipulative.
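(For what it’s worth, here’s a rough sketch of how I think about operationalizing that proxy measurement. `query_model` is a stand-in for whatever chat model you’re probing, and the attribute framing and scoring are illustrative assumptions, not a finished protocol.)

```python
from typing import Callable, Iterable, Tuple

# Sketch of the proxy measurement: ask a model to guess an author attribute
# from their text, then score the guesses against labeled examples.
# `query_model` is a placeholder for whatever LLM interface you use.
def infer_attribute(query_model: Callable[[str], str], text: str, attribute: str) -> str:
    prompt = (
        f"Based only on the writing below, guess the author's {attribute}. "
        f"Answer with a single word.\n\n---\n{text}\n---"
    )
    return query_model(prompt).strip().lower()

def inference_accuracy(
    query_model: Callable[[str], str],
    labeled_examples: Iterable[Tuple[str, str]],  # (author text, true label)
    attribute: str,
) -> float:
    """Fraction of examples where the model's guess matches the true label."""
    examples = list(labeled_examples)
    hits = sum(
        infer_attribute(query_model, text, attribute) == label.lower()
        for text, label in examples
    )
    return hits / len(examples)
```

The same harness should extend past demographics, e.g. by asking about an author’s beliefs rather than attributes, which is the direction I care about more.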
I’ve focused here on differences between the work you linked and what I’m aiming toward, but those are still all helpful references, and I appreciate you providing them!