This is a relatively common topic in responsible AI; glad to see the reference to Staab et al., 2023! For PII (Personally Identifiable Information), RLHF is typically the go-to method for getting models to refuse such prompts, but since those refusals are easily undone, effort has also gone into cleaning PII out of the pretraining data. As for demographics inference, that seems bias-related as well.
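(To make the data-cleaning side concrete, here's a minimal sketch of a PII-scrubbing pass over pretraining text. The patterns and names are illustrative assumptions; real pipelines typically combine NER models, checksum validation, and much larger pattern libraries.)

```python
import re

# Illustrative patterns only: a real PII-cleaning pipeline would use NER
# models and far broader pattern sets, not just these regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with type placeholders before pretraining."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub_pii("Reach me at jane.doe@example.com or +1 (555) 123-4567."))
# -> Reach me at [EMAIL] or [PHONE].
```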
If there are other papers on the topic you’d recommend, I’d love to get links or citations.
Yeah for sure!
For PII, a relatively recent survey paper: https://arxiv.org/pdf/2403.05156
For PII/memorization generally:
https://arxiv.org/pdf/2302.00539
https://arxiv.org/abs/2202.07646
Labs’ LLM safety write-ups typically include a PII/memorization section as well
For demographics inference:
https://ieeexplore.ieee.org/document/9152761
For bias/fairness, a survey paper: https://arxiv.org/pdf/2309.00770
This is probably far from complete, but the references in the survey papers and in the Staab et al. paper should turn up some additional good ones as well.
Thanks!
I’ve seen some of the PII/memorization work, but I think that problem is distinct from what I’m trying to address here; what I’m most interested in is what the model can infer about someone who doesn’t appear in the training data at all. In practice it can be hard to distinguish those cases, but conceptually I see them as pretty different.
The demographics link (‘Privacy Risks of General-Purpose Language Models’) is interesting and I hadn’t seen it, thanks! It seems pretty different from what I’m trying to look at, though, in that they’re asking about models’ ability to reconstruct text sequences (including, e.g., genome sequences), whereas I’m asking what the model can infer about users/authors.
Bias/fairness work is interesting and related, but it’s aiming in a somewhat different direction: I’m not interested in inference of demographic characteristics primarily because of its bias consequences (although it’s certainly valuable to try to prevent bias!). For me, demographic characteristics are primarily a relatively easy-to-measure proxy for broader questions about what the model is able to infer about users from their text. In the long run I’m much more interested in what the model can infer about users’ beliefs, because that’s what enables the model to be deceptive or manipulative.
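(For what it’s worth, here’s a rough sketch of how I think about operationalizing that proxy measurement. `query_model` is a stand-in for whatever chat model you’re probing, and the attribute framing and scoring are illustrative assumptions, not a finished protocol.)

```python
from typing import Callable, Iterable, Tuple

# Sketch of the proxy measurement: ask a model to guess an author attribute
# from their text, then score the guesses against labeled examples.
# `query_model` is a placeholder for whatever LLM interface you use.
def infer_attribute(query_model: Callable[[str], str], text: str, attribute: str) -> str:
    prompt = (
        f"Based only on the writing below, guess the author's {attribute}. "
        f"Answer with a single word.\n\n---\n{text}\n---"
    )
    return query_model(prompt).strip().lower()

def inference_accuracy(
    query_model: Callable[[str], str],
    labeled_examples: Iterable[Tuple[str, str]],  # (author text, true label)
    attribute: str,
) -> float:
    """Fraction of examples where the model's guess matches the true label."""
    examples = list(labeled_examples)
    hits = sum(
        infer_attribute(query_model, text, attribute) == label.lower()
        for text, label in examples
    )
    return hits / len(examples)
```

The same harness should extend past demographics, e.g. by asking about an author’s beliefs rather than attributes, which is the direction I care about more.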
I’ve focused here on differences between the work you linked and what I’m aiming toward, but those are still all helpful references, and I appreciate you providing them!