gwern comments on The case for more ambitious language model evals

gwern 1 Feb 2024 18:45 UTC
LW: 30 AF: 16
4
AF
It’s unclear where the two intro quotes are from; I don’t recognize them despite being formatted as real quotes (and can’t find in searches). If they are purely hypothetical, that should be clearer.

LLMs definitely do infer a lot about authors of text. This is the inherent outcome of the prediction loss and just a concrete implication of their abilities to very accurately imitate many varying-sized demographics & groups of humans: if you can uncannily mimic arbitrary age groups or countries and responses to economic dilemmas or personality inventories, then you obviously can narrow that size down to groups of n = 1 (ie. individual authors). The most striking such paper I know of at present is probably “Beyond Memorization: Violating Privacy Via Inference with Large Language Models”, Staab et al 2023.

It’s pretty important because it tells you what LLMs do (imitation learning & meta-RL), which are quite dangerous things for them to do, and establishes a large information leak which can be used for things like steganography, coordination between instances, detecting testing vs deployment (for treacherous turns) etc.

It’s also concerning because RLHF is specifically targeted at hiding (but not destroying) these inferences. The model will still be making those latent inferences, it just won’t be making blatant use of them. (For example, one of the early signs of latent inference of author traits was that the Codex models look at how many subtle bugs or security vulnerabilities the prompt code has in it, and they replicate that: if they get buggy or insecure code, they emit more buggy or insecure code, vs more correct code doing the exact same task. IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs. However, RLHF and other forms of training would push them towards emitting the lowest-common denominator of ratings, while the KL constraints & self-supervised finetuning would continue to maintain the underlying inferences.) The most dangerous systems are those that only seem safe.
- janus 3 Feb 2024 1:20 UTC
  LW: 14 AF: 7
  0
  AF Parent
  The two intro quotes are not hypothetical. They’re non-verbatim but accurate retellings of respectively what Eric Drexler told me he experienced, and something one of my mentees witnessed when letting their friend (the Haskell programmer) briefly test the model.
- eggsyntax 2 Feb 2024 0:26 UTC
  3 points
  0
  Parent
  
  you obviously can narrow that size down to groups of n = 1
  
  I’m looking at what LLMs can infer about the current user (& how that’s represented in the model) as part of my research currently, and I think this is a very useful framing; given a universe of n possible users, how much information does the LLM need on average to narrow that universe to 1 with high confidence, with a theoretical minimum of log2(n) bits.
  
  I do think there’s an interesting distinction here between authors who may have many texts in the training data, who can be fully identified, and users (or authors) who don’t; in the latter case it’s typically impossible (without access to external resources) to eg determine the user’s name, but as the “Beyond Memorization” paper shows (thanks for linking that), models can still deduce quite a lot.
  
  It also seems worth understanding how the model represents info about the user, and that’s a key thing I’d like to investigate.
  - Jozdien 2 Feb 2024 0:58 UTC
    2 points
    0
    Parent
    I’m looking at what LLMs can infer about the current user (& how that’s represented in the model) as part of my research currently, and I think this is a very useful framing; given a universe of n possible users, how much information does the LLM need on average to narrow that universe to 1 with high confidence, with a theoretical minimum of log2(n) bits.
    This isn’t very related to what you’re talking about, but it is related and also by gwern, so have you read Death Note: L, Anonymity & Eluding Entropy? People leak bits all the time.
    - eggsyntax 2 Feb 2024 1:05 UTC
      3 points
      0
      Parent
      I have, but it was years ago; seems worth looking back at. Thanks!
- Chris_Leong 15 Feb 2024 2:50 UTC
  LW: 2 AF: 1
  0
  AF Parent
  IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs
  
  You don’t know where they heard that?
  - Cosin V 24 May 2024 20:18 UTC
    1 point
    0
    Parent
    I googled and couldn’t find any info
- Jozdien 1 Feb 2024 23:47 UTC
  LW: 2 AF: 1
  0
  AF Parent
  It’s unclear where the two intro quotes are from; I don’t recognize them despite being formatted as real quotes. If they are purely hypothetical, that should be clearer.
  They’re accounts from people who knows Eric and the person referenced in the second quote. They are real stories, but between not being allowed to publicly share GPT-4-base outputs and these being the most succinct stories I know of, I figured just quoting how I heard it would be best. I’ll add a footnote to make it clearer that these are real accounts.
  It’s pretty important because it tells you what LLMs do (imitation learning & meta-RL), which are quite dangerous things for them to do, and establishes a large information leak which can be used for things like steganography, coordination between instances, detecting testing vs deployment (for treacherous turns) etc.
  It’s also concerning because RLHF is specifically targeted at hiding (but not destroying) these inferences.
  I agree, the difference in perceived and true information density is one of my biggest concerns for near-term model deception. It changes questions like “can language models do steganography / when does it pop up” to “when are they able to make use of this channel that already exists”, which sure makes the dangers feel a lot more salient.
  Thanks for the linked paper, I hadn’t seen that before.