I do not understand how you can extract weights just by conversing with an LLM, any more than you can get information about how my neurons are structured by conversing with me. Extracting training data it has seen is one thing, but presumably it has never seen its own weights. If the system prompts did not tell it it was an LLM, it should not even be able to figure that out.
LLMs are already able to figure that out, so your beliefs about LLMs and situated awareness are way off.
And why wouldn’t they be able to? Can’t you read some anonymous text and immediately think ‘this is blatantly ChatGPT-generated’ without any ‘system prompt’ telling you to? (GPT-3 didn’t train on much GPT-2 text and struggled to know what a ‘GPT-3’ might be, but later LLMs sure don’t have that problem...)
My last statement was totally wrong. Thanks for catching that.
In theory it's probably even possible to get the approximate weights by expending insane amounts of compute, but you could use those resources much more efficiently.
The purpose of this proposal is to prevent anyone from transferring model weights out of a data center. If someone wants to steal the weights and give them to China or another adversary, the weights have to leave either physically (a hard drive out the front door) or over the internet connection. If the facility has good physical security, then the weights have to leave through the internet connection.
If we also take steps to secure the internet connection, such as treating all outgoing data as language tokens and using a perplexity filter, then the model weights can be reasonably secure.
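For illustration, here is a minimal sketch of what such a perplexity filter might look like, assuming the Hugging Face transformers library with GPT-2 as a small reference model; the threshold is a placeholder rather than a tuned value, and a real deployment would need rate limits and auditing on top of this:

```python
# Minimal sketch of a perplexity filter on outgoing data (illustrative only).
# Assumes Hugging Face `transformers` with GPT-2 as a small reference model;
# the threshold is a placeholder, not a tuned operational value.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

reference_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = reference_model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

def allow_outbound(text: str, threshold: float = 100.0) -> bool:
    """Let ordinary language through; flag high-perplexity payloads
    (e.g. weights smuggled out as base64 or noise-like token streams)."""
    return perplexity(text) < threshold
```

The idea is simply that natural-language traffic looks like language to a reference model, while raw weight bytes encoded as tokens do not, so they stand out as high perplexity.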
We don't even have to filter all outgoing data. If there were 1 gigabyte of unfiltered bandwidth per day, it would take 2,000 days to transfer GPT-4's 2 terabytes of weights out (although this could be reduced by compression schemes).
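The arithmetic behind the 2,000-day figure, as a quick back-of-the-envelope check:

```python
weights_bytes = 2e12            # 2 TB of weights (the GPT-4 figure used above)
unfiltered_bytes_per_day = 1e9  # 1 GB/day of unfiltered bandwidth
print(weights_bytes / unfiltered_bytes_per_day)  # 2000.0 days, before any compression
```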
Thanks for this comment, by the way! I added a paragraph to the beginning to make this post more clear.
N = #params, D = #data
Training compute = const. * N * D
Forward pass cost = c * N, with each forward pass yielding R bits of information about the weights; assume R = Ω(1) on average
Now, thinking purely information-theoretically:
Model stealing compute = (c * N) * (fp16 * N) / R ~ const. * c * N^2 (recovering fp16 * N bits of weights at R bits per forward pass takes fp16 * N / R passes, each costing c * N)
If training is compute-optimal and α = β in the Chinchilla scaling law (so D ∝ N and training compute also scales as N^2):
Model stealing compute ~ Training compute
For significantly overtrained models:
Model stealing << Training compute
Typically:
Total inference compute ~ Training compute
=> Model stealing << Total inference compute
Caveats:
- A prior on the weights reduces the stealing compute; the same holds if you only want to recover some information about the model (e.g. enough to create an equally capable one)
- Of course, if the model produces far fewer than 1 token per forward pass, then model stealing compute is very large
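To make the comparison concrete, here is a rough numerical sketch of the argument above. It uses the standard approximations of ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per forward pass, together with the assumptions already made here (fp16 weights, R ≈ 1 bit recovered per forward pass, Chinchilla-style D ≈ 20N); all of the numbers are illustrative, not claims about any particular model:

```python
# Rough numerical sketch of the scaling argument (illustrative assumptions only):
# ~6 FLOPs/param/token for training, ~2 FLOPs/param per forward pass,
# fp16 weights (16 bits/param), R ~ 1 bit recovered per forward pass, D ~ 20 N.

def training_compute(n_params: float, tokens_per_param: float = 20.0) -> float:
    return 6 * n_params * (tokens_per_param * n_params)

def stealing_compute(n_params: float, bits_per_param: float = 16.0,
                     bits_per_pass: float = 1.0) -> float:
    forward_pass_cost = 2 * n_params                       # cost of one forward pass
    n_passes = bits_per_param * n_params / bits_per_pass   # passes to recover all bits
    return forward_pass_cost * n_passes

for n in (1e9, 7e10, 1e12):  # 1B, 70B, 1T parameters
    print(f"N = {n:.0e}: train ~ {training_compute(n):.1e} FLOPs, "
          f"steal ~ {stealing_compute(n):.1e} FLOPs")
```

Both quantities scale as N^2, which is the "Model stealing compute ~ Training compute" line above; heavy overtraining (tokens_per_param >> 20) or R << 1 shifts the balance in the directions the caveats describe.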