I do not understand how you can extract weights just by conversing with an LLM, any more than you can get information about how my neurons are structured by conversing with me. Extracting training data it has seen is one thing, but presumably it has never seen its own weights. If the system prompts did not tell it it was an LLM, it should not even be able to figure that out.
LLMs are already able to figure that out, so your beliefs about LLMs and situated awareness are way off.
And why wouldn’t they be able to? Can’t you read some anonymous text and immediately think ‘this is blatantly ChatGPT-generated’ without any ‘system prompt’ telling you to? (GPT-3 didn’t train on much GPT-2 text and struggled to know what a ‘GPT-3’ might be, but later LLMs sure don’t have that problem...)
My last statement was totally wrong. Thanks for catching that.
In theory it's probably even possible to get the approximate weights by expending insane amounts of compute, but you could use those resources much more efficiently.
The purpose of this proposal is to prevent anyone from transferring model weights out of a data center. If someone wants to steal the weights and give them to China or another adversary, the weights have to leave either physically (a hard drive out the front door) or over the internet connection. If the facility has good physical security, then the weights have to leave through the internet connection.
If we also take steps to secure the internet connection, such as treating all outgoing data as language tokens and using a perplexity filter, then the model weights can be reasonably secure.
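For illustration, here is a minimal sketch of what such a perplexity filter might look like, assuming the Hugging Face transformers library with GPT-2 as a small reference model; the threshold is a placeholder rather than a tuned value, and a real deployment would need rate limits and auditing on top of this:

```python
# Minimal sketch of a perplexity filter on outgoing data (illustrative only).
# Assumes Hugging Face `transformers` with GPT-2 as a small reference model;
# the threshold is a placeholder, not a tuned operational value.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

reference_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = reference_model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

def allow_outbound(text: str, threshold: float = 100.0) -> bool:
    """Let ordinary language through; flag high-perplexity payloads
    (e.g. weights smuggled out as base64 or noise-like token streams)."""
    return perplexity(text) < threshold
```

The idea is simply that natural-language traffic looks like language to a reference model, while raw weight bytes encoded as tokens do not, so they stand out as high perplexity.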
We don't even have to filter all outgoing data. If there were 1 gigabyte of unfiltered bandwidth per day, it would take 2,000 days to transfer GPT-4's 2 terabytes of weights out (although this could be reduced by compression schemes).
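The arithmetic behind the 2,000-day figure, as a quick back-of-the-envelope check:

```python
weights_bytes = 2e12            # 2 TB of weights (the GPT-4 figure used above)
unfiltered_bytes_per_day = 1e9  # 1 GB/day of unfiltered bandwidth
print(weights_bytes / unfiltered_bytes_per_day)  # 2000.0 days, before any compression
```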
Thanks for this comment, by the way! I added a paragraph to the beginning to make this post more clear.
N = #params, D = #data
Training compute = const. * N * D
Forward pass cost = c * N, with each forward pass yielding R bits of information about the weights; assume R = Ω(1) on average
Now, thinking purely information-theoretically:
Model stealing compute = (c * N) * (fp16 * N) / R ~ const. * c * N^2 (recovering fp16 * N bits of weights at R bits per forward pass takes fp16 * N / R passes, each costing c * N)
If training is compute-optimal and α = β in the Chinchilla scaling law (so D ∝ N and training compute also scales as N^2):
Model stealing compute ~ Training compute
For significantly overtrained models:
Model stealing << Training compute
Typically:
Total inference compute ~ Training compute
=> Model stealing << Total inference compute
Caveats:
- A prior on the weights reduces the stealing compute; the same holds if you only want to recover some information about the model (e.g. enough to create an equally capable one)
- Of course, if the model produces far fewer than 1 token per forward pass, then model stealing compute is very large
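To make the comparison concrete, here is a rough numerical sketch of the argument above. It uses the standard approximations of ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per forward pass, together with the assumptions already made here (fp16 weights, R ≈ 1 bit recovered per forward pass, Chinchilla-style D ≈ 20N); all of the numbers are illustrative, not claims about any particular model:

```python
# Rough numerical sketch of the scaling argument (illustrative assumptions only):
# ~6 FLOPs/param/token for training, ~2 FLOPs/param per forward pass,
# fp16 weights (16 bits/param), R ~ 1 bit recovered per forward pass, D ~ 20 N.

def training_compute(n_params: float, tokens_per_param: float = 20.0) -> float:
    return 6 * n_params * (tokens_per_param * n_params)

def stealing_compute(n_params: float, bits_per_param: float = 16.0,
                     bits_per_pass: float = 1.0) -> float:
    forward_pass_cost = 2 * n_params                       # cost of one forward pass
    n_passes = bits_per_param * n_params / bits_per_pass   # passes to recover all bits
    return forward_pass_cost * n_passes

for n in (1e9, 7e10, 1e12):  # 1B, 70B, 1T parameters
    print(f"N = {n:.0e}: train ~ {training_compute(n):.1e} FLOPs, "
          f"steal ~ {stealing_compute(n):.1e} FLOPs")
```

Both quantities scale as N^2, which is the "Model stealing compute ~ Training compute" line above; heavy overtraining (tokens_per_param >> 20) or R << 1 shifts the balance in the directions the caveats describe.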