Weights live only in secure datacenter, which can’t run general-purpose code, and all it does is serve the model
Only one physical line out of the datacenter, with physically limited bandwidth
To enable use of model with low bandwidth, compress the text before it leaves the datacenter with a smaller model, and decompress after it’s past the bottleneck
Datacenter is physically guarded by people with intense background screening, large area around it cleared, (maybe everything is in a faraday cage?), 3-person rule for everything, nothing leaves the datacenter (e.g. if some part is replaced, the old part + any tools used are not removed, but are kept in secure storage or incinerated)
Monitoring / control:
Use dumber and probably safe models for monitoring:
Every (prompt, generation) pair going to and from the datacenter is overseen by (finetuned) GPT-4. GPT-4 is instructed to flag anything that seems dangerous, or that it can’t understand, or looks like a jailbreak/advex.
To avoid advexes against GPT-4, subsample and perturb the prompts/generations in a few different ways, and have GPT-4 also screen these
Have large workforce of fairly smart humans (e.g. 1000 workers with technical undergrad degrees, paid $50/hr, 50 experts in a range of relevant domains paid $200/hr) reviewing anything flagged by GPT-4, and some random sample of things that weren’t flagged
Maybe: multiple independent monitoring programs that intervene at different points, are managed by different people, use slightly different versions of GPT-4, etc.
Some pragmatic things:
Minimize risk of theft / exfiltration:
Weights live only in secure datacenter, which can’t run general-purpose code, and all it does is serve the model
Only one physical line out of the datacenter, with physically limited bandwidth
To enable use of model with low bandwidth, compress the text before it leaves the datacenter with a smaller model, and decompress after it’s past the bottleneck
Datacenter is physically guarded by people with intense background screening, large area around it cleared, (maybe everything is in a faraday cage?), 3-person rule for everything, nothing leaves the datacenter (e.g. if some part is replaced, the old part + any tools used are not removed, but are kept in secure storage or incinerated)
Monitoring / control:
Use dumber and probably safe models for monitoring:
Every (prompt, generation) pair going to and from the datacenter is overseen by (finetuned) GPT-4. GPT-4 is instructed to flag anything that seems dangerous, or that it can’t understand, or looks like a jailbreak/advex.
To avoid advexes against GPT-4, subsample and perturb the prompts/generations in a few different ways, and have GPT-4 also screen these
Have large workforce of fairly smart humans (e.g. 1000 workers with technical undergrad degrees, paid $50/hr, 50 experts in a range of relevant domains paid $200/hr) reviewing anything flagged by GPT-4, and some random sample of things that weren’t flagged
Maybe: multiple independent monitoring programs that intervene at different points, are managed by different people, use slightly different versions of GPT-4, etc.