Staged release

“Staged release” is regularly mentioned as a good thing for frontier AI labs to do. But I’ve only ever seen one analysis of staged release,^[1] and the term’s meaning has changed, becoming vaguer since the GPT-2 era.

This post is kinda a reference post, kinda me sharing my understanding/takes/confusions to elicit suggestions, and kinda a call for help from someone who understands what labs should do on this topic.

OpenAI released the weights of GPT-2 over the course of 9 months in 2019,^[2] and they called this process “staged release.” In the context of GPT-2 and releasing model weights, “staged release” means releasing a less powerful version before releasing the full system, and using the intervening time to notice issues (to fix them or inform the full release).

But these days we talk mostly about models that are only released^[3] via API. In this context, “staged release” has no precise definition; it generally means releasing narrowly at first and using the narrow release to notice and fix issues. This could entail:

Releasing a weaker version before releasing the full model.
Releasing only to a small number of users at first.
Releasing without full access or for only some applications at first, e.g. disabling fine-tuning and plugins.

...I’m kinda skeptical. These days it’s rare for a release to advance the frontier substantially, and for small jumps, there seems to be little need to make the jumps smoother. This is at least true for now, since we have not reached warning signs for dangerous capabilities. When we enter the danger zone where smallish advances in capabilities could enable catastrophic misuse, staged release will be more important.

And if a lab is doing everything else well—risk assessment with evals or red-teaming that successfully elicit any dangerous capabilities, releasing with a safety buffer in case the risk assessment underestimated dangerous capabilities, adversarial robustness, and monitoring model inputs and fine-tuning data for jailbreaks and misuse—staged release seems superfluous. But my impression is no lab is doing everything else very well yet. If the labs are bad at pre-release risk assessment and bad at preventing misuse at inference-time, staged release is more important.

What do I want frontier labs to do on staged release (for closed models, for averting misuse)? I think it’s less important than most other asks, but tentatively:

For frontier models,^[4] initially release them without access to fine-tuning or powerful scaffolding. Use narrow release to identify and fix issues. (Or initially give more access only to trusted users, or to few enough untrusted users that you can monitor them closely, and then actually monitor them closely.)

I just made this up; probably there is a better ask on staged release. Suggestions are welcome.
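To make the ask above a bit more concrete, here is a minimal sketch of what the access rule could look like as code. Everything in it is made up for illustration: the `Stage`, `User`, and `allowed` names, the feature strings, and the idea that "trusted" and "closely monitored" are simple flags. It is not how any lab actually gates access.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    NARROW = 1  # initial release: restricted features are gated
    FULL = 2    # later release, after narrow-release issues are addressed


@dataclass(frozen=True)
class User:
    name: str
    trusted: bool = False            # hypothetical: vetted by the lab
    closely_monitored: bool = False  # hypothetical: in a small monitored cohort


# Hypothetical feature names for the higher-risk kinds of access.
RESTRICTED = {"fine_tuning", "scaffolding"}


def allowed(user: User, feature: str, stage: Stage) -> bool:
    """Return True if `user` may use `feature` at this release stage."""
    if feature not in RESTRICTED:
        return True  # ordinary access is open from the start
    if stage is Stage.FULL:
        return True  # restrictions lift once the narrow stage is over
    # During the narrow stage, restricted features go only to trusted
    # users or to the few untrusted users being monitored closely.
    return user.trusted or user.closely_monitored
```

The hard parts are exactly what the sketch assumes away: deciding who counts as trusted, how many untrusted users can really be monitored closely, and when to move from the narrow stage to the full one.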

^[1] Toby Shevlane’s dissertation. I don’t recommend reading it.
^[2] From OpenAI’s GPT-2 staged release report:
In February 2019, we released the 124 million parameter GPT-2 language model. In May 2019, we released the 355 million parameter model and a dataset of outputs from all four models (124 million, 355 million, 774 million, and 1.5 billion parameters) to aid in training humans and classifiers to detect synthetic text, and assessing biases encoded in GPT-2 generated outputs. In August, we released our 774 million parameter model along with the first version of this report and additional release documentation on GitHub. We are now [in November] releasing our 1.5 billion parameter version of GPT-2 with this updated report and updated documentation.
^[3] This post mostly-arbitrarily uses “release” and not “deploy.” (I believe “deployment” includes use exclusively within the lab while “release” requires external use; in this post we’re basically concerned with misuse by actors outside the lab.)
^[4] Or rather, models that are plausibly nontrivially better for some misuse-related tasks than any other released model.