AI labs can boost external safety research
Frontier AI labs can boost external safety researchers by
Sharing better access to powerful models (early access, fine-tuning, helpful-only,[1] filters/moderation-off, logprobs, activations)[2]; see the sketch of activations access after this list
Releasing research artifacts besides models
Publishing (transparent, reproducible) safety research
Giving API credits
Mentoring
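For concreteness, here is a minimal sketch of what "activations" access looks like in practice, using an open-weights model and a PyTorch forward hook. The model name (gpt2) and layer index are illustrative stand-ins; a frontier lab would instead have to expose equivalent hooks through its API or a research sandbox.

```python
# Illustrative sketch of "activations" access using an open-weights model.
# Model name and layer index are arbitrary choices for the example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any open-weights model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

captured = {}

def save_activation(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the hidden states,
    # shape (batch, seq_len, hidden_dim).
    captured["block_6"] = output[0].detach()

handle = model.transformer.h[6].register_forward_hook(save_activation)

inputs = tokenizer("Interpretability research needs activations.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

print(captured["block_6"].shape)  # (1, seq_len, 768) for GPT-2 small
```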
Here’s what the labs have done (besides just publishing safety research[3]).
Anthropic:
Releasing resources including RLHF and red-teaming datasets, an interpretability notebook, and model organisms prompts and transcripts (see the dataset-loading sketch after this list)
Supporting creation of safety-relevant evals and tools for evals
Giving free API access to some OP grantees and giving some researchers $1K (or sometimes more) in API credits
(Giving deep model access to Ryan Greenblatt)
(External mentoring, in particular via MATS)
[No fine-tuning or deep access, except for Ryan]
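As an example of what the released datasets enable, here is a hedged sketch of loading Anthropic's public RLHF preference data, assuming the datasets in question include the Anthropic/hh-rlhf repo on Hugging Face.

```python
# Hedged sketch: loading Anthropic's released RLHF preference data
# (Anthropic/hh-rlhf on Hugging Face) with the `datasets` library.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")
example = hh[0]
# Each example pairs a preferred and a rejected assistant response.
print(example["chosen"][:200])
print(example["rejected"][:200])
```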
Google DeepMind:
Publishing their model evals for dangerous capabilities and sharing resources for reproducing some of them
Releasing Gemma SAEs
Releasing Gemma weights (see the loading sketch after this list)
(External mentoring, in particular via MATS)
[No fine-tuning or deep access to frontier models]
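The Gemma releases above are the kind of artifact external researchers can load directly. A minimal sketch, assuming you have accepted the Gemma license and authenticated with Hugging Face; the checkpoint and prompt are illustrative.

```python
# Hedged sketch: loading released Gemma weights with transformers.
# Assumes the Gemma license has been accepted and `huggingface-cli login` run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # one of the released Gemma checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Sparse autoencoders decompose activations into"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The released Gemma SAEs are likewise distributed as standalone weight files on Hugging Face.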
OpenAI:
OpenAI Evals
Superalignment Fast Grants
Maybe giving better API access to some OP grantees
Fine-tuning GPT-3.5 (and “GPT-4 fine-tuning is in experimental access”; OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023)
Update: GPT-4o fine-tuning
Early access: sharing GPT-4 with a few safety researchers (including Rachel Freedman) before release
API gives top 5 logprobs
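A hedged sketch of what two of the access types above look like through the API: creating a GPT-3.5 fine-tuning job and requesting top-5 logprobs. The training file name and prompt are illustrative; this assumes the v1 Python SDK and an OPENAI_API_KEY in the environment.

```python
# Hedged sketch: GPT-3.5 fine-tuning and top-5 logprobs via the OpenAI API.
from openai import OpenAI

client = OpenAI()

# Fine-tuning: upload a JSONL file of chat-formatted examples, then start a job.
train_file = client.files.create(
    file=open("train_examples.jsonl", "rb"),  # illustrative local file
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)

# Logprobs: the chat completions API returns per-token top-k alternatives.
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hi."}],
    logprobs=True,
    top_logprobs=5,
)
first_token = resp.choices[0].logprobs.content[0]
print(first_token.token, [t.token for t in first_token.top_logprobs])
```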
Meta AI:
Releasing Llama weights
Microsoft:
[Nothing]
xAI:
[Nothing]
Related papers:
Structured access for third-party research on frontier AI models (Bucknall and Trager 2023)
Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024)
(The paper is about audits, like for risk assessment and oversight; this post is about research)
A Safe Harbor for AI Evaluation and Red Teaming (Longpre et al. 2024)
Structured Access (Shevlane 2022)
“Helpful-only” refers to a version of the model trained (via RLHF, RLAIF, fine-tuning, etc.) for helpfulness but not harmlessness.
Releasing model weights will likely be dangerous once models are more powerful. All past releases seem fine, but e.g. Meta’s poor risk assessment and lack of a plan to make release decisions conditional on risk assessment are concerning.
And funding an unspecified amount of grants via the Frontier Model Forum.