AI labs can boost external safety research

Zach Stein-Perlman31 Jul 2024 19:30 UTC

31 points

1 comment1 min readLW link

Frontier AI labs can boost external safety researchers by

Sharing better access to powerful models (early access, fine-tuning, helpful-only,^[1] filters/moderation-off, logprobs, activations)^[2]
Releasing research artifacts besides models
Publishing (transparent, reproducible) safety research
Giving API credits
Mentoring

Here’s what the labs have done (besides just publishing safety research^[3]).

Anthropic:

Releasing resources including RLHF and red-teaming datasets, an interpretability notebook, and model organisms prompts and transcripts
Supporting creation of safety-relevant evals and tools for evals
Giving free API access to some OP grantees and giving some researchers $1K (or sometimes more) in API credits
(Giving deep model access to Ryan Greenblatt)
(External mentoring, in particular via MATS)
[No fine-tuning or deep access, except for Ryan]

Google DeepMind:

Publishing their model evals for dangerous capabilities and sharing resources for reproducing some of them
Releasing Gemma SAEs
Releasing Gemma weights
(External mentoring, in particular via MATS)
[No fine-tuning or deep access to frontier models]

OpenAI:

OpenAI Evals
Superalignment Fast Grants
Maybe giving better API access to some OP grantees
Fine-tuning GPT-3.5 (and “GPT-4 fine-tuning is in experimental access”; OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023)
- Update: GPT-4o fine-tuning
Early access: shared GPT-4 with a few safety researchers including Rachel Freedman before release
API gives top 5 logprobs

Meta AI:

Releasing Llama weights

Microsoft:

[Nothing]

xAI:

[Nothing]

Related papers:

Structured access for third-party research on frontier AI models (Bucknall and Trager 2023)
Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024)
- (The paper is about audits, like for risk assessment and oversight; this post is about research)
A Safe Harbor for AI Evaluation and Red Teaming (Longpre et al. 2024)
Structured Access (Shevlane 2022)

^
“Helpful-only” refers to the version of the model RLHFed/RLAIFed/finetuned/whatever for helpfulness but not harmlessness.
^
Releasing model weights will likely be dangerous once models are more powerful, but all past releases seem fine, but e.g. Meta’s poor risk assessment and lack of a plan to make release decisions conditional on risk assessment is concerning.
^
And an unspecified amount of funding Frontier Model Forum grants.

What links here?

Zach Stein-Perlman31 Jul 2024 19:30 UTC

31 points

1 comment1 min readLW link