AI labs can boost external safety research

Frontier AI labs can boost external safety researchers by

  • Sharing better access to powerful models (early access, fine-tuning, helpful-only,[1] filters/moderation-off, logprobs, activations)[2]; see the logprobs sketch after this list

  • Releasing research artifacts besides models

  • Publishing (transparent, reproducible) safety research

  • Giving API credits

  • Mentoring

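For concreteness, here is what logprobs access looks like from the researcher's side. This is a minimal sketch using the OpenAI Python SDK; the model name and prompt are placeholders, and other labs' APIs expose similar options.

```python
# Minimal sketch of logprobs access via the OpenAI Python SDK.
# Model name and prompt are placeholders; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
    max_tokens=1,
    logprobs=True,      # return log-probabilities for the sampled tokens
    top_logprobs=5,     # plus the 5 most likely alternatives at each position
)

# Inspect the model's distribution over the first output token.
for alt in resp.choices[0].logprobs.content[0].top_logprobs:
    print(f"{alt.token!r}: {alt.logprob:.3f}")
```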

Here’s what the labs have done (besides just publishing safety research[3]).

Anthropic:

Google DeepMind:

  • Publishing their model evals for dangerous capabilities and sharing resources for reproducing some of them

  • Releasing Gemma SAEs

  • Releasing Gemma weights (see the loading sketch after this list)

  • (External mentoring, in particular via MATS)

  • [No fine-tuning or deep access to frontier models]

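As an example of how these artifacts get used downstream, here is a minimal sketch of loading the released Gemma weights with Hugging Face transformers. The checkpoint name is one of the released Gemma 2 models; it assumes you have accepted the Gemma license on the Hub and have `accelerate` installed for `device_map="auto"`. Open weights are also what the released Gemma SAEs attach to.

```python
# Minimal sketch: loading released Gemma weights with Hugging Face transformers.
# Assumes the Gemma license has been accepted on the Hub and `accelerate` is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"  # one of the released Gemma 2 checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Open-weight models let external researchers", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
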
OpenAI:

Meta AI:

Microsoft:

  • [Nothing]

xAI:

  • [Nothing]


Related papers:

  1. ^

    “Helpful-only” refers to the version of the model RLHFed/RLAIFed/fine-tuned/etc. for helpfulness but not harmlessness.

  2. ^

    Releasing model weights will likely become dangerous once models are more powerful, but all past releases seem fine. That said, Meta’s poor risk assessment and its lack of a plan to make release decisions conditional on risk assessment are concerning.

  3. ^

    And an unspecified amount of funding via Frontier Model Forum grants.
