AI Control

This is a collection of posts about AI Control, an approach to AI safety that relies on safety measures designed to prevent powerful AIs from causing unacceptably bad outcomes, even if those AIs are misaligned and intentionally try to subvert the measures.

These posts are useful for understanding the AI Control approach and its upsides and downsides, though they cover only a small fraction of the AI safety work relevant to AI Control.

The case for ensuring that powerful AIs are controlled

AI Control: Improving Safety Despite Intentional Subversion

Untrusted smart models and trusted dumb models

Catching AIs red-handed

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Auditing failures vs concentrated failures

Protocol evaluations: good analogies vs control

How useful is “AI Control” as a framing on AI X-Risk?

New report: Safety Cases for AI

Notes on control evaluations for safety cases

Toy models of AI control for concentrated catastrophe prevention