ryan_greenblatt comments on Daniel Kokotajlo’s Shortform

ryan_greenblatt 24 Oct 2024 23:48 UTC
LW: 9 AF: 8
We discuss similar issues in meta-level adversarial evaluations of oversight.

The simplest approach is to have a human in the loop monitoring every action the AI takes, combined with targeting a proxy failure rather than a real failure. This can probably reduce the risk from running these evaluations to the point where it is a small fraction of overall risk.

Human monitoring might be too expensive, in which case we'll have to rely on sandboxing and targeting a pure proxy failure.