The most simple approach is just to have a human in the loop monitoring every action the AI does in combination with targeting a proxy failure rather than a real failure. This probably can reduce risk to a point where this is a small fraction of overall risk.
Human monitoring might be too expensive, in which case we’ll have to go with sandboxing and pure proxy.
We discuss similar issues in meta-level adversarial evaluations of oversight.
The most simple approach is just to have a human in the loop monitoring every action the AI does in combination with targeting a proxy failure rather than a real failure. This probably can reduce risk to a point where this is a small fraction of overall risk.
Human monitoring might be too expensive, in which case we’ll have to go with sandboxing and pure proxy.