Roughly speaking, the hope is to make is so the AI can’t do something very problematic without having a very high chance of being caught (or stopped). This doesn’t guarantee catching it (as the AI could just never end up trying to do something very problematic), but at least can (if it works) avoid problems with this AIs.
We do explore this type of setting in prior work, e.g. here.
Roughly speaking, the hope is to make is so the AI can’t do something very problematic without having a very high chance of being caught (or stopped). This doesn’t guarantee catching it (as the AI could just never end up trying to do something very problematic), but at least can (if it works) avoid problems with this AIs.
We do explore this type of setting in prior work, e.g. here.