This concern is why we are interested in AI control.
Have there been any proposals for detecting alignment-faking LLMs in the AI control literature?
Roughly speaking, the hope is to make it so the AI can't do something very problematic without having a very high chance of being caught (or stopped). This doesn't guarantee catching the AI (it might simply never attempt anything very problematic), but it can at least (if it works) avoid problems with these AIs.
We do explore this type of setting in prior work, e.g. here.
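To make that a bit more concrete, here is a minimal sketch (in Python) of the kind of monitoring-style protocol that has this property. The function names and the audit threshold are illustrative assumptions, not taken from the comment above or any specific paper:

```python
# Illustrative sketch (hypothetical names/threshold): every action proposed by
# the untrusted model is scored by a trusted monitor, and sufficiently
# suspicious actions are blocked and escalated for auditing instead of being
# executed. An attempted bad action therefore mostly converts into a chance
# of being caught or stopped.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    executed: bool
    audited: bool
    suspicion: float


def run_with_monitoring(
    untrusted_policy: Callable[[str], str],        # proposes an action for a task
    trusted_monitor: Callable[[str, str], float],  # suspicion score in [0, 1]
    execute: Callable[[str], None],
    audit: Callable[[str, str], None],
    task: str,
    audit_threshold: float = 0.95,
) -> Verdict:
    action = untrusted_policy(task)
    suspicion = trusted_monitor(task, action)
    if suspicion >= audit_threshold:
        # Don't execute; send to (e.g. human) auditing. If the audit confirms
        # an attempted bad action, the AI has been "caught".
        audit(task, action)
        return Verdict(executed=False, audited=True, suspicion=suspicion)
    execute(action)
    return Verdict(executed=True, audited=False, suspicion=suspicion)
```

The point is just structural: the untrusted policy never gets to execute an action the trusted monitor flags as highly suspicious, so either the AI behaves, or its attempts at very problematic actions come with a high probability of being caught (or at least blocked).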