That would make the domain of checkable tasks rather small.
That said, it may not matter depending on the capability you want to measure.
If you want to make the AI hack a computer to turn the entire screen green and it skips a pixel so as to avoid completing the task, well it would have still demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.
On the other hand, if you are trying to see if it has a capability that you wish it use, it can still sandbag.
That would make the domain of checkable tasks rather small.
That said, it may not matter depending on the capability you want to measure.
If you want to make the AI hack a computer to turn the entire screen green and it skips a pixel so as to avoid completing the task, well it would have still demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.
On the other hand, if you are trying to see if it has a capability that you wish it use, it can still sandbag.