I think this might still lead to similar problems. For example this could cause an issue in the case where the diamond has been stolen but the AI believes humans would not be able to tell even if they physically entered the vault and checked, e.g the diamond has been replaced with a very convincing fake. In this case the AI might still say “I want to be rewarded if the diamond is still present” since it knows humans won’t be able to tell the difference.
I think this might still lead to similar problems. For example this could cause an issue in the case where the diamond has been stolen but the AI believes humans would not be able to tell even if they physically entered the vault and checked, e.g the diamond has been replaced with a very convincing fake. In this case the AI might still say “I want to be rewarded if the diamond is still present” since it knows humans won’t be able to tell the difference.