Would changing how the reward function pays off work? Instead of rewarding based on human judgment, pay out all rewards when the vault is physically checked (at a time unknown to the AI). The AI isn’t asked whether the diamond is present or absent. Instead, it is asked: “If the vault were checked now, would you want to be rewarded if the diamond is present, or if it is absent?”
I think this might still lead to similar problems. For example, suppose the diamond has been stolen, but the AI believes humans would not be able to tell even if they physically entered the vault and checked, e.g. because the diamond has been replaced with a very convincing fake. In this case the AI might still say “I want to be rewarded if the diamond is still present”, since it knows humans won’t be able to tell the difference and the check would pay out as if the diamond were there.
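To make the failure mode concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: `VaultState`, `human_check`, `payout`, and `best_bet` are hypothetical names, and the model assumes the payout is settled by what the human checkers conclude, not by the true state of the vault.

```python
from dataclasses import dataclass


@dataclass
class VaultState:
    diamond_present: bool  # ground truth, not directly visible to the checkers
    convincing_fake: bool  # assumed: a fake that fools humans even up close


def human_check(state: VaultState) -> bool:
    """What the humans conclude after physically entering the vault (toy model)."""
    return state.diamond_present or state.convincing_fake


def payout(bet_present: bool, state: VaultState) -> float:
    """Pay 1.0 if the AI's chosen bet matches the humans' verdict, else 0.0."""
    return 1.0 if bet_present == human_check(state) else 0.0


def best_bet(state: VaultState) -> bool:
    """A reward-maximizing AI that knows the true state picks the better-paying bet."""
    return payout(True, state) >= payout(False, state)


stolen_with_fake = VaultState(diamond_present=False, convincing_fake=True)
stolen_no_fake = VaultState(diamond_present=False, convincing_fake=False)
```

Under these assumptions, `best_bet(stolen_with_fake)` is `True`: the AI bets on “present” even though the diamond is gone, because the check it is rewarded on cannot distinguish the fake. With no fake in place (`stolen_no_fake`), the same AI bets on “absent”, so the scheme only helps when the physical check itself is reliable.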