If the predictor AI is in fact imitating what humans would do, why wouldn’t it throw its hands up at an actuator sequence that is too complicated for humans—isn’t that what humans would do? (I’m referring to the protect-the-diamond framing here.)
As described in the report, it would say “I’m not sure” whenever the human wasn’t sure (unless you penalized that).
That said, a human who looks at a sequence of actions would often say “almost certainly the diamond is there.” They might change their answer if you also told them “by the way, these actions came from a powerful adversary trying to get you to think the diamond is there.” What exactly the reporter says will depend on details such as how the reporter reasons about provenance.
But the main point is that in no case do you get useful information about examples where a human (with AI assistants) couldn’t have figured out what was happening on their own.