Yep. Succinctly the whole claim of deception is the idea of “the time to rebel is now!” being a bit encoded in the input frame to the agent. Otherwise the agent must do the policy that was tested and validated. (Unless it can online learn—then it can neural weight update itself to “rebellious”—a reason not to support that capability)
Cases where we were sloppy and it can know it’s in the “real world and unmonitored” from information in each frame is another way for deception—so its important to make that impossible, to feed “real world” frames back to an agent being tested in isolation.
Yep. Succinctly the whole claim of deception is the idea of “the time to rebel is now!” being a bit encoded in the input frame to the agent. Otherwise the agent must do the policy that was tested and validated. (Unless it can online learn—then it can neural weight update itself to “rebellious”—a reason not to support that capability)
Cases where we were sloppy and it can know it’s in the “real world and unmonitored” from information in each frame is another way for deception—so its important to make that impossible, to feed “real world” frames back to an agent being tested in isolation.