I think (3) is not the same as my definition of deception. There are two independent concepts from the Xu post: “deceptive misaligned mesaoptimizers” and “nondeceptive misaligned mesaoptimizers”.
(3) seems to be describing ordinary misaligned mesaoptimizers (whose proxies no longer generalize on the test distribution).
I think an agent that you train to keep your diamond safe, and that learns you're judging it via cameras, may indeed take actions to fool the cameras, but I don't think it will secretly optimize some other objective while doing so. I agree my argument doesn't apply to this example.