My thinking is not very clear on this point, but I am generally pessimistic that any RL/optimization regime with an adversarial nature could be robust to self-aware agents. It seems to me that adversarial methodologies could spawn opposing mesa-optimizers, leaving us at the mercy of whichever subsystem models its own optimization process well enough to squash the other.