My thinking is not very clear on this point, but I am generally pessimistic that any RL/optimization regime with an adversarial nature could be robust to self-aware agents. It seems to me that adversarial methodologies could spawn opposing mesa-optimizers, leaving us at the mercy of whichever subsystem models its own optimization process well enough to squash the other.