I like that this post clearly lays out reasons why we might expect deception (and similar dynamics) to be not just possible, in the sense of earning the same training reward as the honest alternatives, but to actually earn higher reward than them. This raises my probability of those scenarios.