I think the collusion concern basically over-anthropomorphizes the training process. Say, in the prisoner’s dilemma, if you train myopically, then “all incentives point toward defection” translates concretely into actual defection.
Granted, there are training regimes in which this doesn’t happen, but those would have to be avoided.
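To make the myopic case concrete, here's a minimal sketch (the payoff values, learning setup, and numbers are my own illustrative assumptions, not anything from the post): two independent learners, each maximizing only its own single-round expected payoff, get driven toward defection.

```python
# Minimal sketch: two independent, myopic learners on the one-shot
# prisoner's dilemma. Standard payoffs (to the row player):
#   both cooperate -> 3, both defect -> 1, lone defector -> 5, lone cooperator -> 0.
# Each agent does gradient ascent on its OWN single-round expected payoff only.
import math

def expected_payoff(p_me, p_other):
    """Expected payoff to 'me', given both agents' cooperation probabilities."""
    return (3 * p_me * p_other
            + 0 * p_me * (1 - p_other)
            + 5 * (1 - p_me) * p_other
            + 1 * (1 - p_me) * (1 - p_other))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

theta = [0.0, 0.0]   # each agent's logit of P(cooperate); start at 50/50
lr, eps = 0.1, 1e-4
for _ in range(2000):
    p = [sigmoid(t) for t in theta]
    for i in (0, 1):
        # finite-difference gradient of agent i's own payoff w.r.t. its own logit
        up = expected_payoff(sigmoid(theta[i] + eps), p[1 - i])
        down = expected_payoff(sigmoid(theta[i] - eps), p[1 - i])
        theta[i] += lr * (up - down) / (2 * eps)

print([round(sigmoid(t), 3) for t in theta])  # both near 0: mutual defection
```

Since defection strictly dominates, each agent's own-payoff gradient always points away from cooperating, so the myopic update lands on defect-defect regardless of what the other agent is doing.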
OTOH, the concern might be that an inner optimizer that colludes would develop. That would have to be dealt with by more general anti-inner-optimizer technology.
I don’t know if you’ve seen our most recent debate rules and our attempted analysis of whether they produce the desired behavior; it seems somewhat relevant to what you’re thinking about here.
Yep, I should take a look!