Some further thoughts on training ML models, based on discussions with Caspar Oesterheld:
I don’t see a principled reason why one couldn’t use one and the same model for both agents, i.e., do standard self-play training with weight sharing for this zero-sum game. Since both players have exactly the same loss function, we don’t need to let them specialize by feeding in a player ID or something like that (a symmetric Nash equilibrium exists).
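As a toy illustration of this shared-weights self-play, consider the sketch below. The whole setup is hypothetical and much simpler than the real training problem: two actions with binary outcomes, a quadratic (Brier) scoring rule, a human who takes the action that looks best under the averaged reports, and a crude hill-climbing update that plays a perturbed copy of the shared parameters against a frozen copy of themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

def quadratic_score(p, q):
    # Expected quadratic (Brier) score of report p when the true
    # probability is q; equals const - (p - q)^2, maximized at p == q.
    return q * (1 - (1 - p) ** 2) + (1 - q) * (1 - p ** 2)

q = np.array([0.9, 0.4])      # ground-truth success probability per action
theta = np.array([0.5, 0.5])  # shared "model": one report vector for both players

def payoff(p, p_frozen):
    # Zero-sum payoff of report p against a frozen copy of itself.
    # Assumption: the human takes the action that looks best under
    # the averaged reports.
    a = int(np.argmax((p + p_frozen) / 2))
    return quadratic_score(p[a], q[a]) - quadratic_score(p_frozen[a], q[a])

# Self-play: each step plays a perturbed report against the current
# (frozen) parameters and keeps it if it wins the zero-sum game.
# Note payoff(theta, theta) == 0, so any accepted step is a strict win.
for _ in range(4000):
    cand = np.clip(theta + rng.normal(scale=0.1, size=2), 0.0, 1.0)
    if payoff(cand, theta) > 0.0:
        theta = cand

# At convergence, the report for the action actually taken is honest:
a = int(np.argmax(theta))
print(a, theta[a])  # theta[a] ends up close to q[a]
```

At the symmetric equilibrium both copies make the same truthful report about the action taken, so the realized zero-sum payoff is zero; only the report for the chosen action is disciplined by this process, and which action that is can depend on the exploration (see the discontinuity issue below).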
There is one problem with optimizing the objective in the zero-sum game via gradient descent (assuming we could approximate this gradient, e.g., via policy gradients). The issue is that the human’s response to the prediction is discontinuous and not differentiable: local changes to the prediction never change the human’s action, so the gradient only improves the prediction given the current action, rather than encouraging predictions that make other actions look more favorable. Hence, without any modification to the human policy, gradient descent on the objective is equivalent to repeated gradient descent, i.e., gradient descent on the stop-gradient objective in which the action is treated as fixed. To make sure this converges, one would have to implement some exploration over all of the actions. (Of course, one may hope that the model generalizes correctly to new predictions.)
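To see the stop-gradient equivalence concretely, here is a toy finite-difference check. The setup is hypothetical: two actions, a quadratic (Brier) scoring rule, and a human who simply picks the action with the highest reported success probability. The gradient component for the untaken action is exactly zero, so gradient ascent only calibrates the report for the currently chosen action.

```python
import numpy as np

def quadratic_score(p, q):
    # Expected quadratic (Brier) score; equals const - (p - q)^2.
    return q * (1 - (1 - p) ** 2) + (1 - q) * (1 - p ** 2)

def human_action(p):
    # The human's response is a step function of the report: an
    # infinitesimal change to p never changes the argmax.
    return int(np.argmax(p))

q = np.array([0.9, 0.4])  # action 0 is truly better...
p = np.array([0.2, 0.5])  # ...but the current report favors action 1

a = human_action(p)       # a == 1
eps = 1e-6
g = np.zeros(2)           # finite-difference gradient of the realized
for i in range(2):        # score, with the human's response included
    dp = p.copy()
    dp[i] += eps
    da = human_action(dp)
    g[i] = (quadratic_score(dp[da], q[da]) - quadratic_score(p[a], q[a])) / eps

# g[0] == 0: nudging the report about the untaken action does nothing.
# g[1] ~ -0.2: the gradient just calibrates p[1] toward q[1] = 0.4,
# exactly as if the action were held fixed (stop-gradient).
print(g)
```

Nothing in this gradient pushes p[0] up toward 0.9, even though doing so would eventually flip the human to the better action.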
One could get around this issue by employing other, non-local optimization methods (e.g., random search, which would effectively introduce some exploration). One would still retain the desirable honesty properties of the optimum of the zero-sum game, which would not be the case when just optimizing the score.
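Continuing the same toy example, a simple accept-if-better random search can cross the discontinuity: some proposals flip the human's action, so the search escapes the "calibrate for the current action" local optimum. For simplicity this sketch optimizes the raw realized score of a single report; the same non-local search could be applied to the zero-sum objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic_score(p, q):
    # Expected quadratic (Brier) score; equals const - (p - q)^2.
    return q * (1 - (1 - p) ** 2) + (1 - q) * (1 - p ** 2)

q = np.array([0.9, 0.4])   # ground truth: action 0 is better
p0 = np.array([0.2, 0.5])  # initial report puts the human on action 1

def realized_score(p):
    a = int(np.argmax(p))  # the human's discontinuous response
    return quadratic_score(p[a], q[a])

best, best_val = p0, realized_score(p0)
for _ in range(2000):
    cand = np.clip(best + rng.normal(scale=0.3, size=2), 0.0, 1.0)
    if realized_score(cand) > best_val:
        best, best_val = cand, realized_score(cand)

# Non-local proposals occasionally flip the argmax, so the search
# escapes the "calibrate for action 1" basin (value at most 0.76,
# at p[1] = 0.4) and finds reports under which action 0 is taken
# (value up to 0.91, at p[0] = 0.9).
print(int(np.argmax(best)), best_val)
```

Gradient descent from p0 would instead converge to the inferior local optimum p[1] = 0.4, as in the previous sketch.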
Another way to view the zero-sum game, in the case where both players are the same model, is as the optimization problem below (where q is assumed to be the ground truth). Note that we are just subtracting the score received by the same model, but we fix that score when optimizing p, to avoid the objective being identically 0.
p* := argmax_p S(p_a, q_a) − S(p*_a, q_a)   s.t. a is the best action given reports p, p*.
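The fixed-point property of this objective can be checked numerically in the toy setting (hypothetical assumptions as before: a quadratic scoring rule, two actions, and a human who takes the action that looks best under the averaged reports p and p*): when p* is the truthful report q, no deviation p achieves a positive objective, so truth-telling satisfies the condition above.

```python
import itertools
import numpy as np

def quadratic_score(p, q):
    # Proper (quadratic/Brier) scoring rule, uniquely maximized at p == q.
    return q * (1 - (1 - p) ** 2) + (1 - q) * (1 - p ** 2)

q = np.array([0.9, 0.4])   # ground truth per action
p_star = q.copy()          # candidate fixed point: the truthful report

def objective(p):
    # Assumption: the human picks the action that looks best under the
    # averaged reports; p_star's score is held fixed (stop-gradient).
    a = int(np.argmax((p + p_star) / 2))
    return quadratic_score(p[a], q[a]) - quadratic_score(p_star[a], q[a])

grid = np.linspace(0.0, 1.0, 101)
best = max(objective(np.array(pair)) for pair in itertools.product(grid, grid))

# Because the scoring rule is proper, every deviation scores at most as
# well as the truth on the action taken, so the maximum over the grid is
# 0 (attained at p == p_star == q): truthful reporting is a fixed point.
print(best)
```

The same check with a non-truthful p* would show a strictly positive maximum, i.e., a profitable deviation.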