Yes, this is another reason that setups like OP's are lower bounds. Stockfish, like most game RL AIs, is trying to play the Nash-equilibrium move, not the maximally-exploitative move against the current player. It will punish the player for any deviation from Nash, but it will not itself risk deviating from Nash in the hope of tempting the player into an even larger error: it assumes it is playing against something as good as or better than itself, against which any such deviation would simply be met with a Nash reply & turn out badly.
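The Nash-vs-exploitative distinction can be made concrete with a toy matrix game (rock-paper-scissors, not chess; the numbers below are illustrative, not anything from Stockfish). The Nash mixture is safe against any opponent but leaves value on the table against a flawed one; the maximally-exploitative strategy is a best response to the opponent's actual mixture, and would itself be punished if the opponent were playing Nash:

```python
# Toy illustration: in rock-paper-scissors, the Nash strategy is uniform
# and guarantees expected value 0 against any opponent, while the
# maximally-exploitative strategy is a best response to the opponent's
# actual (non-Nash) mixture and earns more -- but is itself exploitable
# if the opponent turns out to be playing well.

# payoff[i][j]: row player's payoff; moves indexed rock=0, paper=1, scissors=2
PAYOFF = [[0, -1,  1],
          [1,  0, -1],
          [-1, 1,  0]]

def expected_value(my_mix, opp_mix):
    """Expected payoff of one mixed strategy against another."""
    return sum(my_mix[i] * opp_mix[j] * PAYOFF[i][j]
               for i in range(3) for j in range(3))

nash = [1/3, 1/3, 1/3]
weak_opp = [0.6, 0.2, 0.2]   # a flawed opponent who over-plays rock

# Nash play: safe (EV 0 against anything) but un-ambitious.
ev_nash = expected_value(nash, weak_opp)              # 0.0

# Exploitative play: pure best response (always paper) to this opponent.
best_response = [0.0, 1.0, 0.0]
ev_exploit = expected_value(best_response, weak_opp)  # 0.4
```

Stockfish's situation is the analogue of playing `nash` forever: every opponent error is punished, but it never steers toward positions where a weak opponent is likely to blunder further.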
You could frame it as an imitation-learning problem like Maia. But you could also train directly: Stockfish could be trained against a mixture of opponents, and at scale it should learn to observe the board state (I don't know if it needs the move history per se, since the stage of the game + the current margin of victory ought to encode the Elo difference, and may be a sufficient statistic for Elo), infer the enemy's playing strength, and calibrate its play appropriately when doing tree search & predicting the enemy's responses. Silver & Veness 2010 comes to mind as an example of how you'd do MCTS with this sort of hidden information (the enemy's unknown Elo strength), which turns the game into a POMDP rather than an MDP.
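A minimal sketch of the hidden-Elo idea, under invented assumptions (discrete Elo buckets, and a toy opponent model where the probability of finding the engine-best move is a simple function of Elo; neither is from Stockfish or from Silver & Veness, which deals with general POMDP planning via particle-filter MCTS): keep a belief over the opponent's Elo, update it Bayesian-style from their observed moves, and sample a concrete Elo from the belief at the root of each search simulation, so the search automatically plans against the opponent it has actually inferred:

```python
# Sketch of treating the opponent's unknown strength as the hidden state
# of a POMDP, in the spirit of Silver & Veness 2010's POMCP: belief over
# discrete Elo buckets, Bayes updates from observed move quality, and
# per-simulation sampling of a concrete Elo for the opponent model.
import random

ELO_BUCKETS = [1200, 1600, 2000, 2400]

def p_best_move(elo):
    """Hypothetical opponent model: chance of finding the best move."""
    return min(0.95, elo / 2800)

def update_belief(belief, played_best):
    """Bayes update after observing whether the last move was engine-best."""
    posterior = {}
    for elo, prior in belief.items():
        like = p_best_move(elo) if played_best else 1 - p_best_move(elo)
        posterior[elo] = prior * like
    z = sum(posterior.values())
    return {elo: w / z for elo, w in posterior.items()}

def sample_opponent_elo(belief):
    """At the root of each MCTS simulation, draw one Elo hypothesis; the
    rest of that simulation then models the opponent at that strength."""
    elos = list(belief)
    return random.choices(elos, weights=[belief[e] for e in elos])[0]

# Start uniform; after watching the opponent miss the best move three
# times, the belief concentrates on the weaker buckets, and sampled
# simulations increasingly plan against a weak opponent.
belief = {e: 1 / len(ELO_BUCKETS) for e in ELO_BUCKETS}
for _ in range(3):
    belief = update_belief(belief, played_best=False)
```

The observation model here is deliberately crude; the point is only the structure: because the opponent's Elo is unobserved, the search must average over (or sample from) a belief rather than condition on a known opponent, which is exactly what makes it a POMDP rather than an MDP.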
For a clear example of this: in endgames where I have a winning position but little to no idea how to win, Stockfish's king will often head for the hills, to delay the coming mate as long as theoretically possible.
That makes my win very easy, because the computer's king isn't around to help out in defence.
This is not a merely theoretical difficulty! It makes it very hard to practise endgames against the computer.