Thank you for doing the experiment. Someone could run a similar set of tests for Go.
Go has an advantage here of much greater granularity in handicapping. Handicapping with pieces isn’t used as much in chess as it is in Go because, well, there are so few pieces, on such a small board, for a game lasting so few moves, that each removed piece is both a large difference and changes the game qualitatively. I wouldn’t want to study chess at all at this point as a RL testbed: there are better environments, which are cleaner to tweak, cheaper to run, more realistic/harder, have oracles, or something else; chess is best at nothing at this point (unless you are interested in chess or history of AI, of course).
Also, it’s worth noting that these piece-disadvantage games are generally way out of distribution / off-policy for an agent like Stockfish: AFAIK, the Stockfish project (and all other chess engine projects, for that matter) does not spend a (or any?) meaningful amount of training on extreme handicap scenarios like ‘what if I somehow started the game missing a knight’ or ‘what if my queen just wasn’t there somehow’ or ‘somehow, Palpatine’s piece returned’. (So there’s a similar problem here as with the claims that humans are still champs at correspondence chess: since the chess engines are not designed in any way or trained for correspondence time-controls, simply using a chess engine ‘out of the box’ designed for normal time controls provides only a lower bound on how good a correspondence chess engine would be.) Putting the human on the piece-advantage side means that the human is advantaged much more than just the piece, because they can play like normal. It would be more meaningful to put Stockfish on both sides (and much easier time-wise; and could yield as large a sample size as one wants; and let one calculate things like ‘how many additional move evaluations / thinking-time is necessary to match a piece-advantage’, which would be particularly relevant in this DL scaling context & should look like Jones 2020, which would help you model scenarios like ‘what if Stockfish played a Stockfish-minus-a-queen which used 100x the compute to train and used that same 100x compute at runtime as well?’).
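In code, that Stockfish-vs-Stockfish measurement might look something like the following sketch (assuming python-chess and a local Stockfish binary; the FEN, node budgets, and helper names are illustrative, not a tested setup):

```python
# Rough sketch of the proposed experiment: give one side queen odds plus N-times
# the search nodes, and sweep N to find where the handicapped side scores ~50%,
# analogous to the train/test compute trade-off curves in Jones 2020.
import chess
import chess.engine

# White starts without its queen (illustrative handicap position).
QUEEN_ODDS_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNB1KBNR w KQkq - 0 1"

def play_game(white_engine, black_engine, white_nodes, black_nodes, fen=QUEEN_ODDS_FEN):
    """Play one game; White is the handicapped side. Returns White's score (1, 0.5, or 0)."""
    board = chess.Board(fen)
    while not board.is_game_over():
        engine, nodes = (white_engine, white_nodes) if board.turn == chess.WHITE else (black_engine, black_nodes)
        result = engine.play(board, chess.engine.Limit(nodes=nodes))
        board.push(result.move)
    outcome = board.outcome()
    if outcome.winner is None:
        return 0.5
    return 1.0 if outcome.winner == chess.WHITE else 0.0

def score_at_multiplier(multiplier, games=100, base_nodes=10_000):
    """Average score of the queen-odds side when given `multiplier`x its opponent's node budget."""
    white = chess.engine.SimpleEngine.popen_uci("stockfish")
    black = chess.engine.SimpleEngine.popen_uci("stockfish")
    try:
        scores = [play_game(white, black, int(base_nodes * multiplier), base_nodes)
                  for _ in range(games)]
    finally:
        white.quit()
        black.quit()
    return sum(scores) / games

# Sweep multipliers (1x, 10x, 100x, ...) and interpolate where the score crosses 0.5.
```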
This off-policy problem is also why, in the DM/Kramnik chess-variant investigations with AlphaZero, they have to train the AZ agent from scratch for each variant: the models need to learn the new game and can’t just be the standard AZ agent off the shelf. And these variants don’t even remove any pieces—they’re just small tweaks like permitting self-capture or forbidding castling within the first 10 moves, but they still span a 4-percentage-point range in winrates for White (57% in Torpedo to 53% in Pawn-back).
My prediction (see also my discussion of temporal scaling laws & preliminary results in Hilton et al 2023) would be that Go would show less ‘intrinsic material advantage’ for worse players compared to chess, because it has longer games & larger boards, which allow greater scope of empowerment in space & time, and allow the better player to claw their way back from initial disadvantages, slightly superior move by slightly superior move, ruthlessly exploiting all errors, and compounding into certain victory just as time runs out. (In this respect, of course, Go is more like the real world than is chess...)
and these variants don’t even remove any pieces—they’re just small tweaks like permitting self-capture or forbidding castling within the first 10 moves
You’re framing these as being closer to “regular” chess, but my intuition is the opposite. Most of the game positions that occur during a queen-odds game are rare but possible positions in a regular game; they are contained within the game tree of normal chess. I’m not sure about Stockfish in particular, but I’d expect many chess AIs incorporating machine learning would have non-zero experience with such positions (e.g. from early self-play runs when they were making lots of bad moves).
Positions permitting self-capture do not appear anywhere in that game tree and typical chess AIs are guaranteed to have exactly zero experience of them.
ETA: It also might affect your intuitions to remember that many positions Stockfish would never actually play will still show up in its tree search, requiring it to evaluate them at least accurately enough to know not to play them.
I disagree. By starting with impossible positions like a queen already being missing*, the game is already far out of the superhuman-level chess-game distribution which is defined by Stockfish. Stockfish will never blunder in the early game so badly as to lose a queen in a normal early-game position, even if it was playing God. I expect these to be positions that the Stockfish policy will never reach, not even with its weakest play of zero tree search & following deterministic argmax move choice. The only time Stockfish would ever reach such positions is if forced to by some external force like a player fiddling with settings or a strange training setup, or, like, a cosmic ray flipping some bits on the CPU. There might be some such blunders very early on in training which take it into such imbalanced very early positions, but those are still fairly different, and the final Stockfish is going to be millions (or at this point, billions) of games of training later and will have no idea how to handle some positions that near-random play produced eons ago and which have long since washed out. (After all, those will be the very stupidest and most incompetent games it ever played, so there is little value in holding onto them in any way. Most setups will erase old games pretty quickly, and certainly don’t hold onto games from the start.)
Whereas several of the changes Kramnik evaluated, like ‘Forbidding castling within the first 10 moves’, probably overlap to quite a considerable degree; what fraction of chess games, human expert or Stockfish, involve no castling in the first 10 moves and so accidentally fulfill that rule? Probably a pretty good chunk!
* even odds like knight-odds (where you can at least in theory construct the position during a game, by moving the knight out, capturing it with the other knight, and carefully moving the other knight back into its original position) have exactly zero probability of ever occurring in an on-policy game.
Several? I can see one (the one you cite). Some of the other variants—e.g., no castling at all, or pawns can’t move two squares on their first move—can lead to positions that also arise in normal chess. But having neither side castle at all is really unusual and most such positions will be well out of distribution; and it’s very common for some pawns to remain on the second rank all the way to the endgame, where the option of moving one or two squares can have important timing implications.
What do you think about the other corollary? At the upper end of play, does the number of stones required for a worse agent to equal the best agent shrink?
And we could plot out compute vs skill and estimate the number of stones needed for a particular skill level to have a 50 percent win rate against an agent with infinite compute. (Infinite compute just means it plays perfect moves, since it can factor in all possible continuations. This is an experiment we can run for solvable games like checkers, but for Go we can estimate the asymptote.)
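Something like the following sketch, say, where every number and the Elo-per-stone conversion are made up purely for illustration:

```python
# Rough sketch: fit a saturating Elo-vs-log-compute curve, read off the gap to the
# asymptote at a given compute budget, and convert that gap into handicap stones
# using an assumed Elo-per-stone value. Nothing here is measured data.
import numpy as np
from scipy.optimize import curve_fit

def elo_curve(log_compute, ceiling, scale, midpoint):
    """Saturating skill curve: Elo approaches `ceiling` as compute grows."""
    return ceiling / (1.0 + np.exp(-(log_compute - midpoint) / scale))

# Hypothetical measurements: log10(compute) vs measured Elo of a Go agent.
log_compute = np.array([1, 2, 3, 4, 5, 6], dtype=float)
elo = np.array([800, 1600, 2400, 3000, 3300, 3450], dtype=float)

params, _ = curve_fit(elo_curve, log_compute, elo, p0=[4000, 1.0, 2.0])
ceiling = params[0]  # estimated Elo of the 'infinite compute' (perfect-play) asymptote

ELO_PER_STONE = 100.0  # assumed conversion; the real value varies with skill level
for lc in log_compute:
    gap = ceiling - elo_curve(lc, *params)
    print(f"log10(compute)={lc:.0f}: ~{gap / ELO_PER_STONE:.1f} stones to break even vs perfect play")
```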
What do you think about the other corollary? At the upper end of play, does the number of stones required for a worse agent to equal the best agent shrink?
I mean, it would have to. There’s a ceiling there, while the game size and stones remain fixed units. If you have agents vastly below optimal play, they can differ a lot in absolute units, because they’re not the ceiling, and differ greatly in strategy as well. But the closer you approach God (as players like to put it), the smaller the equalizing material advantage must be, approaching zero. There is only one game tree.
Does this imply convergence for rampant AGI systems?
The “we are doomed” model assumes we will be defeated even when we have networks of superintelligent (ASI) systems restricted from hostile actions through essentially CAIS. This is where we subdivide large tasks into the smallest possible subtasks, define sparse schema to encode intermediate results, and have a separate session of an ASI handle each (subtask description, subtask context, intermediate schema from other results). Among other benefits, this prevents most deception and collusion, because the subtask context could plausibly be a draw from the training set and the ASI has no memory or state, so it can’t know it’s not still in training. (It’s stateless in that ASI_output = f(f(network architecture, weights), task description, task context, environment input set, RNG seed). It’s a functional system, and on the next time step you can switch out the network architecture and weights, if you wish, for a different model with similar capabilities. ASI_output updates the context.)
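As a minimal sketch of the statelessness I mean (none of these names are a real API, just illustration):

```python
# Each call is a pure function of the model, the task description, the context schema,
# the environment inputs, and an RNG seed, so the output cannot depend on any hidden
# memory carried over between steps.
from dataclasses import dataclass

@dataclass(frozen=True)
class SubtaskCall:
    task_description: str
    task_context: dict        # sparse schema assembled from other subtasks' results
    environment_inputs: dict  # sensor readings, files, etc. for this step only
    rng_seed: int

def run_subtask(model, call: SubtaskCall) -> dict:
    """Pure function: the same (model, call) always yields the same output; no state survives the call."""
    # `model.infer` is a placeholder for whatever inference interface the model exposes.
    output = model.infer(call.task_description, call.task_context,
                         call.environment_inputs, seed=call.rng_seed)
    return output  # used only to update the shared context schema for later subtasks

# Because nothing persists between calls, the next step can swap in a different
# architecture/weights with similar capabilities without the pipeline noticing.
```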
Anyway, such a network of systems will perform well, but what you are throwing away is bits of context in between the steps. For example, if the task is “make housing”, one subtask might design the overall shape and visual appearance, another might do the structural design and engineering plans, another might be an inspection to look for mistakes. Yet other subtasks would actually build the structure. Each subtask is a fresh, context-ignorant session that closes when a step is done, with all memory erased. For example, if constructing the building is subdividable into floors or individual girder attachments, those are separate subtasks. The same or a different model can be assigned to any given subtask; they need not share any lineage, and it makes sense to have the “inspection” subtasks done by a different lineage of base model.
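A rough sketch of that decomposition, reusing the hypothetical run_subtask/SubtaskCall from above (the subtask names, models, and floor/girder split are all illustrative):

```python
def build_housing(design_model, engineering_model, inspector_model, construction_model, seed=0):
    context = {"goal": "make housing"}

    # Each call below is an independent, memoryless session; only the schema in `context` flows between them.
    context["appearance"] = run_subtask(design_model,
        SubtaskCall("design overall shape and visual appearance", context, {}, seed))
    context["structure"] = run_subtask(engineering_model,
        SubtaskCall("produce structural design and engineering plans", context, {}, seed + 1))

    # Inspection handled by a separate model lineage, so errors (or collusion) are less likely to be shared.
    context["inspection"] = run_subtask(inspector_model,
        SubtaskCall("inspect the plans for mistakes", context, {}, seed + 2))

    # Construction subdivided further (floors, individual girder attachments, ...), one fresh session each.
    for i, step in enumerate(["floor 1", "floor 2", "roof"]):
        context[f"built:{step}"] = run_subtask(construction_model,
            SubtaskCall(f"construct {step} per approved plans", context, {}, seed + 3 + i))
    return context
```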
A single “context-aware model” doing all steps benefits, in theory, from having all of the bits of context for every step (in practice it has to stop considering bits from its context window in order to meet task-completion deadlines, especially during the robotics steps, but it chooses which bits to discard). So it performs better, but its gains are limited to the value of those marginal bits.
The way this relates to the chess problem is that the benefit of the marginal bits is finite. In the real world, being smarter has diminishing returns, and there exists a level of resource disparity at which even a smarter opponent has no possible path to victory.
This means that when it matters, if we have a rampant ASI system with armed robots guarding data centers, the overall task of “defeat the enemy” would be achievable, assuming the network of ASIs we use has more armed robots and other assets to work with.
We would not inevitably be defeated by the first unaligned ASI system to exist.
What do you think of this line of reasoning, gwern? You were correct about the scaling hypothesis, you are likely correct about many other things. Have you already written blog entries on this before?