It seems like a mistake to analogize a forward pass of the transformer to a human using external tools, if you want to make meaningful comparisons.
That may be, but it also seems to me like a mistake to use as your example a human who is untrained (or at least has had very little training), instead of a human whose training run has basically saturated the performance of their native architecture. Those people do, in fact, play blindfold chess, and are capable of tracking the board state perfectly without any external visual aid, while playing with a time control of ~1 minute per player per game (which, if we assume an average game length of 80 plies, i.e., 40 moves by each player, comes out to ~1.5 seconds per move).
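Spelling out that arithmetic (each player’s one-minute clock covers only their own 40 moves):

$$\frac{60 \text{ s per player}}{40 \text{ moves per player}} = 1.5 \text{ s per move}$$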
Of course, that comparison again becomes unfair in the other direction, since ChatGPT hasn’t been trained nearly as exhaustively on chess notation, whereas the people I’m talking about have dedicated their entire careers to the game. But I’d be willing to bet that even a heavily fine-tuned version of GPT-3 wouldn’t be able to play out a chess game of non-trivial length, while maintaining legality throughout the entire game, without needing to be re-prompted. (And that isn’t even getting into move quality, which I’d fully expect to be terrible no matter what.)
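For concreteness, here’s a minimal sketch of how such a bet could be scored automatically, using the python-chess library. `get_model_move` is a hypothetical stand-in for however you’d prompt the model for its next move in SAN; nothing else here comes from the original discussion:

```python
# Minimal harness: feed a model's moves (in SAN) into python-chess and count
# how many plies it survives before producing an illegal or unparseable move.
import chess

def plies_before_illegal(get_model_move, max_plies=200):
    """get_model_move is a hypothetical callable: chess.Board -> SAN string."""
    board = chess.Board()
    for ply in range(max_plies):
        if board.is_game_over():
            return ply  # reached a finished game with every move legal
        san = get_model_move(board)
        try:
            board.push_san(san)  # raises a ValueError subclass on bad SAN
        except ValueError:
            return ply  # number of legal plies before the first failure
    return max_plies
```

Conveniently, `push_san` both parses and legality-checks, so a single `except ValueError` covers illegal, ambiguous, and syntactically invalid moves alike.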
(No confident predictions about GPT-4 as of yet. My old models would have predicted a similar lack of “board vision” in GPT-4 as in GPT-3, but I trust those old models less, since Bing/Sydney has managed to surprise me in a number of ways.)
ETA: To be clear, this isn’t a criticism of language models. This whole task is trying to get them to do something that they’re practically architecturally designed to be bad at, so in some sense the mere fact that we’re even talking about this says very impressive things about their capabilities. And obviously, CNNs do the whole chess thing really, really well—easily on par with skilled humans, even without the massive boost offered by search. But CNNs aren’t general, and the question here is one of generality, you know?
I said that playing blindfold chess at 1s/move is “extraordinarily hard”; I agree that might be an overstatement and “extremely hard” might be more accurate. I also agree that humans don’t need “external” tools; I feel like the whole comparison will come down to arbitrary calls, like whether a human explicitly visualizing something or repeating a sound to themself is akin to an LM modifying its prompt, or whether our verbal loop is “internal” whereas an LM’s prompt is “external” and therefore shows that the AI is missing the special sauce.
Incidentally, I would guess that a 100B-parameter model trained on 100B chess games would learn to make only valid moves with accuracy similar to a trained human’s. But this wouldn’t affect my views about AI timelines.