The chess “board vision” task is extraordinarily hard for humans who are spending 1 second per token and not using an external scratchspace. It’s not trivial for an untrained human even if they spend multiple seconds per token. (I can do it only by using my visual field, e.g. it helps me massively to be looking at a blank 8 x 8 chessboard because it gives a place for the visuals to live and minimizes off-by-one errors.)
Humans would solve this prediction task by maintaining an external representation of the state of the board, updating that representation on each move, and then re-reading the representation each time before making a prediction. I think GPT-3.5 will also likely do this if asked to use external tools to make a prediction about the next move. (And of course when we actually play chess we just do it by observing the state of the board, as represented to us by the chess board or chess program, prior to making each move.)
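Concretely, a minimal sketch of that strategy might look like the following (assuming the third-party python-chess package; predict_next_move is a hypothetical stand-in for whatever model actually gets queried):

```python
# Sketch of the "external representation" strategy described above: keep an
# explicit board object, update it after every move, and re-read it before
# asking for each prediction.
# Assumes the third-party python-chess package (pip install python-chess);
# predict_next_move() is a hypothetical placeholder, not a real model call.

import chess


def predict_next_move(fen: str, legal_moves: list[str]) -> str:
    """Hypothetical predictor; a real one would query GPT-3.5 or similar."""
    return legal_moves[0]  # placeholder: just pick the first legal move


board = chess.Board()  # the external representation of the game state

for played in ["e4", "e5", "Nf3", "Nc6"]:
    board.push_san(played)  # update the representation on each move

# Re-read the representation before making a prediction.
fen = board.fen()
legal = [board.san(m) for m in board.legal_moves]
print("Current position:", fen)
print("Predicted next move:", predict_next_move(fen, legal))
```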
It seems like a mistake to analogize a forward pass of the transformer to a human using external tools, if you want to make meaningful comparisons.
You might learn something from such a test, but you wouldn’t learn much about how AI performance compares to human performance, or when AI might have a transformative impact.
That may be, but it also seems to me like a mistake to use as your example a human who is untrained (or at least has had very little training), instead of a human whose training run has basically saturated the performance of their native architecture. Those people do, in fact, play blindfold chess, and are capable of tracking the board state perfectly without any external visual aid, while playing with a time control of ~1 minute per player per game (at an average game length of 80 plies, i.e. ~40 moves per player, 60 seconds per player comes out to ~1.5 seconds per move).
Of course, that comparison again becomes unfair in the other direction, since ChatGPT hasn’t been trained nearly as exhaustively on chess notation, whereas the people I’m talking about have dedicated their entire careers to the game. But I’d be willing to bet that even a heavily fine-tuned version of GPT-3 wouldn’t be able to play out a chess game of non-trivial length, while maintaining legality throughout the entire game, without needing to be re-prompted. (And that isn’t even getting into move quality, which I’d fully expect to be terrible no matter what.)
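One way to operationalize that bet would be a harness that records how many plies a model survives before producing an illegal or unparseable move. A rough sketch, again assuming python-chess; model_move is a hypothetical stand-in for the fine-tuned model being tested:

```python
# Rough harness for the bet above: count how many plies a model can play
# before it emits an illegal or unparseable move.
# Assumes python-chess; model_move is a hypothetical callable that would
# prompt the fine-tuned LM with the SAN history of the game so far.

import chess
from typing import Callable


def plies_until_illegal(model_move: Callable[[list[str]], str],
                        max_plies: int = 200) -> int:
    """Return how many plies the model survives before breaking legality."""
    board = chess.Board()
    history: list[str] = []
    for ply in range(max_plies):
        candidate = model_move(history)  # model sees the SAN history so far
        try:
            board.push_san(candidate)  # raises ValueError if illegal/unparseable
        except ValueError:
            return ply
        history.append(candidate)
        if board.is_game_over():
            break
    return len(history)
```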
(No confident predictions about GPT-4 as of yet. My old models would have predicted a similar lack of “board vision” from GPT-4 as compared with GPT-3, but I trust those old models less, since Bing/Sydney has managed to surprise me in a number of ways.)
ETA: To be clear, this isn’t a criticism of language models. This whole task is trying to get them to do something that they’re practically architecturally designed to be bad at, so in some sense the mere fact that we’re even talking about this says very impressive things about their capabilities. And obviously, CNNs do the whole chess thing really, really well—easily on par with skilled humans, even without the massive boost offered by search. But CNNs aren’t general, and the question here is one of generality, you know?
I said that playing blindfold chess at 1s/move is “extraordinarily hard”; I agree that might be an overstatement and “extremely hard” might be more accurate. I also agree that humans don’t need “external” tools; I feel like the whole comparison will come down to arbitrary calls, like whether a human explicitly visualizing something or repeating a sound to themself is akin to an LM modifying its prompt, or whether our verbal loop is “internal” whereas an LM prompt is “external” and therefore shows that the AI is missing the special sauce.
Incidentally, I would guess that a 100B-parameter model trained on 100B chess games would learn to make only valid moves with accuracy similar to a trained human’s. But this wouldn’t affect my views about AI timelines.
My proposed experiment/test is trying not to analogize to humans, but rather to scope out the places where the AI can’t do very well. I’d like to avoid accidentally scoping the tests too narrowly. The test won’t work with an AI network whose weights are reset every time.
An alternative, albeit massively larger-scale, experiment might be:
Will a self-driving car ever be able to navigate from one end of a city to the other, using street signs and learning the streets just by exploring them?
A test of this might look like the following (a rough sketch of the evaluation loop follows the list):
1. Randomly generate a simulated city/town, complete with street signs and traffic.
2. Allow the self-driving car to explore the city of its own accord (or, if that is infeasible, feed the AI network the map of the city a few times before the target destinations are given).
3. Give the self-driving car target destinations. Can it navigate from one end of the city to the other using only street signs, with no GPS?
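As a sketch only, the evaluation loop might look something like this; everything about the simulator here (CitySim, its methods, the exploration budget) is purely hypothetical and just mirrors the three steps above:

```python
# Very rough sketch of the navigation test above. CitySim is a hypothetical
# simulator interface; a real experiment would supply an actual simulator
# and an actual self-driving agent.

import random
from typing import Protocol


class CitySim(Protocol):
    """Hypothetical simulator interface; every method here is an assumption."""
    def generate_city(self, seed: int) -> None: ...
    def free_exploration(self, agent: object, minutes: int) -> None: ...
    def drive_to(self, agent: object, destination: str) -> bool: ...


def evaluate(sim: CitySim, agent: object, destinations: list[str],
             seed: int = 0) -> float:
    """Fraction of destinations reached using street signs only (no GPS)."""
    sim.generate_city(seed)                  # step 1: random simulated city
    sim.free_exploration(agent, minutes=60)  # step 2: let the car explore on its own
    random.shuffle(destinations)
    reached = sum(sim.drive_to(agent, d) for d in destinations)  # step 3: sign-only navigation
    return reached / len(destinations)
```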
I think this kind of measurement would tell us how well our AI can handle open-endedness and help us understand where the gaps in our progress are, and I think a small-scale chess experiment like this would help shed light on bigger questions.
Just seems worth flagging that humans couldn’t do the chess test, and that there’s no particular reason to think that transformative AI could either.
I’m confused. What I’m referring to here is https://en.wikipedia.org/wiki/Blindfold_chess
I’m not sure why we shouldn’t expect an AI to be able to do well at it?
But humans play blindfold chess much more slowly than they read/write moves; they take tons of cognitive actions between each move. And at least when I play blindfold chess I need to lean heavily on my visual memory, and I often need to go back over the game so far for error-correction purposes, laboriously reading and writing to a mental scratchspace. I don’t know if better players do that.
But an AI can do completely fine at the task by writing to an internal scratchspace. You are defining a restriction on what kind of AI is allowed, and I’m saying that human cognition probably doesn’t satisfy the analogous restrictions. I think to learn to play blindfold chess humans need to explicitly think about cognitive strategies, and the activity is much more similar to equipping an LM with the ability to write to its own context and then having it reason aloud about how to use that ability.
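A sketch of that kind of setup, where the model maintains a board-state scratchpad inside its own context and reasons aloud before each move; call_lm is a hypothetical stand-in for the language model, and the parsing assumes the model follows the requested format:

```python
# Sketch of the "LM writes to its own context" setup described above.
# call_lm is a hypothetical callable standing in for the actual model API.

from typing import Callable


def play_with_scratchpad(call_lm: Callable[[str], str],
                         opponent_moves: list[str]) -> list[str]:
    """The model maintains its own board-state scratchpad inside the prompt."""
    scratchpad = "Starting position; no moves played yet."
    my_moves: list[str] = []
    for opp in opponent_moves:
        prompt = (
            f"Scratchpad (your own notes on the board state):\n{scratchpad}\n\n"
            f"Opponent just played: {opp}\n"
            "First rewrite the scratchpad to reflect the new position, then reason\n"
            "step by step, then give your move, in the format:\n"
            "SCRATCHPAD: ...\nREASONING: ...\nMOVE: ..."
        )
        reply = call_lm(prompt)
        # Carry the model's own scratchpad forward into the next prompt.
        scratchpad = reply.split("SCRATCHPAD:")[1].split("REASONING:")[0].strip()
        my_moves.append(reply.split("MOVE:")[1].strip())
    return my_moves
```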
The reason I don’t want a scratchspace is that I view scratchspace and context as equivalent to giving the AI a notecard that it can peek at. I’m not against having extra categories or asterisks for the different kinds of AI in the small test.
Thinking aloud and giving it a scratchspace would likely make it a lot more tractable for interpretability and alignment research, I’ll grant you that.
I appreciate the feedback, and I will think about your points more, though I’m not sure if I will agree.