Yes, that’s a very common observation. After all, you still have to try to model the current player’s planning & move based on the board state, and in the final layers, you also have to generate the actual prediction—the full 51k BPE logit array or whatever. That has to happen somewhere, and the final layers are the most logical place to do so. Same as with CNNs doing image classification: the final layer is a bad place to get an embedding from, because by that point, the CNN is changing the activations for the final categorical output.
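For intuition, here is a minimal probing sketch (the checkpoint name `my-org/chess-gpt2` and the board-state labels are hypothetical placeholders, not a real released model or dataset): fit the same linear probe on a middle layer’s activations and on the final layer’s, and compare.

```python
# Sketch: compare linear-probe accuracy for board-state features taken from a
# middle transformer layer vs. the final layer. Model name and labels are
# placeholders, not a real released checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL = "my-org/chess-gpt2"  # hypothetical chess LLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

def hidden_states_for(move_prefix: str):
    """Return each layer's activation at the last token of a move-sequence prefix."""
    ids = tokenizer(move_prefix, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # out.hidden_states: tuple of (1, seq_len, d_model), one per layer (plus embeddings)
    return [h[0, -1].numpy() for h in out.hidden_states]

def probe_accuracy(games, labels, layer: int) -> float:
    """games: move-prefix strings; labels: e.g. empty/white/black for one square."""
    X = [hidden_states_for(g)[layer] for g in games]
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# e.g. probe_accuracy(games, labels, layer=6) vs. probe_accuracy(games, labels, layer=-1)
```

The usual outcome of this kind of comparison is that some middle layer probes best, because the last layers are already busy turning the board representation into move logits.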
I think that more important than how easy the information is to extract is how necessary it is to extract at all: you can probably be somewhat fuzzy about board-state details and still get great prediction accuracy.
Yes. This gets back to the core of ‘what is an imitation-learning LLM doing?’ Janus’s Simulators puts the emphasis on ‘it is learning the distribution, and learning to simulate worlds’; but the DRL perspective puts the emphasis on ‘it is learning to act like an agent to maximize its predictive-reward, and learns simulations/worlds only insofar as that is necessary to maximize reward’. It learns a world-model which chucks out everything which is unnecessary for maximizing reward: this is not a faithful model but a value-equivalent model.
If there is some aspect of the latent chess state which doesn’t, ultimately, help win games, then a chess LLM (or MuZero) doesn’t want to learn to model that part of the latent chess state, because it is, by definition, useless. (It may learn to model it, but for other reasons like accident or having over-generalized or because it’s not yet sure that part is useless or as a shortcut etc etc.)
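To make ‘value-equivalent rather than faithful’ concrete, here is a toy MuZero-style training step (module sizes and shapes are illustrative, not taken from any real implementation): the latent state is supervised only through the policy, value, and reward heads, so any board feature that never affects those targets receives no gradient pressure to be represented at all.

```python
# Toy sketch of a value-equivalent (MuZero-style) model: the latent state is
# supervised only via policy/value/reward targets, never via reconstruction
# of the observation, so "useless" board details carry no training signal.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, LATENT, N_ACTIONS = 64 * 13, 256, 4672  # illustrative chess-ish sizes

repr_net = nn.Sequential(nn.Linear(OBS_DIM, LATENT), nn.ReLU())             # h = repr(obs)
dyn_net  = nn.Sequential(nn.Linear(LATENT + N_ACTIONS, LATENT), nn.ReLU())  # h' = dyn(h, a)
policy_head = nn.Linear(LATENT, N_ACTIONS)
value_head  = nn.Linear(LATENT, 1)
reward_head = nn.Linear(LATENT, 1)

def loss(obs, action, target_move, target_value, target_reward):
    """action/target_move: indices of moves; target_value/target_reward: floats."""
    h = repr_net(obs)
    a = F.one_hot(action, N_ACTIONS).float()
    h_next = dyn_net(torch.cat([h, a], dim=-1))
    # All gradient into the latent flows through these three heads; there is
    # no term asking h or h_next to reconstruct the actual board.
    return (F.cross_entropy(policy_head(h), target_move)
            + F.mse_loss(value_head(h).squeeze(-1), target_value)
            + F.mse_loss(reward_head(h_next).squeeze(-1), target_reward))
```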
This hasn’t previously been important to the ‘do LLMs learn a world model?’ literature because the emphasis has been on responding to naysayers like Bender, who claim that they do not and cannot learn any world model at all, that a ‘superintelligent octopus’ eavesdropping on chess game transcripts would never learn to play chess at all beyond memorization. But since that claim is now generally accepted to be false, the questions move on to ‘since they do learn world models, what sorts, how well, why, and when?’
Which will include the question of how well they can learn to amortize planning. I am quite sure the answer is that they do so to some non-zero degree, and that you are wrong in general about GPTs never planning ahead, based on evidence like Jones’s scaling laws which show no sharp transitions between training and planning and only a smooth exchange rate, and the fact that you can so easily distill results from planning into a GPT (eg. distilling the results of inner-monologue ‘planning’ into a single forward pass). So my expectation is that either your models will be too small to show clear signs of planning or you will just get false nulls from inadequate model interpretability methods—it’s not an easy thing to do, to figure out what a model is thinking and what it is not thinking, after all.
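For concreteness, a rough sketch of what distilling inner-monologue ‘planning’ into a single forward pass can look like (the field names and tokenizer are placeholders): the teacher generates with an explicit scratchpad, and the student is fine-tuned on prompt-to-answer pairs with the scratchpad discarded.

```python
# Sketch: distill inner-monologue "planning" into a single forward pass.
# The teacher produces (prompt, chain_of_thought, answer); the student is
# fine-tuned on prompt -> answer with the scratchpad dropped.
# Tokenizer and field names are placeholders.
import torch

def build_example(tokenizer, prompt: str, answer: str, max_len: int = 512):
    """Tokenize prompt+answer; mask prompt tokens out of the loss with -100."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]
    return {"input_ids": torch.tensor(input_ids),
            "labels": torch.tensor(labels)}

def make_distillation_set(tokenizer, teacher_outputs):
    """teacher_outputs: dicts with 'prompt', 'chain_of_thought', 'answer'.
    The chain_of_thought is deliberately discarded."""
    return [build_example(tokenizer, ex["prompt"], ex["answer"])
            for ex in teacher_outputs]
```

Standard next-token fine-tuning on these pairs then forces whatever lookahead the scratchpad was doing to be absorbed into the student’s weights, i.e. amortized into a single forward pass.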