Research agenda: Building a multi-modal chess-language model
This is one of the posts that detail my research agenda, which tries to marry Deep Learning and chess to gain insight into current AI technologies.
How “the world”, “what the model knows about the world” and “what the model says about the world” hang together seems to me to be a core question for prosaic alignment. In chess, the state of the world can be automatically analysed by powerful engines.
This extends to the detection of high-level concepts like zugzwang, initiative, king safety, development, corresponding squares, etc. A multi-modal chess-language model’s truthfulness and symbol-grounding capabilities could therefore be automatically quantified at several levels of complexity.
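As a minimal sketch of what such automatic analysis could look like, using the python-chess package and assuming a local Stockfish binary (the path and search depth are placeholders):

```python
import chess
import chess.engine

# Minimal sketch: obtain an automatic ground-truth evaluation for a position.
# Assumes python-chess is installed and a Stockfish binary exists at the
# given path (placeholder).
engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")

# Position after 1.e4 e5 2.Nf3 Nc6.
board = chess.Board(
    "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"
)
info = engine.analyse(board, chess.engine.Limit(depth=18))

# PovScore -> centipawns from White's point of view; mates map to a large value.
score_cp = info["score"].white().score(mate_score=100_000)
print(f"Engine evaluation: {score_cp} centipawns")

engine.quit()
```

A claim a model makes about a position (“White is clearly better”, “this move loses material”) could then be checked against such engine output.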
Of course it would also just be awesome.
There is current research into how to generate chess comments. Probably the best such attempt, Chris Butner’s amazing project “Chess Coach”, gives a nice overview.
Generally speaking, the quality of these generated comments is low. They are frequently nonsensical or only loosely connected to the position. In large part this must be due to the small dataset sizes, which range from 300,000 to 1 million position/comment pairs. Another problem is that the existing datasets consist mostly of very low-quality comments by weak players.
There are several avenues to improve these models:
Pretraining is a big one:
I plan on using an encoder-decoder architecture that encodes chess positions and decodes into natural language.
The encoder will be pretrained on chess move prediction and possibly on several other tasks: predicting the game outcome, an engine score, the future piece occupancy of squares, the length of the game, mate, etc.
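To make the multi-task pretraining concrete, here is a minimal PyTorch sketch; the board encoding (64 square tokens with 13 piece states) and the particular set of heads are my assumptions, not a settled design:

```python
import torch
import torch.nn as nn

class ChessEncoder(nn.Module):
    """Sketch: transformer encoder over 64 square tokens with
    auxiliary pretraining heads (all sizes are placeholders)."""

    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # 13 states per square: 6 white pieces, 6 black pieces, empty.
        self.square_emb = nn.Embedding(13, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(64, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

        # One head per pretraining task.
        self.move_head = nn.Linear(d_model, 64 * 64)  # from-square x to-square
        self.outcome_head = nn.Linear(d_model, 3)     # win / draw / loss
        self.score_head = nn.Linear(d_model, 1)       # engine evaluation (regression)
        self.length_head = nn.Linear(d_model, 1)      # remaining game length
        self.occupancy_head = nn.Linear(d_model, 13)  # future occupancy, per square

    def forward(self, squares):
        # squares: (batch, 64) integer piece codes.
        h = self.encoder(self.square_emb(squares) + self.pos_emb)
        pooled = h.mean(dim=1)  # simple global summary of the position
        return {
            "hidden": h,
            "move": self.move_head(pooled),
            "outcome": self.outcome_head(pooled),
            "score": self.score_head(pooled),
            "length": self.length_head(pooled),
            "occupancy": self.occupancy_head(h),  # one prediction per square
        }
```

Each head would get its own loss (cross-entropy for moves, outcomes and occupancy; regression for scores and game length), combined with task weights during pretraining.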
The decoder will be pretrained (or finetuned) on a natural-language corpus that is as chess-related as I can make it. The idea is to plug an encoder that knows chess very well into a decoder that is already proficient at making grammatical and non-self-contradictory statements.
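The plugging-in could look roughly like the following sketch, in plain PyTorch, of a text decoder that cross-attends to the encoder’s 64 per-square vectors. In practice the decoder would be a pretrained language model; the vocabulary size and dimensions here are placeholders:

```python
import torch
import torch.nn as nn

class CommentDecoder(nn.Module):
    """Sketch: autoregressive text decoder that cross-attends to the
    chess encoder's 64 per-square vectors ("memory")."""

    def __init__(self, vocab_size=30_000, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):
        # tokens: (batch, seq) comment token ids
        # memory: (batch, 64, d_model) per-square encoder outputs
        # (token positional embeddings omitted to keep the sketch short)
        seq_len = tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            seq_len
        ).to(tokens.device)
        h = self.decoder(self.token_emb(tokens), memory, tgt_mask=causal_mask)
        return self.lm_head(h)  # next-token logits
```

Finetuning then amounts to ordinary next-token cross-entropy on position-comment pairs, with each position encoded once and fed in as memory.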
However, training this pretrained architecture on only several hundred thousand position-comment pairs is probably not going to be enough. So a key part of my research roadmap is to create a new and much larger dataset.
Partly, this will be synthetic. Connecting moves, squares, pieces, checks, mates and stalemates with the correct phrases and vocabulary does not have to depend on actual human-generated comments. By automatically generating data that is essentially a description of what is happening on the board, a translation from game notation into natural language, this stumbling block can be removed completely.
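A minimal sketch of such a notation-to-language translation, using the python-chess package; the phrase templates are illustrative placeholders:

```python
import random

import chess

def describe_move(board: chess.Board, move: chess.Move) -> str:
    """Sketch: turn one legal move into a natural-language sentence
    using simple templates (the templates are placeholders)."""
    piece = chess.piece_name(board.piece_at(move.from_square).piece_type)
    target = chess.square_name(move.to_square)
    san = board.san(move)  # must be computed before the move is pushed

    if board.is_capture(move):
        text = random.choice([
            f"The {piece} captures on {target}.",
            f"{san} takes a piece on {target}.",
        ])
    else:
        text = random.choice([
            f"The {piece} moves to {target}.",
            f"{san} brings the {piece} to {target}.",
        ])

    # Play the move to detect check, mate and stalemate, then undo it.
    board.push(move)
    if board.is_checkmate():
        text += " That is checkmate."
    elif board.is_stalemate():
        text += " The game ends in stalemate."
    elif board.is_check():
        text += " This gives check."
    board.pop()
    return text

board = chess.Board()
move = board.parse_san("e4")
print(describe_move(board, move))  # e.g. "The pawn moves to e4."
```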
Mostly, I will extract human-generated commentary from chess videos. This will very likely be bottlenecked by the compute I can leverage, not by the availability of data. There are professional chess streamers who generate large amounts of chess commentary while playing very short live games online. These players are generally of master or grandmaster strength, with several world-elite players among them.
The commentary would have to be extracted via speech-to-text models. The game positions would have to be located and read out by what will likely be several different computer vision models. Probably only a fraction of the frames will need to be processed: the requirement that detections fit into a sequence of legal moves greatly restricts the possible positions, and of course positions change at most once every few dozen frames even in the fastest games.
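A sketch of how the legal-move constraint could be exploited, assuming python-chess and a hypothetical vision model that returns a noisy square-to-piece mapping per frame:

```python
from typing import Optional

import chess

def piece_map_of(board: chess.Board) -> dict:
    """Square index -> piece symbol, e.g. {28: 'P'} for a pawn on e4."""
    return {sq: piece.symbol() for sq, piece in board.piece_map().items()}

def match_detection(board: chess.Board, detected: dict) -> Optional[chess.Move]:
    """Sketch: find the legal move whose resulting position best matches a
    noisy detection (square -> piece symbol) from a video frame. `detected`
    would come from a hypothetical vision model, e.g. detect_piece_map(frame)."""
    best_move, best_errors = None, None
    for move in board.legal_moves:
        board.push(move)
        truth = piece_map_of(board)
        board.pop()
        # Count squares on which detection and candidate position disagree.
        squares = set(truth) | set(detected)
        errors = sum(truth.get(sq) != detected.get(sq) for sq in squares)
        if best_errors is None or errors < best_errors:
            best_move, best_errors = move, errors
    # Accept only near-perfect matches and otherwise wait for a cleaner
    # frame (the one-square threshold is a placeholder).
    return best_move if best_errors is not None and best_errors <= 1 else None
```

Frames that are nearly identical to the last processed one could additionally be skipped with a cheap pixel-difference check before the vision model is invoked at all.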
I have done similar computer vision work in the past, and I think that with the right hardware and a decent download speed this approach should make it possible to create a high-quality dataset of chess comments many times larger than anything available so far.
It’s unclear what sort of insight you hope to gain. In any case, it sounds to me like capability research, which is more likely to be net harmful for humanity, rather than safety research.
My reasoning is partly that we know that large AGI outfits do not necessarily publish their insights into the capabilities of their systems and architectures. But it seems to me to be quite important to develop a strong understanding of these capabilities.
Given that I would use existing techniques in a toy scenario, I think it’s very unlikely that I would create new capabilities. Maybe I would discover unknown capabilities, but these would exist in similar systems anyway. And of course, which discoveries I decide to publish is a separate question altogether.
I also wouldn’t call this “safety research”, though I think such a model might downstream be useful for prosaic alignment. My motivation is mostly to understand whether AGI is 5 years away or 30. And to know which breakthroughs fill the remaining gaps and which don’t.