I am currently working on my research agenda (supported by a Long-Term Future Fund grant), which I think ties in neatly with shard theory alignment research.
The TL;DR is that I am working towards a multi-modal chess language model.
My pitch was pretty vague (and I am somewhat surprised it was funded), but I quickly realized that the most valuable thing likely to come out of my research is a toy dataset/model for investigating value learning.
Chess is imho perfect for that, because while the ultimate goal is to checkmate the opponent's king, you choose moves based on intermediate considerations like king safety, material balance, pawn structure, piece activity, control of open lines, etc.
These considerations trade off against each other when the next move is chosen, and I think it should be possible to probe models for representations of these "values", and maybe even for how they interact.
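To make the probing idea concrete, here is a minimal sketch of a linear probe for one such "value" (material balance), assuming we can extract a hidden-state vector per position from some trained chess model. `get_activations` is a hypothetical stand-in (here filled with placeholder random vectors), not part of my actual setup; the material-balance target is computed with python-chess.

```python
# Linear-probe sketch: how linearly decodable is "material balance"
# from a chess model's internal activations?
import chess
import numpy as np
from sklearn.linear_model import Ridge

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material_balance(board: chess.Board) -> int:
    """White material minus black material, in pawn units."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES.get(piece.piece_type, 0)
        score += value if piece.color == chess.WHITE else -value
    return score

def get_activations(board: chess.Board, dim: int = 256) -> np.ndarray:
    # Hypothetical: in practice, run the position through the model and
    # return e.g. residual-stream activations at some layer.
    rng = np.random.default_rng(abs(hash(board.fen())) % (2**32))
    return rng.normal(size=dim)

# Collect (activation, material balance) pairs from a set of positions.
boards = [chess.Board()]  # in practice: positions sampled from real games
X = np.stack([get_activations(b) for b in boards])
y = np.array([material_balance(b) for b in boards])

# Fit the probe; held-out R^2 indicates how linearly decodable the
# "value" is from the activations.
probe = Ridge(alpha=1.0).fit(X, y)
```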
Compared to language models, chess models can use almost any architecture and can be trained via imitation learning, reinforcement learning, or even on verbal commentary. They can be trained to output moves, likely game outcomes, or verbal commentary. These considerations also seem to me more like genuine values than the human values represented in an LLM, because they actually shape the model's behavior.
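For the imitation-learning variant, a toy version might look like the sketch below: a small policy network trained to predict the move played from a board encoding. The 8x8x12 piece planes and the AlphaZero-style move vocabulary size are illustrative choices of mine, not requirements of the project.

```python
# Imitation-learning sketch: predict the played move from a position.
import torch
import torch.nn as nn

N_MOVES = 4672  # AlphaZero-style move encoding; any fixed move vocabulary works

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 8 * 12, 512), nn.ReLU(),
            nn.Linear(512, N_MOVES),  # logits over the move vocabulary
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, positions, move_ids):
    # positions: (batch, 12, 8, 8) piece planes;
    # move_ids: (batch,) index of the move actually played in the game.
    logits = model(positions)
    loss = nn.functional.cross_entropy(logits, move_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```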
If you think this might fit into your research agenda, drop me a line.