I just finished Iain M Banks’ ‘The Player of Games’ so my thoughts are being influenced by that, but it had an interesting main character who made it his mission to become the best “general game-player” (i.e. no specialising in specific games), so I would be interested to see whether policy-based reinforcement learning models scale (thinking of how Agent 57 exceeded human performance across all Atari games).
It seems kind of trivially true that a large enough MuZero with some architectural changes could do something like play chess, shogi and go – by developing separate “modules” for each. At a certain size, this would actually be optimal: there would be no point in developing general game-play strategies; they would be redundant. But suppose you scale it up, and then add hundreds or thousands of other games and unique tasks into the mix, with multiple timescales, multiple kinds of input, and multiple simulation environments? This is assuming we’ve figured out how to automate reward design, or that the model(s) (multi-agent?) develop a better representation of reward themselves.
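To make the “separate modules” vs. “shared representation” distinction concrete, here’s a purely illustrative sketch – nothing like the real MuZero architecture, and every name and size is made up: a shared trunk that any cross-game transfer would have to live in, plus per-game heads that could, given enough capacity, each just solve their own game.

```python
# Hypothetical sketch only - not MuZero. A shared trunk with per-game heads,
# to make "separate modules" vs. "shared representation" concrete.
import torch
import torch.nn as nn

class MultiGameNet(nn.Module):
    def __init__(self, obs_dim: int, action_dims: dict[str, int], hidden: int = 256):
        super().__init__()
        # Shared representation: the only place cross-game transfer could happen.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Per-game "modules": with enough capacity, each head (plus the trunk
        # capacity it monopolises) could solve its game without sharing anything.
        self.policy_heads = nn.ModuleDict(
            {game: nn.Linear(hidden, n) for game, n in action_dims.items()}
        )
        self.value_heads = nn.ModuleDict(
            {game: nn.Linear(hidden, 1) for game in action_dims}
        )

    def forward(self, game: str, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.policy_heads[game](h), self.value_heads[game](h)

# Placeholder observation/action sizes. The interesting question is what ends up
# in the trunk as the number of games grows: game-specific tricks, or something general.
net = MultiGameNet(obs_dim=64, action_dims={"chess": 100, "go": 100, "shogi": 100})
logits, value = net("go", torch.randn(1, 64))
```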
At what point does transfer learning occur? I saw a very simple version of this when I was fine-tuning GPT-2 on a dataset of 50-60 distinct authors. When you look at validation loss, you always see a particular shape – when you add another author, validation loss rises, then it plateaus, then it sharply falls when the model makes some kind of “breakthrough” (whatever that means for a transformer). When you train on authors writing about the same topics, final loss is lower, and the end outputs are a lot more coherent. The model benefits from the additional coverage, because it learns to generalise better. So after piling on more and more games and increasing the model size, at what point would we see the sharp, non-linear fall?
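Roughly the kind of measurement I mean – this is a sketch, not the original code; the fine-tuning loop and data loading are placeholders, and only the loss computation uses the real transformers API (a causal LM returns a cross-entropy loss when given labels):

```python
# Sketch: track GPT-2 validation loss as authors are added to the fine-tuning corpus.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def validation_loss(texts: list[str]) -> float:
    """Mean per-token cross-entropy over a small held-out sample."""
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

# Pseudo outer loop: add one author at a time, fine-tune, re-measure, and watch
# for the rise -> plateau -> sharp fall shape described above.
# for k, author in enumerate(authors, start=1):
#     corpus += load_author(author)      # hypothetical helper
#     fine_tune(model, corpus)           # hypothetical helper
#     print(k, validation_loss(held_out))
```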
At what point does it start developing shared representations of multiple games?
At what point does it start meta-learning at evaluation time, the way GPT-3 does with a few examples in the prompt? Performing well at unseen games?
What about social games involving text communication? Interaction with real-world systems?
Suppose you gave it a ‘game’ involving predicting sequences of text as accurately as possible, another involving predicting sequences of pixels, and another involving modelling physical environments. At what point would it get good at all of those – as good at the text one as GPT-3?
And we could flip this question around: suppose you fine-tuned GPT-X on a corpus of games represented as text. (We already know GPT-2 can kind of handle chess.) At what version number would you see it start to perform as well as MuZero?
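For what “games represented as text” might look like, here’s a toy illustration – the two games below are just well-known openings typed out by hand; a real corpus would be extracted from PGN files rather than hard-coded:

```python
# Toy illustration: serialise chess games as plain move-list text, so a language
# model can be fine-tuned on next-move prediction.
example_games = [
    ["e4", "e5", "Nf3", "Nc6", "Bb5"],   # Ruy Lopez
    ["d4", "d5", "c4", "e6", "Nc3"],     # Queen's Gambit Declined
]

def game_to_text(moves: list[str]) -> str:
    """Render a move list as the numbered text a language model would see."""
    parts = []
    for i in range(0, len(moves), 2):
        move_no = i // 2 + 1
        parts.append(f"{move_no}. " + " ".join(moves[i:i + 2]))
    return " ".join(parts)

for game in example_games:
    print(game_to_text(game))
# 1. e4 e5 2. Nf3 Nc6 3. Bb5
# 1. d4 d5 2. c4 e6 3. Nc3
```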
The other commenter is right – the architecture is important, but it’s not about the architecture. It’s about the task, and whether, in order to do that task well, you need to learn other more general and transferable skills in the process.
In this sense, it doesn’t matter whether you’re modelling text or playing a video-game – given enough compute and enough data, the end result always converges on the same set of universally-useful skills: a stable world-model, a theory of mind (we could unpack this into an understanding of agency and goal-setting, and keeping track of the mental states of others), meta-learning, application of logic, an understanding of causality, pattern-recognition, and problem-solving.