There was a result (from Pieter Abbeel's lab?) a couple of years ago showing that pretraining a model on language led to improved sample efficiency on some nominally totally unrelated RL task.
Pretrained Transformers as Universal Computation Engines

From the abstract:

We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning – in particular [...] a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction
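The core trick is freezing the pretrained transformer's self-attention and feedforward weights and training only small new input/output layers (plus layer norms) on the target modality. Here's a minimal sketch of that idea, assuming a PyTorch/HuggingFace setup; the class name, layer-selection heuristic, and toy task are illustrative, not the authors' actual code.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class FrozenPretrainedTransformer(nn.Module):
    """Sketch of a 'frozen pretrained transformer': GPT-2 blocks stay fixed,
    only a new input projection, output head, and the layer norms train."""

    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        # Freeze everything except layer-norm parameters (names contain "ln").
        for name, param in self.gpt2.named_parameters():
            if "ln" not in name:
                param.requires_grad = False
        hidden = self.gpt2.config.n_embd  # 768 for GPT-2 small
        self.input_proj = nn.Linear(input_dim, hidden)      # new, trained from scratch
        self.output_head = nn.Linear(hidden, num_classes)   # new, trained from scratch

    def forward(self, x):
        # x: (batch, seq_len, input_dim) -- e.g. bit tokens or image patches
        h = self.gpt2(inputs_embeds=self.input_proj(x)).last_hidden_state
        return self.output_head(h[:, -1])  # classify from the final position

# Toy usage: classify random 8-dim token sequences into 2 classes.
model = FrozenPretrainedTransformer(input_dim=8, num_classes=2)
logits = model(torch.randn(4, 16, 8))  # -> shape (4, 2)
```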
That’s the one, thanks!