I do think the LM-only version seems easier and probably better to start with.
How are we imagining prompting the multimodal Go+English AI with questions like “is this group alive or dead?” And how are we imagining training it so that it forms intermodal connections rather than just letting them atrophy?
The hope is that you can fiddle with these things to get it to answer some questions and then see whether it generalizes.
My first guess for an architecture would be producing a 19 x 19 grid of embeddings from the CNN, and then letting a transformer attend over them (along with the prior text). That is, you train a CNN that is supposed to produce both moves and embeddings, and a transformer that talks and sees the embeddings.
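Something like the sketch below is what I have in mind (PyTorch; all module names and dimensions are made up for illustration, and causal masking over the text positions is omitted): the CNN emits move logits plus a 19 x 19 grid of embeddings, which get flattened into 361 board "tokens" that the transformer attends over together with the prior text.

```python
import torch
import torch.nn as nn

class BoardEncoder(nn.Module):
    """CNN over the board; returns move logits plus per-point embeddings."""
    def __init__(self, in_planes=17, d_model=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, d_model, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, padding=1), nn.ReLU(),
        )
        self.move_head = nn.Conv2d(d_model, 1, 1)  # logits for the next move

    def forward(self, board):                       # (B, in_planes, 19, 19)
        h = self.trunk(board)                       # (B, d_model, 19, 19)
        move_logits = self.move_head(h).flatten(1)  # (B, 361)
        emb = h.flatten(2).transpose(1, 2)          # (B, 361, d_model)
        return move_logits, emb

class GoTalker(nn.Module):
    """Transformer that attends over the board embeddings and the prior text."""
    def __init__(self, vocab=32000, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, board_emb, text_ids):
        # Prepend the 361 board tokens to the embedded text tokens.
        x = torch.cat([board_emb, self.tok_emb(text_ids)], dim=1)
        return self.lm_head(self.transformer(x))    # per-position token logits

# Usage sketch: train the move loss and the text loss jointly so the
# embeddings stay useful to both heads (i.e. the intermodal connection
# doesn't atrophy).
cnn, lm = BoardEncoder(), GoTalker()
board = torch.zeros(1, 17, 19, 19)
text = torch.randint(0, 32000, (1, 12))
move_logits, emb = cnn(board)
text_logits = lm(emb, text)
```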