How are we imagining prompting the multimodal Go+English AI with questions like “is this group alive or dead?” And how are we imagining training it so that it forms intermodal connections rather than just letting them atrophy?
My past thoughts (from “What’s the dream for giving natural language instructions to AI”) were to do it like an autoencoder + translator: you use a latent space for English and for Go simultaneously, and you train both on autoencoding (or more general unimodal) tasks and on translation (or more general multimodal) tasks. But I think this will predictably fail to generalize to new multimodal tasks that are not compositions of existing tasks. This is because the connection is only at the highest level, so the English mapping would have to “know about” choices of the Go mapping that it hasn’t previously used. Maybe it could solve the problem if it could use English to talk about individual Go stones, but it can’t.
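For concreteness, here is a rough sketch of the kind of autoencoder + translator setup I mean. The module names, sizes, and losses are all made up; the point is just that the two modalities only ever meet in a single top-level latent vector:

```python
import torch.nn as nn

LATENT_DIM = 512  # illustrative size for the shared top-level latent

class GoEncoder(nn.Module):
    """Maps a 19x19 board (stone planes) to a single latent vector."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool away all per-point structure
        )
        self.proj = nn.Linear(64, LATENT_DIM)

    def forward(self, board):                 # board: (B, 3, 19, 19)
        return self.proj(self.conv(board).flatten(1))

class EnglishEncoder(nn.Module):
    """Maps a token sequence to a single latent vector."""
    def __init__(self, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, LATENT_DIM)
        self.rnn = nn.GRU(LATENT_DIM, LATENT_DIM, batch_first=True)

    def forward(self, tokens):                # tokens: (B, T)
        _, h = self.rnn(self.embed(tokens))
        return h[-1]

# Matching decoders would be trained on unimodal reconstruction, plus a
# translation loss tying the two latents together, roughly:
#   loss = recon_go + recon_english + translate(go_latent -> english_tokens)
# The failure mode described above: English only ever touches Go through one
# pooled vector, so it has no handle on individual stones.
```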
But anyhow, the impression from this post is something more like a sequence transformer. You have some symbols for English, and some symbols for Go boards, and you are allowed to just predict the next token. And sometimes sequences of Go boards are interrupted by sequences of English text asking a question about the Go being played?
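To make that concrete, here is one way the interleaved stream could look. The tokenization is purely an assumption on my part (one token per move, sharing a vocabulary with English word pieces):

```python
# One token per Go move (e.g. "B[dd]"); Go coordinates conventionally skip "i".
GO_COORDS = "abcdefghjklmnopqrst"
GO_MOVE_TOKENS = [f"{c}[{x}{y}]" for c in "BW" for x in GO_COORDS for y in GO_COORDS]

def interleave(moves, question_tokens):
    """Build one flat sequence: the moves so far, then an English question."""
    return ["<go>"] + list(moves) + ["<eng>"] + list(question_tokens)

stream = interleave(
    ["B[dd]", "W[qq]", "B[dp]"],
    ["is", "this", "group", "alive", "or", "dead", "?"],
)
# A decoder-only transformer would simply be trained to predict the next token
# of streams like this, whether that token is a move or an English word.
```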
EDIT: After thinking more, perhaps the transformer has a less extreme version of the same problem. When humans are faced with a new Go problem, what we might do is use the verbal description to carefully visualize the Go stones, which takes more time. This makes me feel like recurrence is an important component of how humans do intermodal reasoning: if we start with a word problem but need to engage our visual-based Go-reasoning abilities, we don’t have to do it all in one feed-forward pass; we can visualize a Go board in our internal state and then feed it into our entire visual-based Go-reasoning system at a later timestep.
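As a toy sketch of what I mean by recurrence (every callable here is hypothetical): the model first “visualizes” a board from the words, then repeatedly hands that imagined board back to the ordinary visual Go-reasoning stack across timesteps:

```python
# Toy sketch only: text_to_board, go_network, refine, and answer_head are all hypothetical.
def answer_word_problem(question_tokens, text_to_board, go_network, refine,
                        answer_head, steps=3):
    board = text_to_board(question_tokens)        # "visualize" a position from the words
    for _ in range(steps):
        go_features = go_network(board)           # re-engage the full visual Go-reasoning stack
        board = refine(board, go_features, question_tokens)  # update the imagined position
    return answer_head(go_features)               # verbalize the conclusion
```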
I do think the LM-only version seems easier and probably better to start with.
How are we imagining prompting the multimodal Go+English AI with questions like “is this group alive or dead?” And how are we imagining training it so that it forms intermodal connections rather than just letting them atrophy?
The hope is that you can fiddle with these things to get it to answer some questions and then see whether it generalizes.
My first guess for an architecture would be to produce a 19×19 grid of embeddings from the CNN and then let a transformer attend over them (along with the prior text). That is, you train a CNN that is supposed to produce both (moves, embeddings), and a transformer that talks and sees the embeddings.
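A minimal sketch of that hookup, assuming the 361 per-point embeddings are simply prepended to the text tokens as a prefix (dimensions, module choices, and masking details are all illustrative, not a worked-out design):

```python
import torch
import torch.nn as nn

D = 256  # illustrative embedding width shared by board points and text tokens

class BoardEncoder(nn.Module):
    """CNN that outputs both per-point move logits and a 19x19 grid of embeddings."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, D, 3, padding=1), nn.ReLU(),
            nn.Conv2d(D, D, 3, padding=1), nn.ReLU(),
        )
        self.policy = nn.Linear(D, 1)             # per-point move logits

    def forward(self, board):                     # board: (B, 3, 19, 19)
        feats = self.trunk(board)                 # (B, D, 19, 19)
        grid = feats.flatten(2).transpose(1, 2)   # (B, 361, D) point embeddings
        moves = self.policy(grid).squeeze(-1)     # (B, 361) move logits
        return moves, grid

class GoTalker(nn.Module):
    """Transformer that sees the 361 point embeddings alongside the text tokens."""
    def __init__(self, vocab=32000, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, D)
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.lm_head = nn.Linear(D, vocab)

    def forward(self, grid, text_tokens):         # grid: (B, 361, D), text: (B, T)
        seq = torch.cat([grid, self.embed(text_tokens)], dim=1)
        hidden = self.transformer(seq)            # text positions attend over every board point
        return self.lm_head(hidden[:, grid.size(1):])  # logits over the text span only

# Usage: moves, grid = BoardEncoder()(board); logits = GoTalker()(grid, question_tokens)
# (Causal masking over the text span is omitted here for brevity.)
```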