Taking a sentence output by AI Dungeon and feeding it into DALL-E is totally possible (if and when the DALL-E source code becomes available). I’m not sure how much it would cost. DALL-E has about 7% as many parameters as the largest GPT-3 model (roughly 12 billion versus 175 billion), though I doubt AI Dungeon even uses the largest model. Generating an entire image with DALL-E means predicting 1024 tokens/codewords, whereas a text continuation is at most one token per character, and usually far fewer, since BPE tokens typically cover several characters. All in all, it seems financially plausible. I think it would be fun to see the results too.
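Just to sanity-check that, here’s a rough back-of-envelope calculation. It uses the published ballpark sizes and assumes per-token compute scales roughly with parameter count; the 100-token text continuation is just a guess on my part:

```python
# Rough back-of-envelope: compute for one DALL-E image vs. one text continuation.
# Assumes per-token cost scales linearly with parameter count; all numbers are
# published ballpark figures or guesses, not measurements.

dalle_params = 12e9   # DALL-E: ~12 billion parameters
gpt3_params = 175e9   # largest GPT-3 model: ~175 billion parameters
                      # (12e9 / 175e9 is the ~7% figure mentioned above)

image_tokens = 1024   # one DALL-E image = a 32x32 grid of image codewords
text_tokens = 100     # generous guess for one story continuation

image_cost = dalle_params * image_tokens
text_cost = gpt3_params * text_tokens  # upper bound: AI Dungeon likely uses a smaller model

print(f"image/text compute ratio: {image_cost / text_cost:.1f}x")
# -> roughly 0.7x, i.e. the same order of magnitude as a text continuation
```

So by this crude measure, a single illustration costs about as much as a single text continuation from the biggest GPT-3.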
What seems tricky to me is that a story can be much more complex than the 256-token text input that DALL-E accepts. Suppose the last sentence of the story is “He picks up the rock.” This input fits into DALL-E easily, but is very ambiguous. “He” might be illustrated by DALL-E as any arbitrary male figure, even though in the story, “He” refers to a very specific character. (“The rock” is similarly ambiguous. And there are more ambiguities, such as the location and the time of day that the scene takes place in.) If you scan back a couple of lines, you may find that “He” refers to a character called Fredrick. His name is not immediately useful for determining what he looks like, but knowing his name, we can now look through the entire story to find descriptions of him. Perhaps Fredrick was introduced in the first chapter as a farmer, but became a royal knight in Chapter 3 after an act of heroism. Whether Fredrick is currently wearing his armor might depend on the last few hundred words, and what his armor looks like was probably described in Chapter 3. Whether his hair is red could depend on the first few hundred words. But maybe in the middle of the story, a curse turned his hair into a bunch of worms.
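To make that concrete, here’s a deliberately naive sketch of the “look through the entire story to find descriptions of him” step. Nothing like this exists in AI Dungeon or DALL-E as far as I know; the helper name and the example text are mine:

```python
import re

def sentences_mentioning(story: str, name: str) -> list[str]:
    """Very naive: split the story into sentences and keep the ones that
    mention the character by name. Pronoun-only references
    ("He picks up the rock.") are missed, which is exactly the
    coreference problem described above."""
    sentences = re.split(r"(?<=[.!?])\s+", story)
    return [s for s in sentences if name in s]

story = (
    "Fredrick was a farmer with red hair. "
    "Years later he was knighted for his heroism. "
    "Fredrick's armor was black with a silver crest. "
    "He picks up the rock."
)

descriptions = sentences_mentioning(story, "Fredrick")
prompt = " ".join(descriptions) + " He picks up the rock."
print(prompt)
```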
All this is to say that to correctly interpret a sentence in a story, you potentially have to read the entire story. Trying to summarize the story could help, but can only go so far. Every paragraph of the story could contain facts about the world that are relevant to the current scene. Instead, you might want to summarize only those details of the story that are currently relevant.
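A crude sketch of what “only the currently relevant details” might look like: score each story sentence by word overlap with the current scene and keep the best ones until DALL-E’s 256-token text budget is full. Word counts stand in for BPE tokens and keyword overlap stands in for a real relevance model, so this is only an illustration:

```python
def relevant_context(story_sentences, scene, budget_tokens=256):
    """Keep the story sentences that share the most words with the current
    scene, up to a rough token budget (word counts as a crude proxy for
    BPE tokens)."""
    scene_words = set(scene.lower().split())

    def score(sentence):
        return len(scene_words & set(sentence.lower().split()))

    ranked = sorted(story_sentences, key=score, reverse=True)
    picked, used = [], 0
    for s in ranked:
        n = len(s.split())
        if used + n > budget_tokens:
            break
        picked.append(s)
        used += n
    # restore original story order so the prompt still reads like prose
    return [s for s in story_sentences if s in picked]
```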
Or maybe you can somehow train an AI that builds up a world model from the story text, so that it can answer the questions necessary for illustrating the current scene. It’s worth noting that GPT-3 has something akin to a world model that it can use to answer questions about Earth, as well as about fictional worlds it was exposed to during training. However, its ability to learn about new worlds outside of training (that is, during inference) is limited, since it can only remember the last 2,000 or so tokens of context. To me this suggests the AI needs its own memory, where it can store long-term facts about the text to help it predict the next token. I wonder whether something like that has been tried yet.
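Here’s a toy version of what I mean by giving the AI its own memory. This is pure speculation on my part, not a description of any existing system:

```python
from collections import defaultdict

class StoryMemory:
    """Toy long-term memory: facts are stored per entity as the story is
    read, then looked up later, long after the original sentences have
    fallen out of a ~2,000-token context window."""

    def __init__(self):
        self.facts = defaultdict(list)

    def remember(self, entity: str, fact: str):
        self.facts[entity].append(fact)

    def recall(self, entity: str) -> list[str]:
        return self.facts[entity]

memory = StoryMemory()
# Chapter 1
memory.remember("Fredrick", "is a farmer with red hair")
# Chapter 3
memory.remember("Fredrick", "became a royal knight; wears black armor with a silver crest")

# Thousands of tokens later, when the scene says "He picks up the rock."
# and "He" has been resolved to Fredrick:
context = "; ".join(memory.recall("Fredrick"))
print("Fredrick, who " + context + ", picks up the rock.")
```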
One way you might be able to train such a model is to have it generate movie frames from subtitles, since there’s plenty of training data available that way. Then you’re pretty close to illustrating scenes from a story.
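Something like this could mine the (subtitle line, frame) training pairs, assuming you have ffmpeg installed and a matching .srt file; the paths and names are placeholders:

```python
import re
import subprocess

TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")

def srt_entries(path):
    """Parse an .srt file into (midpoint_seconds, text) pairs."""
    entries = []
    blocks = open(path, encoding="utf-8").read().split("\n\n")
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        m = TIMESTAMP.match(lines[1])
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        entries.append(((start + end) / 2, " ".join(lines[2:])))
    return entries

def extract_pairs(video, subtitles, out_dir):
    """Grab one frame per subtitle line, yielding (caption text, frame path)."""
    for i, (t, text) in enumerate(srt_entries(subtitles)):
        frame = f"{out_dir}/frame_{i:05d}.jpg"
        subprocess.run(
            ["ffmpeg", "-ss", str(t), "-i", video, "-frames:v", "1", "-y", frame],
            check=True,
        )
        yield text, frame
```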