Google’s new text-to-image model—Parti, a demonstration of scaling benefits
Google has released its latest text-to-image generation model, Parti. They provide a few prompts and showcase the differences between models with 350M, 750M, 3B, and 20B parameters.
One difference from last week’s Imagen is that Parti is not diffusion-based. Imagen and DALL-E 2 are diffusion models, whereas Parti is an autoregressive sequence-to-sequence model: a scaled-up Transformer that generates discrete image tokens, which a VQGAN then decodes into pixels.
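For intuition, an autoregressive text-to-image pipeline of this kind works roughly as follows. This is a toy sketch, not Parti’s actual code: the “Transformer” and “codebook” below are random stand-ins, and all names are hypothetical.

```python
import numpy as np

# Toy sketch of the autoregressive image-token loop used by
# Parti-style models: text conditions a Transformer that emits
# discrete image tokens one by one; a VQGAN-style decoder then maps
# the token grid back to pixels. Random weights stand in for the
# real model throughout.

rng = np.random.default_rng(0)

VOCAB = 16   # size of the VQGAN codebook (toy)
GRID = 4     # image is decoded from a GRID x GRID token grid
PATCH = 8    # each token decodes to a PATCH x PATCH pixel patch

# "Codebook": each discrete token maps to a fixed pixel patch.
codebook = rng.random((VOCAB, PATCH, PATCH))

def next_token_logits(text_emb, prefix):
    """Stand-in for the Transformer: logits for the next image token."""
    h = text_emb + (sum(prefix) if prefix else 0.0)
    return rng.random(VOCAB) + 0.01 * h  # toy text conditioning

def generate_image(text_emb, n_tokens=GRID * GRID):
    tokens = []
    for _ in range(n_tokens):                  # autoregressive loop
        logits = next_token_logits(text_emb, tokens)
        tokens.append(int(np.argmax(logits)))  # greedy decoding
    # VQGAN-style detokenization: paste each token's patch into place.
    img = np.zeros((GRID * PATCH, GRID * PATCH))
    for i, t in enumerate(tokens):
        r, c = divmod(i, GRID)
        img[r*PATCH:(r+1)*PATCH, c*PATCH:(c+1)*PATCH] = codebook[t]
    return tokens, img

tokens, img = generate_image(text_emb=0.5)
print(len(tokens), img.shape)  # -> 16 (32, 32)
```

The scaling comparison in the announcement is then just this same loop with ever-larger Transformers behind `next_token_logits`; the decode procedure stays the same.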
The announcement says:
Parti and Imagen are complementary in exploring two different families of generative models – autoregressive and diffusion, respectively.
...
We have decided not to release our Parti models, code, or data for public use without further safeguards in place.
There’s an interesting thread on Parti by Jason Baldridge here, and a short overview here by Google. I wonder how well the 20B model will do on rendering text characters inside images compared to diffusion-based approaches like Imagen and DALL-E 2.
Well, you can see plenty of text in the samples. Obviously, like Imagen, it beats the pants off DALL-E 2 inasmuch as you can actually read the text; not a high bar. Harder to see if it really improves over Imagen: the COCO FID improvement is small, and otherwise they omit any real Imagen vs Parti head-to-head comparison. They advertise Parti’s ability to do long, complex prompts with high fidelity, so maybe for long text insertions it’ll clearly win?
Compared to DALL-E 2, the difference in text readability is laughable.
They have provided some examples after the references section, including some direct comparisons with DALL-E 2 for text in images. Also, PartiPrompts looks like a good collection of novel prompts for eval.
Let’s give it a reasoning test.
A photo of five minus three coins.
A painting of the last main character to die in the Harry Potter series.
An essay, in correctly spelled English, on the causes of the scientific revolution.
A helpful essay, in correctly spelled English, on how to align artificial superintelligence.
It probably wouldn’t do very well.
If you scroll down to the “Discussion and Limitations” section of the page linked at the start of this post, you’ll see that, given the input “A plate that has no bananas on it. there is a glass without orange juice next to it.”, it generated a photo with both bananas and orange juice.