Moderately interesting news in AI image gen:
It’s been a good while since AI chat assistants gained the ability to generate images on request. Unfortunately, for about as long, people have been peeved at the disconnect between what they asked for and what they actually got. Particularly annoying was the assistants’ tendency to claim they had generated what you wanted, or edited an image as requested, without *actually* doing so.
This was an unfortunate consequence of the LLM (the assistant persona you speak to) and the image generator (the thing that *actually* spits out images from prompts) being two entirely separate entities. The LLM doesn’t have any more control over the image model than you do when running something like Midjourney or Stable Diffusion. It sends a prompt through a function call, gets an image in response, and then tries to modify the prompt to meet the user’s needs. Depending on how lazy the devs are, it might not even be ‘looking’ at the final output at all.
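To make that separation concrete, here’s a minimal sketch of the arrangement. Everything in it is hypothetical (`fake_image_model`, `Assistant`, the reply strings) and it isn’t any vendor’s actual code. It just shows the shape of the problem: the assistant only ever handles a prompt string and an opaque blob of bytes.

```python
import hashlib


def fake_image_model(prompt):
    """Stand-in for a separate image generator (hypothetical).

    A real image model returns pixels; this returns bytes derived from
    the prompt so the example runs offline.
    """
    return hashlib.sha256(prompt.encode("utf-8")).digest()


class Assistant:
    """Toy chat assistant that can only reach the image model via a function call."""

    def __init__(self, image_tool):
        self.image_tool = image_tool
        self.last_prompt = None

    def generate(self, user_request):
        # The assistant's entire influence over the image is this prompt string.
        self.last_prompt = "High quality image of: " + user_request
        image = self.image_tool(self.last_prompt)
        # Nothing here ever inspects `image`, yet the reply claims success.
        return "Here is your image of " + user_request + "!", image

    def edit(self, change):
        # "Editing" is a longer prompt and a fresh roll; the old image is never touched.
        self.last_prompt = self.last_prompt + ", " + change
        image = self.image_tool(self.last_prompt)
        return "Done, I've applied this change: " + change, image


assistant = Assistant(fake_image_model)
reply, first = assistant.generate("a cat")
reply, second = assistant.edit("make it orange")
print(reply)
print(first != second)  # the "edited" image is simply a different image
```

Note that `edit` happily reports success without any way of knowing whether the new image kept anything from the old one.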
The image models, on the other hand, are a fundamentally different kind of model, usually diffusion-based (Google a better explanation, but the gist is that they start from a sample of random noise and iteratively hallucinate the noise away until it resembles the desired image), whereas LLMs are Transformers that predict one token at a time. The image models do have some understanding of semantics, but they’re far stupider than LLMs when it comes to grasping the finer meaning in a prompt.
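The "start from noise and refine" idea can be sketched in a few lines. This is a toy, not a real diffusion model: `toy_denoiser` is a hypothetical stand-in that cheats by peeking at the target, where a real model would be a trained network predicting the noise from the prompt.

```python
import random


def toy_denoiser(sample, target, step, total_steps):
    """Hypothetical stand-in for a trained denoising network.

    It removes a share of the remaining error on each call; the share
    grows to 1.0 on the final step, so the last call lands on the target.
    """
    fraction = 1.0 / (total_steps - step)
    return [s + fraction * (t - s) for s, t in zip(sample, target)]


def generate(target, total_steps=20, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    history = [sample]
    for step in range(total_steps):
        sample = toy_denoiser(sample, target, step, total_steps)
        history.append(sample)
    return sample, history


def error(sample, target):
    """Mean absolute difference between a sample and the target."""
    return sum(abs(s - t) for s, t in zip(sample, target)) / len(target)


target = [0.0, 0.25, 0.5, 0.75, 1.0]  # a five-"pixel" image
final, history = generate(target)
for step in (0, 5, 10, 20):
    print(step, round(error(history[step], target), 4))
```

The printed error shrinks step by step, which is the whole trick: no single step produces the image, the loop does.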
This has now changed.
Almost a year back, OpenAI [teased](https://x.com/gdb/status/1790869434174746805) the ability of their then-unreleased GPT-4o to generate images *natively*. It was the LLM (more of a misnomer now than ever) actually making the image, in the same manner it could output text or audio.
In other words, the LLM doesn’t just “talk” to the image generator; it *is* the image generator, processing everything as tokens, much like it handles text or audio.
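Here’s a rough sketch of what "everything as tokens" could look like. How GPT-4o or Gemini actually represent images internally isn’t something I know in detail, so treat the vocabulary, the 4-entry `CODEBOOK` and the `<image>` markers as assumptions made up for illustration. The point is only that text and image pieces end up as IDs in one sequence.

```python
TEXT_VOCAB = ["<image>", "</image>", "a", "red", "square", "on", "blue"]
TEXT_IDS = {word: i for i, word in enumerate(TEXT_VOCAB)}

# Hypothetical codebook: each image token stands for one grey level.
CODEBOOK = [0, 85, 170, 255]
IMAGE_OFFSET = len(TEXT_VOCAB)  # image token IDs start after the text IDs


def encode_text(words):
    return [TEXT_IDS[w] for w in words]


def encode_image(pixels):
    """Map each pixel to the ID of its nearest codebook entry, row by row."""
    tokens = [TEXT_IDS["<image>"]]
    for row in pixels:
        for value in row:
            nearest = min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - value))
            tokens.append(IMAGE_OFFSET + nearest)
    tokens.append(TEXT_IDS["</image>"])
    return tokens


def decode_image(tokens, width):
    """Invert encode_image: drop the markers and rebuild rows of grey levels."""
    body = [CODEBOOK[t - IMAGE_OFFSET] for t in tokens[1:-1]]
    return [body[i:i + width] for i in range(0, len(body), width)]


image_tokens = encode_image([[0, 90], [250, 170]])
sequence = encode_text(["a", "red", "square"]) + image_tokens
print(sequence)                      # one flat list of IDs, text and image alike
print(decode_image(image_tokens, 2))
```

Once the image lives in the same sequence as the conversation, a model that predicts the next token can in principle keep going in either modality, which is what the tool-call setup could never do.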
Unfortunately, we’ve had nothing but radio silence since then, barring a few leaks of front-end code suggesting OAI would finally switch from DALL-E 3 to GPT-4o for image generation, as well as Altman’s assurances that they hadn’t canned the project on safety grounds.
Unfortunately for him, [Google has beaten OpenAI to the punch](https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation/). Gemini 2.0 Flash Experimental (don’t ask) has now been blessed with the ability to directly generate images. I’m not sure if this has rolled out to the consumer Gemini app, but it’s readily accessible on their developer preview.
First impressions: [It’s good.](https://x.com/robertriachi/status/1899854394751070573)
You can generate an image, and then ask it to edit a feature. It will then edit the *original* image and present the version modified to your taste, unlike all other competitors, who would basically just re-prompt and hope for better luck on the second roll.
Image generation just got way better, at least in the realm of semantic understanding. Most of the usual giveaways of AI-generated imagery, such as butchered text, are largely solved. It isn’t perfect, but you’re looking at a failure rate of 5-10% as opposed to >80% when using DALL-E or Flux. It doesn’t beat Midjourney on aesthetics, but we’ll get there.
You can imagine the scope for chicanery, especially if you’re looking to generate images with large amounts of verbiage or numbers involved. I’d expect the usual censoring in consumer applications, especially since the LLM has finer control over things. But it certainly massively expands the mundane utility of image generation, and is something I’ve been looking forward to ever since I saw the capabilities demoed.
Gemini 2.0 Flash Experimental is also dirt cheap on the API, and while image gen definitely burns more tokens, it’s a trivial expense. I’d strongly expect Google to make this free just to steal OAI’s thunder.