No, not really; they are not arg-maxers. They combine an unconditional generative model (which maps noise to samples of realistic images by learning to denoise) with a discriminative model (which maps images to text) in order to sample, via iterative gradient descent, from a conditional model: realistic images which the discriminative model would map to the text query.
“Asking for the most X-like thing” would amount to ignoring or underweighting the generative model, and that produces DeepDream-like garbage images. The relative weighting is one of the main hyperparameters in any diffusion model, so this is easy to try out yourself: samples weighted entirely toward the discriminator are DeepDream garbage at best, while samples weighted entirely toward the unconditional generative model are boring natural texture patterns.
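For concreteness, here is a minimal sketch of the guided denoising direction that this weighting interpolates, in the style of classifier-guided diffusion. `unet`, `classifier`, and `text_query` are hypothetical stand-ins for illustration, not any particular library's API:

```python
import torch

def guided_score(x_t, t, text_query, unet, classifier, guidance_scale):
    """Sketch of one classifier-guided denoising direction.

    Assumptions (hypothetical interfaces): `unet(x_t, t)` predicts the
    unconditional score for the noisy image x_t at timestep t, and
    `classifier(x, t, text_query)` returns log p(text | noisy image),
    as in CLIP-guided or classifier-guided diffusion.
    """
    # Unconditional generative model: pulls the sample toward realistic images.
    uncond_score = unet(x_t, t)

    # Discriminative model: gradient of log p(text | image) w.r.t. the image,
    # i.e. the direction that makes the image "more X-like".
    x = x_t.detach().requires_grad_(True)
    log_prob = classifier(x, t, text_query)
    cond_grad = torch.autograd.grad(log_prob.sum(), x)[0]

    # guidance_scale = 0    -> pure unconditional model: generic natural textures.
    # guidance_scale large  -> discriminator dominates: DeepDream-style garbage.
    # Moderate values sample from the conditional model described above.
    return uncond_score + guidance_scale * cond_grad
```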
Basically, the discriminative model learns how language slices up the space of all images, while the generative model, crucially, learns the actual lower-dimensional embedded geometry of the distribution of realistic images, which is not something pure discriminative models learn. The discriminative model by itself has no knowledge of which images are realistic, and optimizing solely for its extrema yields nonsense because it takes you far from the complex boundary of realistic images.
Nate’s response just seems confused about how diffusion models work.
Different results here: https://twitter.com/summerstay1/status/1579759146236510209