Very interesting paper!
A fun thing to think about: the technique used to “attack” CLIP in section 4.3 is very similar to the old “VQGAN+CLIP” image generation technique, which was very popular in 2021 before diffusion models really took off.
VQGAN+CLIP works in the latent space of a VQ autoencoder, rather than in pixel space, but otherwise the two methods are nearly identical.
For instance, VQGAN+CLIP also ascends the gradient of the cosine similarity between a fixed prompt vector and the CLIP embedding of the image, averaged over various random augmentations like jitter/translation/etc. And it uses an L2 penalty in the loss, which (via the Karush–Kuhn–Tucker conditions) means it's effectively trying to find the best perturbation within an ϵ-ball, albeit with an implicitly determined ϵ that varies from example to example.
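To make the analogy concrete, here's a minimal sketch of that inner loop in PyTorch. The callables `decode`, `clip_encode_image`, and `augment` are hypothetical stand-ins for the VQGAN decoder, the CLIP image encoder, and a random-augmentation function; real VQGAN+CLIP notebooks differ in details like the number of augmented views and how the penalty is weighted.

```python
import torch
import torch.nn.functional as F

def vqgan_clip_step(latents, latents_init, decode, clip_encode_image, augment,
                    prompt_embed, optimizer, n_views=8, l2_weight=0.1):
    """One gradient step of a VQGAN+CLIP-style optimization (sketch only)."""
    optimizer.zero_grad()
    image = decode(latents)  # VQGAN decoder: latent codes -> image tensor (1, C, H, W)
    # Encode several randomly augmented views and average the CLIP similarity
    # to the fixed prompt embedding over them.
    views = torch.cat([augment(image) for _ in range(n_views)], dim=0)
    img_embeds = F.normalize(clip_encode_image(views), dim=-1)
    txt_embed = F.normalize(prompt_embed, dim=-1)
    sim = (img_embeds @ txt_embed).mean()
    # Soft L2 penalty on the deviation from the starting latents: the penalized
    # analogue of constraining the perturbation to an (implicit) eps-ball.
    loss = -sim + l2_weight * (latents - latents_init).pow(2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```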
I don’t know whether anyone tried downscaling and then upscaling the image by varying amounts as a VQGAN+CLIP augmentation, but people tried a lot of different augmentations back in the heyday of VQGAN+CLIP, so it wouldn’t surprise me.
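If someone did try it, it would presumably look something like the following sketch (not taken from any particular notebook): pick a random scale factor, downscale, then upscale back to the original resolution, throwing away high-frequency detail in the process.

```python
import random
import torch.nn.functional as F

def random_rescale(image, min_scale=0.25):
    """Downscale then upscale an (N, C, H, W) image by a random factor,
    discarding high-frequency content (a sketch of the multi-resolution idea)."""
    _, _, h, w = image.shape
    s = random.uniform(min_scale, 1.0)
    small = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(h, w), mode="bilinear", align_corners=False)
```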
(One of the augmentations commonly used with VQGAN+CLIP was called “cutouts”: it blacks out everything in the image except a randomly selected rectangle. This obviously isn’t identical to the multi-resolution thing, but one might argue that it achieves some of the same goals: both augmentations effectively force the method to “use” the low-frequency Fourier modes, creating interesting global structure rather than a homogeneous/incoherent splash of “textured” noise.)
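A rough sketch of that augmentation, as described above (the cutout implementations that actually circulated varied quite a bit in the details, e.g. some cropped-and-resized rather than masked):

```python
import random
import torch

def cutout(image):
    """Black out everything in an (N, C, H, W) image except one random rectangle."""
    _, _, h, w = image.shape
    ch, cw = random.randint(h // 4, h), random.randint(w // 4, w)
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    mask = torch.zeros_like(image)
    mask[..., top:top + ch, left:left + cw] = 1.0
    return image * mask
```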