I think the tax will be surprisingly low. Most of the time people or chatbots see images, they don’t actually need to see much detail. As an analogy, I know multiple people who are almost legally blind (they see at roughly 5-10x lower resolution than normal), and they can have long conversations about the physical world (“Where’s my cup?” “[points] Over there”, etc.) without anyone noticing that their vision is subpar.
For example, if you ask a chatbot “Suggest some landscaping products to use on my property [compressed photo of the property]”, I claim the chatbot will be able to respond almost as well as if you had given it [the original high-resolution photo], even though the former image is ~1 kB vs the latter’s ~500 kB. Defense feels pretty tractable if each input image is only ~1 kB.
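For concreteness, here is a minimal sketch (my own illustration, not something from the post) of what squeezing a photo down to a ~1 kB budget could look like, using Pillow; the filename is just a placeholder:

```python
from io import BytesIO
from PIL import Image

def compress_to_budget(path: str, budget_bytes: int = 1024) -> bytes:
    """Downscale and JPEG-encode an image until it fits within budget_bytes."""
    img = Image.open(path).convert("RGB")
    buf = BytesIO()
    for side in (256, 128, 64, 32):          # progressively smaller resolutions
        small = img.copy()
        small.thumbnail((side, side))        # shrink in place, preserving aspect ratio
        for quality in (50, 30, 10, 5):      # progressively harsher JPEG quality
            buf = BytesIO()
            small.save(buf, format="JPEG", quality=quality)
            if buf.tell() <= budget_bytes:
                return buf.getvalue()
    return buf.getvalue()                    # best effort if the budget is never met

compressed = compress_to_budget("property.jpg")  # hypothetical filename
print(f"{len(compressed)} bytes")
```

Even at these settings, the model would still get enough information to answer a coarse question like the landscaping one.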
I think this is an interesting point. We are actually conducting some follow-up work to see how robust our attacks are to various additional “defensive” perturbations (e.g. downscaling, adding noise). As Matt notes, when doing these experiments it is important to also measure how such perturbations affect the model’s general vision-language performance. My prior right now is that this kind of technique may be able to defend against the L-infinity-constrained images, but probably not against the moving-patch attacks, which exhibited higher-level features. In general, adversarial attacks are a cat-and-mouse game, so I expect that if we can show you can defend using techniques like this, a new training scheme will come along that produces adversarial images robust to such defenses. It is also worth noting that most VLMs only accept small, low-resolution images already. For example LLaVA (with LLaMA 13B), which is state of the art among open-source models, only accepts ~200×200-pixel images, so the example above is not necessarily a fair one.
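To make the perturbations being discussed concrete, here is a rough sketch of a downscale-and-noise preprocessing step applied to an input before it reaches the VLM. This is my own illustration of the idea, not the authors’ evaluation code, and the filename and parameter values are placeholders:

```python
import numpy as np
from PIL import Image

def downscale_upscale(img: Image.Image, factor: int = 2) -> Image.Image:
    """Round-trip through a lower resolution to destroy fine-grained pixel structure."""
    w, h = img.size
    small = img.resize((w // factor, h // factor), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)

def add_gaussian_noise(img: Image.Image, sigma: float = 8.0) -> Image.Image:
    """Add pixel-wise Gaussian noise, hopefully drowning out a small L-infinity perturbation."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

# Evaluation would compare the VLM on defended vs. undefended adversarial images
# *and* on clean images, to check that benign performance does not drop much.
defended = add_gaussian_noise(downscale_upscale(Image.open("adversarial.png").convert("RGB")))
```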
I expect lossy image compression to perform better than downsampling or noising, because it directly destroys the information humans don’t notice while keeping the information they do. Especially if we develop stronger lossy encoders built on vision models, it really feels like we should be able to optimize our encodings to destroy the vast majority of information humans don’t perceive.
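As a simple illustration of that kind of defense (my own sketch, not an established method): a JPEG round-trip applied to every input image before the model sees it, discarding detail that human perception ignores:

```python
from io import BytesIO
from PIL import Image

def jpeg_roundtrip(img: Image.Image, quality: int = 25) -> Image.Image:
    """Encode to JPEG at low quality and decode again, discarding imperceptible detail."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```

A learned lossy codec trained with a vision model in the loop could in principle replace the JPEG step and destroy even more of the human-unnoticed information.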