They also document that this multimodality enables “typographic attacks”: labeling an object with a piece of text causes the network to misclassify the object as whatever the text names.
“These are not the droids you are looking for.”
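To make the mechanism concrete, here is a toy sketch of a typographic attack. It is not the actual network: the two-class setup, the `embed_image` function, and the `text_weight` parameter are all hypothetical, chosen only to illustrate the assumption that text written in an image contributes to the same embedding as the object's appearance, and can outweigh it.

```python
import math

# Toy stand-in for a multimodal classifier. This is NOT the real model;
# every name and number here is a hypothetical illustration.
CLASSES = ["apple", "ipod"]


def one_hot(name):
    """Unit vector pointing at the given class."""
    return [1.0 if c == name else 0.0 for c in CLASSES]


def embed_image(visual_object, written_text=None, text_weight=2.0):
    """Hypothetical image embedding: visual evidence for the object,
    plus (assumed stronger) evidence from any text written on it."""
    vec = one_hot(visual_object)
    if written_text is not None:
        text_vec = one_hot(written_text)
        vec = [v + text_weight * t for v, t in zip(vec, text_vec)]
    return vec


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_vec):
    """Pick the class whose embedding is most similar to the image."""
    return max(CLASSES, key=lambda c: cosine(image_vec, one_hot(c)))


plain = classify(embed_image("apple"))
attacked = classify(embed_image("apple", written_text="ipod"))
print(plain, attacked)
```

In this toy model the unlabeled apple is classified as an apple, while the same apple with "ipod" written on it is classified as an iPod, because the written text dominates the shared embedding. Lowering `text_weight` below 1.0 makes the attack fail, which is the sense in which the vulnerability depends on how strongly the network responds to text.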