gwern comments on [AN #142]: The quest to understand a network well enough to reimplement it by hand

gwern 17 Mar 2021 17:38 UTC
LW: 11 AF: 7

It is quite possible that CLIP “knows” that the image contains a Granny Smith apple with a piece of paper saying “iPod”, but when asked to complete the caption with a single class from the ImageNet classes, it ends up choosing “iPod” instead of “Granny Smith”. I’d caution against saying things like “CLIP thinks it is looking at an iPod”; this seems like too strong a claim given the evidence that we have right now.

Yes, it’s already been solved. These are ‘attacks’ only in the most generous interpretation possible (since CLIP does know the difference), and the fact that CLIP can read text in images and, arguably, correctly note the semantic similarity in its embeddings is to its considerable credit. As the CLIP authors note, some queries benefit from ensembling or from more context than a single-word class name (such as prefixing “A photograph of a ”), and class names can be highly ambiguous: in ImageNet, the class name “crane” could refer to the bird or to construction equipment, and the Oxford-IIIT Pet dataset labels one class “boxer”, which without context could as easily mean the athlete as the dog breed.
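To make the ensembling point concrete, here is a minimal sketch of prompt-ensembled zero-shot classification. It is an illustration under stated assumptions, not CLIP's actual code: `embed_text` is a hypothetical stand-in for whatever text encoder is in use, the image embedding is assumed to be already computed, and the template strings are whatever context prompts you choose (only the “A photograph of a ” prefix comes from the discussion above).

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def class_embedding(
    class_name: str,
    templates: Sequence[str],
    embed_text: Callable[[str], Sequence[float]],  # hypothetical text encoder
) -> Vector:
    """Average the unit-norm embeddings of every prompt for one class."""
    if not templates:
        raise ValueError("need at least one prompt template")
    # Each template supplies context around the bare class name,
    # e.g. "A photograph of a {}." rather than just "{}".
    embs = [normalize(embed_text(t.format(class_name))) for t in templates]
    dim = len(embs[0])
    mean = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    return normalize(mean)


def zero_shot_scores(
    image_emb: Sequence[float],
    class_names: Sequence[str],
    templates: Sequence[str],
    embed_text: Callable[[str], Sequence[float]],
) -> Dict[str, float]:
    """Cosine similarity between the image and each ensembled class embedding."""
    img = normalize(image_emb)
    scores = {}
    for name in class_names:
        cls = class_embedding(name, templates, embed_text)
        scores[name] = sum(a * b for a, b in zip(img, cls))
    return scores
```

The predicted class is then `max(scores, key=scores.get)`. Changing the templates changes the question being asked of the model, which is why a bare class-name prompt can elicit “iPod” while a more descriptive one need not.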
What links here?
- [AN #143]: How to make embedded agents that reason probabilistically about their environments by Rohin Shah (24 Mar 2021 17:20 UTC; 13 points)
- Rohin Shah's comment on [AN #142]: The quest to understand a network well enough to reimplement it by hand by Rohin Shah (17 Mar 2021 19:57 UTC; 6 points)
- Rohin Shah 17 Mar 2021 19:57 UTC
  LW: 6 AF: 5
  Ah excellent, thanks for the links. I’ll send the Twitter thread in the next newsletter with the following summary:
  Last week I speculated that CLIP might “know” that a textual adversarial example is a “picture of an apple with a piece of paper saying an iPod on it” and that the zero-shot classification prompt is preventing it from demonstrating this knowledge. Gwern Branwen [commented](https://www.alignmentforum.org/posts/JGByt8TrxREo4twaw/an-142-the-quest-to-understand-a-network-well-enough-to?commentId=keW4DuE7G4SZn9h2r) to link me to this Twitter thread as well as this [YouTube video](https://youtu.be/Rk3MBx20z24) in which better prompt engineering significantly reduces these textual adversarial examples, demonstrating that CLIP does “know” that it’s looking at an apple with a piece of paper on it.