AI Safety 101: Introduction to Vision Interpretability


Last year, we taught an AGI Safety Fundamentals (AGISF) course to Master’s students at a French university (ENS Ulm). The course consisted of 10 two-hour sessions and drew inspiration from the official AGISF curriculum. We are now writing the corresponding textbook, which aims to give a succinct overview of the essential material covered in the sessions. Here, we are sharing the section on vision interpretability, both to obtain feedback and because we think it may be valuable outside the course for anyone trying to learn more about interpretability. Sections on other topics, such as reward misspecification, goal misgeneralization, scalable oversight, transformer interpretability, and governance, are still in progress, and we plan to share them in the future.

We welcome any comments, corrections, or suggestions you may have to improve this work. This is our first attempt at creating pedagogical content, so there is undoubtedly room for improvement. Please don’t hesitate to contact us at jeanne.salle@yahoo.fr and crsegerie@gmail.com.

The PDF can be found here.

It introduces the following papers:

Feature Visualization (Olah, et al., 2017)
Activation Atlas (Carter, et al., 2019)
Zoom In: An Introduction to Circuits (Olah, et al., 2020)
An Overview of Early Vision in InceptionV1 (Olah, et al., 2020)
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization (Selvaraju, et al., 2017)
Understanding RL Vision (Hilton, et al., 2020)
Network Dissection: Quantifying Interpretability of Deep Visual Representations (Bau, et al., 2017)
Compositional Explanations of Neurons (Mu & Andreas, 2020)
Natural Language Descriptions of Deep Visual Features (Hernandez, et al., 2022)

Other papers briefly mentioned:

The Building Blocks of Interpretability (Olah, et al., 2018)
Sanity Checks for Saliency Maps (Adebayo, et al., 2018)
Multimodal Neurons in Artificial Neural Networks (Goh, et al., 2021)

The associated deck of slides can be found here.

Thanks to Agatha Duzan for her useful feedback.