Nice paper! I found it quite insightful. Here are some key extracts:
Improving adversarial robustness by classifying several down-sampled noisy images at once:
“Drawing inspiration from biology [eye saccades], we use multiple versions of the same image at once, downsampled to lower resolutions and augmented with stochastic jitter and noise. We train a model to classify this channel-wise stack of images simultaneously. We show that this by default yields gains in adversarial robustness without any explicit adversarial training.”
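To make the idea concrete, here is a minimal sketch of what building such a channel-wise stack might look like. The paper only describes the scheme at a high level, so the specific resolutions, noise scale, and jitter range below are illustrative assumptions, not the authors' exact pipeline:

```python
import torch
import torch.nn.functional as F

def multi_resolution_stack(img, resolutions=(32, 16, 8), noise_std=0.05, max_jitter=2):
    """img: [B, 3, H, W] in [0, 1]. Returns [B, 3 * len(resolutions), H, W].

    Hypothetical helper: resolutions, noise_std, and max_jitter are
    illustrative choices, not values from the paper.
    """
    _, _, H, W = img.shape
    copies = []
    for r in resolutions:
        # Stochastic jitter: shift the image by a few pixels.
        dx, dy = (int(torch.randint(-max_jitter, max_jitter + 1, (1,))) for _ in range(2))
        x = torch.roll(img, shifts=(dy, dx), dims=(2, 3))
        # Downsample to a lower resolution, then back up so all copies align spatially.
        x = F.interpolate(x, size=(r, r), mode="bilinear", align_corners=False)
        x = F.interpolate(x, size=(H, W), mode="bilinear", align_corners=False)
        # Stochastic noise on each copy.
        x = (x + noise_std * torch.randn_like(x)).clamp(0, 1)
        copies.append(x)
    # The classifier's first layer then takes 3 * len(resolutions) input channels.
    return torch.cat(copies, dim=1)
```

The key point is that the classifier sees all the noisy, jittered, multi-resolution views simultaneously as extra input channels, so an adversarial perturbation has to survive every view at once.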
Improving adversarial robustness by using an ensemble of intermediate layer predictions:
“Using intermediate layer predictions. We show experimentally that a successful adversarial attack on a classifier does not fully confuse its intermediate layer features (see Figure 5). An image of a dog attacked to look like e.g. a car to the classifier still has predominantly dog-like intermediate layer features. We harness this de-correlation as an active defense by CrossMax ensembling the predictions of intermediate layers. This allows the network to dynamically respond to the attack, forcing it to produce consistent attacks over all layers, leading to robustness and interpretability.”
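For intuition, here is a minimal sketch of a CrossMax-style aggregation over per-layer logits. The extract above does not spell out the algorithm, so the normalization steps below (subtracting the per-layer max, then the per-class max, then taking a per-class median across layers) are one plausible reading rather than a verified reproduction of the paper's definition:

```python
import torch

def crossmax(layer_logits):
    """layer_logits: [L, C] — one logit vector per intermediate-layer prediction head.

    Sketch of a robust consensus across layers; see the paper for the
    authoritative CrossMax definition.
    """
    # Remove each layer's own maximum so no single layer dominates by scale.
    z = layer_logits - layer_logits.max(dim=1, keepdim=True).values
    # Remove each class's maximum across layers so no single class can be
    # pushed up by fooling just one layer.
    z = z - z.max(dim=0, keepdim=True).values
    # Median across layers: the winning class must score well in most layers.
    return z.median(dim=0).values  # shape [C]
```

Under this reading, an attack can no longer succeed by flipping the final layer's prediction alone; it has to produce consistently car-like features across a majority of layers, which matches the robustness argument quoted above.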