[AN #62] Are adversarial examples caused by real but imperceptible features?
Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by commenting on this post.
Audio version here (may not be up yet).
Highlights
Call for contributors to the Alignment Newsletter (Rohin Shah): I’m looking for content creators and a publisher for this newsletter! Apply by September 6.
Adversarial Examples Are Not Bugs, They Are Features (Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom et al) (summarized by Rohin and Cody): Distill published a discussion of this paper. This highlights section will cover the full discussion; all of these summaries and opinions are meant to be read together.
Consider two possible explanations of adversarial examples. First, they could be caused because the model “hallucinates” a signal that is not useful for classification, and it becomes very sensitive to this feature. We could call these “bugs”, since they don’t generalize well. Second, they could be caused by features that do generalize to the test set, but can be modified by an adversarial perturbation. We could call these “non-robust features” (as opposed to “robust features”, which can’t be changed by an adversarial perturbation). The authors argue that at least some adversarial perturbations fall into the second category of being informative but sensitive features, based on two experiments.
If the “hallucination” explanation were true, the hallucinations would presumably be caused by the training process, the choice of architecture, or the size of the dataset, but not by the type of data. So one thing to do would be to see if we can construct a dataset such that a model trained on that dataset is already robust, without adversarial training. The authors do this in the first experiment. They take an adversarially trained robust classifier, and create images whose features (final-layer activations of the robust classifier) match the features of some unmodified input. The generated images only have robust features because the original classifier was robust, and in fact models trained on this dataset are automatically robust.
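To make the feature-matching step concrete, here is a minimal sketch of this kind of optimization, assuming a PyTorch robust_model with a hypothetical features method that returns its final-layer activations; the optimizer, step count, and learning rate are illustrative choices, not the paper’s exact setup.

```python
# Minimal sketch of constructing a "robustified" training example, under the
# assumptions stated above. robust_model.features is a hypothetical method
# returning final-layer activations of an adversarially trained classifier.
import torch

def make_robust_example(robust_model, x_target, x_init, steps=1000, lr=0.1):
    """Optimize an image so its robust-feature activations match x_target's;
    the result is paired with x_target's original label."""
    with torch.no_grad():
        target_feats = robust_model.features(x_target)  # activations of an unmodified input
    x = x_init.clone().requires_grad_(True)             # start from e.g. random noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm(robust_model.features(x) - target_feats)
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0, 1)                              # keep pixels in a valid range
    return x.detach()
```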
If the “non-robust features” explanation were true, then it should be possible for a model to learn on a dataset containing only non-robust features (which will look nonsensical to humans) and still generalize to a normal-looking test set. In the second experiment (henceforth WrongLabels), the authors construct such a dataset. Their hypothesis is that adversarial perturbations work by introducing non-robust features of the target class. So, to construct their dataset, they take an image x with original label y, adversarially perturb it towards some class y’ to get image x’, and then add (x’, y’) to their dataset (even though to a human x’ looks like class y). They have two versions of this: in RandLabels, the target class y’ is chosen randomly, whereas in DetLabels, y’ is chosen to be y + 1. For both datasets, if you train a new model on the dataset, you get good performance on the original test set, showing that the “non-robust features” do generalize.
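As a rough illustration of the WrongLabels construction, here is a sketch using a simple targeted L2 PGD-style attack on a single example; the attack, its hyperparameters, and the helper names are assumptions for illustration rather than the paper’s exact procedure.

```python
# Sketch of building one (x', y') pair for a WrongLabels-style dataset,
# under the assumptions stated above.
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, y_target, eps=0.5, step=0.1, iters=40):
    """Perturb a single example x (shape [1, C, H, W]) within an L2 ball of
    radius eps so that the model predicts y_target."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_target)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - step * grad / (grad.norm() + 1e-12)  # move towards the target class
            delta = (x_adv - x).renorm(p=2, dim=0, maxnorm=eps)  # project back into the L2 ball
            x_adv = (x + delta).clamp(0, 1)
    return x_adv.detach()

def make_wronglabels_example(model, x, y, num_classes, rand_labels=True):
    """Return (x', y'): x perturbed towards y', then labeled as y'."""
    y_target = torch.randint(num_classes, y.shape) if rand_labels else (y + 1) % num_classes
    x_adv = targeted_pgd(model, x, y_target)
    return x_adv, y_target  # looks like class y to a human, labeled y'
```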
Rohin’s opinion: I buy this hypothesis. It’s a plausible explanation for brittleness towards adversarial noise (“because non-robust features are useful to reduce loss”), and why adversarial examples transfer across models (“because different models can learn the same non-robust features”). In fact, the paper shows that architectures that did worse in the WrongLabels experiment (and so presumably are bad at learning non-robust features) are also the ones to which adversarial examples transfer the least. I’ll leave the rest of my opinion to the opinions on the responses.
Read more: Paper and Author response
Response: Learning from Incorrectly Labeled Data (Eric Wallace): This response notes that all of the experiments are of the form: create a dataset D that is consistent with a model M; then, when you train a new model M’ on D you get the same properties as M. Thus, we can interpret these experiments as showing that model distillation can work even with data points that we would naively think of as “incorrectly labeled”. This is a more general phenomenon: we can take an MNIST model, select only the examples for which the top prediction is incorrect (labeled with these incorrect top predictions), and train a new model on that—and get nontrivial performance on the original test set, even though the new model has never seen a “correctly labeled” example.
Rohin’s opinion: I definitely agree that these results can be thought of as a form of model distillation. I don’t think this detracts from the main point of the paper: the reason model distillation works even with incorrectly labeled data is probably because the data is labeled in such a way that it incentivizes the new model to pick out the same features that the old model was paying attention to.
Response: Robust Feature Leakage (Gabriel Goh): This response investigates whether the datasets in WrongLabels could have had robust features. Specifically, it checks whether a linear classifier over provably robust features trained on the WrongLabels dataset can get good accuracy on the original test set. This shouldn’t be possible since WrongLabels is meant to correlate only non-robust features with labels. It finds that you can get some accuracy with RandLabels, but you don’t get much accuracy with DetLabels.
The original authors can actually explain this: intuitively, you get accuracy with RandLabels because it’s less harmful to choose labels randomly than to choose them explicitly incorrectly. With random labels on unmodified inputs, robust features should be completely uncorrelated with the labels. However, with random labels followed by an adversarial perturbation towards the label, there can be some correlation, because the adversarial perturbation can add “a small amount” of the robust feature. In DetLabels, by contrast, the labels are systematically wrong, and so the robust features are negatively correlated with the true label; this correlation can be reduced by an adversarial perturbation, but it can’t be reversed (otherwise the feature wouldn’t be robust).
Rohin’s opinion: The original authors’ explanation of these results is quite compelling; it seems correct to me.
Response: Adversarial Examples are Just Bugs, Too (Preetum Nakkiran): The main point of this response is that adversarial examples can be bugs too. In particular, if you construct adversarial examples that explicitly don’t transfer between models, and then run the WrongLabels experiment with such adversarial perturbations, then the resulting model doesn’t perform well on the original test set (and so it must not have learned non-robust features).
It also constructs a data distribution where every useful feature of the optimal classifier is guaranteed to be robust, and shows that we can still get adversarial examples with a typical model, showing that it is not just non-robust features that cause adversarial examples.
In their response, the authors clarify that they didn’t intend to claim that adversarial examples could not arise due to “bugs”, just that “bugs” were not the only explanation. In particular, they say that their main thesis is “adversarial examples will not just go away as we fix bugs in our models”, which is consistent with the point in this response.
Rohin’s opinion: Amusingly, I think I’m more bullish on the original paper’s claims than the authors themselves. It’s certainly true that adversarial examples can arise from “bugs”: if your model overfits to your data, then you should expect adversarial examples along the overfitted decision boundary. The dataset constructed in this response is a particularly clean example: the optimal classifier would have an accuracy of 90%, but the model is trained to accuracy 99.9%, which means it must be overfitting.
However, I claim that with large and varied datasets, neural nets are typically not in the regime where they overfit to the data, and so the prevalence of “bugs” in the model will decrease. (You certainly can get a neural net to be “buggy”, e.g. by randomly labeling the data, but if you’re using real data with a natural task then I don’t expect it to happen to a significant degree.) Nonetheless, adversarial examples persist, because the features that models use are not the ones that humans use.
It’s also worth noting that this experiment strongly supports the hypothesis that adversarial examples transfer because they are real features that generalize to the test set.
Response: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’ (Justin Gilmer et al): This response argues that the results in the original paper are simply a consequence of a generally accepted principle: “models lack robustness to distribution shift because they latch onto superficial correlations in the data”. This isn’t just about L_p norm ball adversarial perturbations: for example, one recent paper shows that if a model is only given access to the high-frequency components of images (which look uniformly grey to humans), it can still get above 50% accuracy. In fact, when we use adversarial training to make a model robust to L_p perturbations, it pays attention to different non-robust features and becomes more vulnerable to e.g. low-frequency fog corruption. The authors call for adversarial example researchers to move beyond L_p perturbations, think about the many different ways models can be fragile, and make models more robust to distributional shift.
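For intuition about what “high-frequency components” means here, the sketch below strips the low spatial frequencies from a grayscale image with a hard FFT cutoff; the cutoff radius and the exact filter are illustrative assumptions, not the cited paper’s setup.

```python
# Illustrative high-pass filter: keep only high spatial frequencies of an image.
import numpy as np

def high_pass(image, radius=8):
    """Zero out low spatial frequencies of a 2D grayscale image."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    low_freq = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    spectrum[low_freq] = 0                    # drop the low-frequency band
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
```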
Rohin’s opinion: I strongly agree with the worldview behind this response, and especially the principle they identified. I didn’t know this was a generally accepted principle, though of course I am not an expert on distributional robustness.
One thing to note is what is meant by “superficial correlation” here. It means a correlation that really does exist in the dataset, that really does generalize to the test set, but that doesn’t generalize out of distribution. A better term might be “fragile correlation”. All of the experiments so far have been looking at within-distribution generalization (aka generalization to the test set), and are showing that non-robust features do generalize within-distribution. This response is arguing that there are many such non-robust features that will generalize within-distribution but will not generalize under distributional shift, and we need to make our models robust to all of them, not just L_p adversarial perturbations.
Response: Two Examples of Useful, Non-Robust Features (Gabriel Goh): This response studies linear features, since we can analytically compute their usefulness and robustness. It plots the singular vectors of the data as features, and finds that such features are either robust and useful, or non-robust and not useful. However, you can get useful, non-robust features by ensembling or contamination (see response for details).
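As a rough illustration (not the response’s actual code), the sketch below treats the projection onto each right singular vector of the data as a linear feature and measures one simple notion of its usefulness, namely its correlation with binary ±1 labels; the response’s precise definitions of usefulness and robustness are more careful than this.

```python
# Sketch: usefulness of singular-direction features as correlation with labels.
import numpy as np

def singular_direction_usefulness(X, y):
    """X: (n_samples, n_features) data matrix; y: labels in {-1, +1}.
    Returns the right singular vectors and the correlation of each
    projected feature with the labels."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    feats = X @ Vt.T                               # feature i is <v_i, x>
    feats = feats / np.linalg.norm(feats, axis=0)  # normalize each feature column
    return Vt, feats.T @ y / np.linalg.norm(y)
```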
Response: Adversarially Robust Neural Style Transfer (Reiichiro Nakano): The original paper showed that adversarial examples don’t transfer well to VGG, and that VGG doesn’t tend to learn the same non-robust features as a ResNet. Separately, VGG works particularly well for style transfer. Perhaps style transfer looks better to humans precisely because VGG doesn’t capture non-robust features as well? This response and the original authors’ comment investigate this hypothesis in more detail and find that it seems broadly supported, though there are still finicky details to be worked out.
Rohin’s opinion: This is an intriguing empirical fact. However, I don’t really buy the theoretical argument that style transfer works because it doesn’t use non-robust features, since I would typically expect that a model that doesn’t use L_p-fragile features would instead use features that are fragile or non-robust in some other way.
Technical AI alignment
Problems
Problems in AI Alignment that philosophers could potentially contribute to (Wei Dai): Exactly what it says. The post is short enough that I’m not going to summarize it—it would be as long as the original.
Iterated amplification
Delegating open-ended cognitive work (Andreas Stuhlmüller): This is the latest explanation of the approach Ought is experimenting with: Factored Evaluation (in contrast to Factored Cognition (AN #36)). With Factored Cognition, the idea was to recursively decompose a high-level task until you reach subtasks that can be directly solved. Factored Evaluation still does recursive decomposition, but now it is aimed at evaluating the work of experts, along the same lines as recursive reward modeling (AN #34).
This shift means that Ought is attacking a very natural problem: how to effectively delegate work to experts while avoiding principal-agent problems. In particular, we want to design incentives such that untrusted experts under the incentives will be as helpful as experts intrinsically motivated to help. The experts could be human experts or advanced ML systems; ideally our incentive design would work for both.
Currently, Ought is running experiments with reading comprehension on Wikipedia articles. The experts get access to the article while the judge does not, but the judge can check whether particular quotes come from the article. They would like to move to tasks that have a greater gap between the experts and the judge (e.g. allowing the experts to use Google), and to tasks that are more subjective (e.g. whether the judge should get Lasik surgery).
Rohin’s opinion: The switch from Factored Cognition to Factored Evaluation is interesting. While it does make it more relevant outside the context of AI alignment (since principal-agent problems abound outside of AI), it still seems like the major impact of Ought is on AI alignment, and I’m not sure what the difference is there. In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal. The switch would be useful if you expect the reinforcement learning to work significantly better than imitation learning.
However, with Factored Evaluation, the agent that you train iteratively is one that must be good at evaluating tasks, and then you’d need another agent that actually performs the task (or you could train the same agent to do both). In contrast, with Factored Cognition you only need an agent that is performing the task. If the decompositions needed to perform the task are different from the decompositions needed to evaluate the task, then Factored Cognition would presumably have an advantage.
Miscellaneous (Alignment)
Clarifying some key hypotheses in AI alignment (Ben Cottier et al): This post (that I contributed to) introduces a diagram that maps out important and controversial hypotheses for AI alignment. The goal is to help researchers identify and more productively discuss their disagreements.
Near-term concerns
Privacy and security
Evaluating and Testing Unintended Memorization in Neural Networks (Nicholas Carlini et al)
Read more: The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
Machine ethics
Towards Empathic Deep Q-Learning (Bart Bussmann et al): This paper introduces the empathic DQN, which is inspired by the golden rule: “Do unto others as you would have them do unto you”. Given a specified reward, the empathic DQN optimizes for a weighted combination of the specified reward, and the reward that other agents in the environment would get if they were a copy of the agent. They show that this results in resource sharing (when there are diminishing returns to resources) and avoiding conflict in two toy gridworlds.
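As a minimal sketch of the reward combination being described (the weighting scheme and the perspective-taking helper below are illustrative assumptions, not the paper’s exact formulation):

```python
# Sketch of an "empathic" reward signal: mix the specified reward with the
# reward the agent estimates it would receive in each other agent's position.
def empathic_reward(own_reward, other_agent_states, estimated_self_reward, beta=0.5):
    """own_reward: specified reward this step; other_agent_states: observed
    states of other agents; estimated_self_reward: hypothetical function
    giving the agent's own reward if it were in that state; beta: empathy weight."""
    empathy_term = sum(estimated_self_reward(state) for state in other_agent_states)
    return (1 - beta) * own_reward + beta * empathy_term
```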
Rohin’s opinion: This seems similar in spirit to impact regularization methods: the hope is that this is a simple rule that prevents catastrophic outcomes without having to solve all of human values.
AI strategy and policy
AI Algorithms Need FDA-Style Drug Trials (Olaf J. Groth et al)
Other progress in AI
Critiques (AI)
Evidence against current methods leading to human level artificial intelligence (Asya Bergal and Robert Long): This post briefly lists arguments that current AI techniques will not lead to high-level machine intelligence (HLMI), without taking a stance on how strong these arguments are.
News
Ought: why it matters and ways to help (Paul Christiano): This post discusses the work that Ought is doing, and makes a case that it is important for AI alignment (see the summary for Delegating open-ended cognitive work above). Readers can help Ought by applying for their web developer role, by participating in their experiments, and by donating.
Project Proposal: Considerations for trading off capabilities and safety impacts of AI research (David Krueger): This post calls for a thorough and systematic evaluation of whether AI safety researchers should worry about the impact of their work on capabilities.