[AN #151]: How sparsity in the final layer makes a neural net debuggable
Link post
Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
Note: The newsletter will be slowing down a bit over the next month, as I’ll be fairly busy. I’m currently aiming to produce a newsletter every two weeks, but I don’t know if even that will happen.
HIGHLIGHTS
Debuggable Deep Networks: Usage and Evaluation (Eric Wong, Shibani Santurkar et al) (summarized by Rohin): One simple approach to making neural nets more understandable is to make just the final layer sparse. Neurons in the penultimate layer can be visualized using existing techniques, and the sparsity of the final layer means that it is relatively easy to understand how they are combined to make predictions. For example, in ImageNet, the final logit for an individual class becomes a weighted combination of around 20 features, instead of the 2048 you would get with a dense model. The authors’ core claim is that this makes the model more understandable and debuggable, at the cost of a small drop in performance (about 1-5 percentage points). They show this using several experiments, many with real humans (a minimal code sketch of the setup follows the list below):
1. The most basic test is simulation: can humans predict what the model would say (regardless of whether or not it is correct)? Unfortunately, if you show people a picture of an airplane, they are probably going to predict that the model says “airplane”, on priors. To avoid this sort of prior knowledge, they first sample a class like “airplane” that they don’t reveal. Instead, they reveal feature visualizations of five randomly chosen features that the model uses to identify images of that class. They then choose three images and ask humans which of the three images will have the highest probability of being assigned to that class, according to the model. They find that when using a sparse final layer, humans have non-trivial performance (72% when the best image really is from the sampled class, and 57% when the best image is from some different class), whereas with a dense final layer they are only slightly better than random chance (44% and 31%, where random chance would be 33%).
2. They can study biases and spurious correlations in models. For example, Toxic-BERT identifies toxic sentences, but does so by searching for identity groups like “christianity”. Debiased-BERT was meant to solve this, but by looking at the feature visualizations (word clouds) below a sparse decision layer, they find that it simply learns a strong negative weight for identity groups. Thus, they are able to fool the model into thinking a toxic comment is non-toxic simply by adding an identity group like “christianity” somewhere in the sentence. (This also applies to the version that uses a dense final layer.)
3. The identified biases or spurious correlations can then be used to generate counterfactuals: for example, in a sentiment analysis system, they can visualize word clouds that represent positive and negative influences on the final sentiment reported by the model. Then, by simply exchanging a positive word for a negative word (or vice versa), they can flip the label that the model assigns to the sentence. (Usually this is correct behavior – if you change “a marvel like you’ve never seen” to “a failure like you’ve never seen”, the sentiment really is different. The point is that the sparse model allows you to create these examples automatically.)
4. In cases where the model makes a mistake, can humans identify why the model made a mistake? The authors note that over 30% of misclassifications can be explained by a single problematic feature, i.e. if you intervene to set that feature to zero, then the model no longer makes a mistake. So one way to check human understanding is to see whether they can reproduce this misclassification. Specifically, we take some image whose true label is y* but which the model incorrectly labels as y’. We then take the highest-activating feature in support of y* and the corresponding feature for y’, and ask humans which of the two features is more present in the image. They find that annotators prefer the feature for y’ 60% of the time – more than random chance (50%). Since the annotators don’t know which feature corresponds to the ground truth and which corresponds to the incorrect model prediction, they probably were not using prior knowledge in answering this question. Thus, doing better than random suggests that even according to humans the feature that the model picked up on really was present in the image.
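To make the setup concrete, here is a minimal sketch of fitting a sparse decision layer on top of frozen penultimate-layer features, along with the kind of single-feature ablation described in point 4. This is not the authors’ exact pipeline (they use an elastic-net solver); plain L1-regularized logistic regression from scikit-learn stands in, and the file names and variable names (`features.npy`, `labels.npy`, `penultimate_features`, `sparse_head`) are purely illustrative.

```python
# Minimal sketch, not the authors' exact method: fit a sparse linear head on
# frozen penultimate-layer features, then inspect/ablate individual features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical precomputed activations of the layer before the classifier:
# shape (n_images, 2048), with integer class labels of shape (n_images,).
penultimate_features = np.load("features.npy")
labels = np.load("labels.npy")

sparse_head = LogisticRegression(
    penalty="l1",
    C=0.01,          # smaller C = stronger sparsity pressure on the weights
    solver="saga",
    max_iter=1000,
)
sparse_head.fit(penultimate_features, labels)

# How many features each class actually relies on (tens rather than 2048).
nonzero_per_class = (sparse_head.coef_ != 0).sum(axis=1)
print("mean features per class:", nonzero_per_class.mean())

# Debugging intervention in the spirit of point 4: ablate the feature that
# contributes most to the predicted class and see if the prediction changes.
x = penultimate_features[0].copy()
pred = sparse_head.predict(x[None, :])[0]
row = int(np.where(sparse_head.classes_ == pred)[0][0])
suspect = int(np.argmax(sparse_head.coef_[row] * x))  # largest contribution
x[suspect] = 0.0
print("before/after ablating feature", suspect, ":",
      pred, sparse_head.predict(x[None, :])[0])
```

In the paper’s experiments, the features surfaced this way are the ones shown to humans as feature visualizations or word clouds; the sketch above only shows the mechanical side of identifying and intervening on them.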
Rohin’s opinion: I liked this paper especially for its experimental design; it seems like it does a good job of keeping human priors from influencing the results. The results themselves are very much a first step, showing that you’ve gotten at least some understanding and interpretability, but ideally we’d do much much better on these axes. For example, if we “understand” the model, one would hope that we’d be able to get scores of 95+% on the simulation experiment (bullet point 1 above), rather than the current 72% / 57%. It might be interesting to have benchmarks that use these sorts of experiments as their evaluation method. Given that this method just uses feature visualization on the penultimate layer, it seems like there should be room for improvement by studying other layers as well.
Editorial note: I summarized this work because I saw and liked the blog post about it. I don’t generally follow the interpretability literature (it’s huge), and so it’s plausible that there are lots of more useful papers that I happen to not have seen. Most of the time, the highlighted papers can at least be understood as “this is what Rohin thinks is most useful for alignment researchers to read within this field”; that’s not the case here.
TECHNICAL AI ALIGNMENT
MESA OPTIMIZATION
Formal Inner Alignment, Prospectus (Abram Demski) (summarized by Rohin): This post outlines a document that the author plans to write in the future, in which he will define the inner alignment problem formally, and suggest directions for future research. I will summarize that document when it comes out, but if you would like to influence that document, check out the post.
AGENT FOUNDATIONS
Agency in Conway’s Game of Life (Alex Flint) (summarized by Rohin): Conway’s Game of Life (GoL) is a simple cellular automaton which is Turing-complete. As a result, it should be possible to build an “artificial intelligence” system in GoL. One way that we could phrase this is: Imagine a GoL board with 10^30 rows and 10^30 columns, where we are able to set the initial state of the top left 10^20 by 10^20 square. Can we set that initial state appropriately such that after a suitable amount of time, the full board evolves to a desired state (perhaps a giant smiley face) for the vast majority of possible initializations of the remaining area?
This requires us to find some setting of the initial 10^20 by 10^20 square that has expandable, steerable influence. Intuitively, the best way to do this would be to build “sensors” and “effectors” to have inputs and outputs and then have some program decide what the effectors should do based on the input from the sensors. The “goal” of the program would then be to steer the world towards the desired state. Thus, this is a framing of the problem of AI (both capabilities and alignment) in GoL, rather than in our native physics.
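For readers unfamiliar with the substrate of this thought experiment, here is a minimal sketch of the standard Game of Life update rule (B3/S23). The board here is tiny and hard-coded for illustration, whereas the post imagines a 10^30 by 10^30 grid; nothing in this snippet is specific to the post’s proposal.

```python
# Minimal sketch of one Game of Life generation (standard B3/S23 rules),
# assuming dead cells beyond the board edge.
import numpy as np

def step(board: np.ndarray) -> np.ndarray:
    """Advance a 2D binary array by one Game of Life generation."""
    padded = np.pad(board, 1)
    # Count live neighbours by summing the eight shifted copies of the board.
    neighbours = sum(
        padded[1 + dy : 1 + dy + board.shape[0], 1 + dx : 1 + dx + board.shape[1]]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step if it has exactly 3 live neighbours,
    # or if it is alive now and has exactly 2.
    return ((neighbours == 3) | ((board == 1) & (neighbours == 2))).astype(int)

# Example: a glider, the classic pattern that propagates influence across the board.
glider = np.zeros((8, 8), dtype=int)
glider[1, 2] = glider[2, 3] = glider[3, 1] = glider[3, 2] = glider[3, 3] = 1
for _ in range(4):
    glider = step(glider)
print(glider)
```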
Rohin’s opinion: With the tower of abstractions we humans have built, we now naturally think in terms of inputs and outputs for the agents we build. This hypothetical seems good for shaking us out of that mindset, as we don’t really know what the analogous inputs and outputs in GoL would be, and so we are forced to consider those aspects of the design process as well.
PREVENTING BAD BEHAVIOR
AXRP Episode 7 - Side Effects (Daniel Filan and Victoria Krakovna) (summarized by Rohin): This podcast goes over the problem of side effects, and impact regularization as an approach to handle this problem. The core hope is that impact regularization would enable “minimalistic” value alignment, in which the AI system may not be doing exactly what we want, but at the very least it will not take high impact actions that could cause an existential catastrophe.
An impact regularization method typically consists of a deviation measure and a baseline. The baseline is what we compare the agent to in order to determine whether it had an “impact”. The deviation measure is used to quantify how much impact there has been, when comparing the state generated by the agent to the one generated by the baseline.
Deviation measures are relatively uncontroversial – there are several possible measures, but they all seem to do relatively similar things, and there aren’t any obviously bad outcomes traceable to problems with the deviation measure. However, that is not the case with baselines. One typical baseline is the inaction baseline, where you compare against what would have happened if the agent had done nothing. Unfortunately, this leads to offsetting: as a simple example, if some food was going to be thrown away and the agent rescues it, it then has an incentive to throw it away again, since that would minimize impact relative to the case where it had done nothing. A solution is the stepwise inaction baseline, which compares to the case where the agent does nothing starting from the previous state (instead of from the beginning of time). However, this then prevents some beneficial offsetting: for example, if the agent opens the door to leave the house, then the agent is incentivized to leave the door open.
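Here is a schematic sketch of how a deviation measure and a baseline combine into a penalty, and how the two baselines above differ. This is not any specific published method: `deviation` is a placeholder (real measures such as relative reachability or attainable utility preservation are more involved), and the helper names in the comments are hypothetical.

```python
# Schematic sketch only: how an impact penalty combines a deviation measure
# with a baseline. The deviation measure here is a trivial placeholder.

def deviation(state, baseline_state):
    """Placeholder deviation measure: how different is `state` from the baseline?"""
    return 0.0 if state == baseline_state else 1.0

def penalized_reward(task_reward, state, baseline_state, lam=1.0):
    """Reward the agent actually optimizes: task reward minus an impact penalty."""
    return task_reward - lam * deviation(state, baseline_state)

# Inaction baseline: roll the environment forward from the *initial* state
# assuming the agent never acts, and compare the current state to that.
#   baseline_state = simulate_noop(initial_state, t)        # hypothetical helper
#
# Stepwise inaction baseline: compare instead to what would have happened if
# the agent had done nothing on *this step only*, starting from the previous state.
#   baseline_state = env_step(previous_state, action=NOOP)  # hypothetical helper

# Example of the offsetting incentive described above: the agent rescued food
# that would otherwise have been thrown away. Under the inaction baseline, the
# "food thrown away" state is the baseline, so keeping the food is penalized.
print(penalized_reward(task_reward=1.0, state="food_saved",
                       baseline_state="food_thrown_away"))  # prints 0.0
```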
As a result, the author is interested in seeing more work on baselines for impact regularization. In addition, she wants to see impact regularization tested in more realistic scenarios. That being said, she thinks that the useful aspect of impact regularization research so far has been in bringing conceptual clarity to what we are trying to do with AI safety, and in identifying interference and offsetting behaviors and the incentives for them.
OTHER PROGRESS IN AI
DEEP LEARNING
Understanding the Lottery Ticket Hypothesis (Alignment Forum) (summarized by Rohin): This post summarizes work on the lottery ticket hypothesis (AN #52), including its implications for AI alignment.
NEWS
Open Call for Advisees and Collaborators, May 2021 (GCRI Website) (summarized by Rohin): GCRI is open to inquiries from potential collaborators or advisees, regardless of background, career point, or geographic location, about any aspect of global catastrophic risk. Participation can range from a short email exchange to more extensive project work.
FEEDBACK
I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
PODCAST
An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.