Alignment Newsletter #30

Link post

Highlights

Learning Complex Goals with Iterated Amplification (Paul Christiano et al): This blog post and the accompanying paper introduce iterated amplification, focusing on how it can be used to define a training signal for tasks that humans cannot perform or evaluate, such as designing a transit system. The key insight is that humans are capable of decomposing even very difficult tasks into slightly simpler tasks. So, in theory, we could provide ground truth labels for an arbitrarily difficult task using a huge tree of humans, each decomposing their own subquestion and handing off new subquestions to other humans, until the questions are easy enough that a human can directly answer them.

We can turn this into an efficient algorithm by having the human decompose the question only once, and using the current AI system to answer the generated subquestions. If the AI isn’t able to answer the subquestions, then the human will get nonsense answers. However, as long as there are questions that the human + AI system can answer but the AI alone cannot, the AI can learn from the answers to those questions. To reduce the reliance on human data, another model is trained to predict the decompositions that the human performs. In addition, some tasks may refer to a large context (e.g. evaluating the safety of a specific rocket design), so they model the human as being able to access small pieces of the context at a time.
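
To make the training setup concrete, here is a minimal sketch of one amplification/distillation round. The interface names (`human.decompose`, `model.answer`, and so on) are illustrative stand-ins rather than the paper's actual code, and the decomposition-predictor and large-context handling mentioned above are omitted:

```python
# Toy sketch of one amplification/distillation round. All object and
# method names are illustrative stand-ins, not the paper's code.

def amplified_answer(question, human, model):
    """The human decomposes the question once; the current model answers
    the subquestions; the human combines the subanswers."""
    if human.can_answer_directly(question):
        return human.answer(question)
    subquestions = human.decompose(question)
    subanswers = [model.answer(q) for q in subquestions]
    return human.combine(question, subquestions, subanswers)

def distillation_round(questions, human, model):
    """Train the model to imitate the amplified (human + model) system,
    which can answer questions the model alone cannot yet answer."""
    for question in questions:
        target = amplified_answer(question, human, model)
        model.update(question, target)
```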

They evaluate on simple algorithmic tasks, such as finding the distance between two nodes in a graph, where they can program an automated decomposition in place of the human (for faster experiments) and where a ground truth solution exists. They compare against supervised learning, which trains a model directly on the ground truth answers (which iterated amplification does not have access to), and find that iterated amplification matches the performance of supervised learning with only slightly more training steps.

Rohin’s opinion: This is my new favorite post/paper for explaining how iterated amplification works, since it very succinctly and clearly makes the case for iterated amplification as a strategy for generating a good training signal. I’d recommend reading the paper in full, as it makes other important points that I haven’t included in the summary.

Note that it does not explain a lot of Paul’s thinking. It explains one particular training method that allows you to train an AI system with a more intelligent and informed overseer.

Relational inductive biases, deep learning, and graph networks (Peter W. Battaglia et al) (summarized by Richard): “Part position paper, part review, and part unification”, this paper emphasises the importance of combinatorial generalisation, which is key to how humans understand the world. It argues for approaches which perform computation over discrete entities and the relations between them, such as graph networks. The authors attribute much of the success of CNNs and RNNs to their relational inductive biases (for example, the bias towards local structure induced by convolutional layers). Graph networks are promising because they can express arbitrary relational biases: any nodes can be connected with any others depending on the structure of the problem. Further, since graph networks learn functions which are reused for all nodes and edges, a trained network can be applied to graphs of any shape and size: a form of combinatorial generalisation.

In this paper’s framework, each ‘graph block’ does computations over an input graph and returns an output graph. The relevant part of the output might be the values of edges, or those of nodes, or ‘global’ properties of the overall graph. Graph blocks can be implemented by standard neural network architectures or more unusual ones such as message-passing neural networks or non-local neural networks. The authors note some major open questions: how to generate the graphs in the first place, and how to adaptively modify them during the course of computation.
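
As a rough sketch (mine, not the paper's code), a single graph block can be pictured as an edge update, then a node update that aggregates incident edges, then a global update, with trivial placeholder functions standing in for the learned networks a real graph network would use:

```python
import numpy as np

# Rough sketch of a single graph block: update edges, then nodes
# (aggregating incident edges), then the global attribute. The phi_*
# functions are trivial placeholders for learned neural networks and
# assume node and edge features share the same dimension.

def graph_block(nodes, edges, senders, receivers, globals_):
    # nodes: [num_nodes, d], edges: [num_edges, d]
    # senders, receivers: integer np.ndarrays of shape [num_edges]
    # globals_: [d_global]

    # 1. Edge update: each edge sees its sender, receiver, and the global.
    new_edges = np.stack([
        phi_edge(edges[k], nodes[senders[k]], nodes[receivers[k]], globals_)
        for k in range(len(edges))
    ])

    # 2. Node update: each node aggregates its incoming (updated) edges.
    new_nodes = np.stack([
        phi_node(nodes[i], new_edges[receivers == i].sum(axis=0), globals_)
        for i in range(len(nodes))
    ])

    # 3. Global update: aggregate over all updated edges and nodes.
    new_globals = phi_global(new_edges.sum(axis=0), new_nodes.sum(axis=0), globals_)
    return new_nodes, new_edges, new_globals

# Placeholder update functions (learned networks in practice).
def phi_edge(e, v_sender, v_receiver, u):
    return e + v_sender + v_receiver

def phi_node(v, aggregated_edges, u):
    return v + aggregated_edges

def phi_global(aggregated_edges, aggregated_nodes, u):
    return u + aggregated_nodes.sum()
```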

Richard’s opinion: This paper is an excellent holistic discussion of graph networks and reasons to think they are promising. I’m glad that it also mentioned the open problems, though, since I think they’re pretty crucial to using graphs in deep learning, and current approaches in this area (e.g. capsule networks’ dynamic control flow) aren’t satisfactory.

Technical AI alignment

Iterated amplification

Learning Complex Goals with Iterated Amplification (Paul Christiano et al): Summarized in the highlights!

Agent foundations

When EDT=CDT, ADT Does Well (Diffractor)

Learning human intent

One-Shot Observation Learning (Leo Pauly et al)

Preventing bad behavior

Safe Reinforcement Learning with Model Uncertainty Estimates (Björn Lütjens et al)

Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence. (Ryan Carey)

Robustness

Learning from Untrusted Data (Charikar, Steinhardt, and Valiant) (summarized by Dan H): This paper introduces semi-verified learning. Here a model learns from a verified or trusted dataset, and from an untrusted dataset that consists of a mixture of legitimate and arbitrary examples. For the untrusted dataset, it is not known which points are legitimate and which are not. This scenario can occur when data is scraped from the internet, recorded by unreliable devices, or gathered through crowdsourcing. Concretely, if a (possibly small) fraction of the scraped data is hand-labeled, then this could count as the trusted set, and the remaining data could be considered the untrusted set. This differs from semi-supervised learning, where there are labeled and unlabeled task-relevant examples; here there are trusted examples and untrusted examples (e.g., labels may be wrong, features may be out-of-distribution, examples may be malicious, and so on). See the full paper for theorems and an algorithm applicable to tasks such as robust density estimation.
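
As a toy illustration of the setting (emphatically not the paper's algorithm, which comes with actual guarantees), picture a small trusted sample alongside a large untrusted mixture, with a naive filtering baseline that uses the trusted sample to discard suspicious untrusted points; all numbers below are made up:

```python
import numpy as np

# Toy illustration of the semi-verified setting, NOT the paper's algorithm.
# A small trusted sample sits alongside a large untrusted mixture of
# legitimate and arbitrary points; a naive baseline filters the untrusted
# data using the trusted sample. All numbers here are made up.

rng = np.random.default_rng(0)

true_mean = np.array([1.0, -2.0])
trusted = true_mean + rng.normal(size=(20, 2))      # small verified sample
legit = true_mean + rng.normal(size=(700, 2))       # legitimate untrusted points
arbitrary = rng.uniform(-10, 10, size=(300, 2))     # arbitrary / malicious points
untrusted = np.vstack([legit, arbitrary])           # unknown which is which

# Naive baseline: keep untrusted points near the trusted sample's mean.
center = trusted.mean(axis=0)
radius = 3 * trusted.std()
kept = untrusted[np.linalg.norm(untrusted - center, axis=1) < radius]

print("mean estimated from trusted sample alone:", center)
print("mean estimated from filtered untrusted data:", kept.mean(axis=0))
```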

Dan H’s opinion: The semi-verified model seems highly useful for various safety-related scenarios including learning with label corruption, poisoned input data, and minimal supervision.

Uncertainty

Do Deep Generative Models Know What They Don’t Know? (Eric Nalisnick et al)

Read more: Section 4.3 of this paper makes similar observations and ameliorates the issue. This paper also demonstrates the fragility of density estimators on out-of-distribution data.

Forecasting

Thoughts on short timelines (Tobias Baumann): This post argues that the probability of AGI in the next ten years is very low, perhaps 1-2%. The primary argument is that to get AGI that quickly, we would need to be seeing research breakthroughs frequently, and empirically this is not the case. This might not be true if we expect progress to accelerate in the future, but there’s no reason to expect this: we won’t get recursive self-improvement before AGI, and there won’t be a huge increase in resources devoted to AI (since there is already so much excitement). We might also say that we are so clueless that we should assign at least 10% to AGI in ten years, but it doesn’t seem that we are that ignorant, and in any case it’s not obvious that a prior should assign 10% to this outcome. Expert surveys place non-negligible probability on AGI within ten years, but in practice the predominant opinion seems to be to confidently dismiss a short-timelines scenario.

Rohin’s opinion: I do think that the probability of AGI in ten years is larger than 1-2%. I suspect my main disagreement is with the conception of what counts as groundbreaking progress. Tobias gives the example of transfer from one board game to many other board games; I think that AGI wouldn’t be able to solve this problem from scratch, and humans are only capable of this because of good priors from all the other learning we’ve done throughout life, especially since games are designed to be human-understandable. If you make a sufficiently large neural net and give it a complex enough environment, some simple unsupervised learning rewards, and the opportunity to collect as much data as a human gets throughout life, maybe that does result in AGI. (I’d guess not, because it does seem like we have some good priors from birth, but I’m not very confident in that.)

Other progress in AI

Exploration

Curiosity and Procrastination in Reinforcement Learning (Nikolay Savinov and Timothy Lillicrap): This blog post explains Episodic Curiosity through Reachability, discussed in AN #28. As a reminder, this method trains a neural net to predict whether two observations were close in time to each other. Recent observations are stored in memory, and the agent is rewarded for reaching states that are predicted to be far away from any observations in memory.
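
A rough reconstruction of the bonus computation, assuming the reachability network has already been trained (`reachability` and all constants below are illustrative, not the authors' code):

```python
# Rough sketch of the episodic curiosity bonus (my reconstruction, not the
# authors' code). `reachability(a, b)` is assumed to be the trained
# comparator network, returning a score near 1 if observations a and b are
# predicted to be close in time; the constants are illustrative.

class EpisodicCuriosity:
    def __init__(self, reachability, novelty_threshold=0.5,
                 bonus_scale=1.0, memory_size=200):
        self.reachability = reachability
        self.novelty_threshold = novelty_threshold
        self.bonus_scale = bonus_scale
        self.memory_size = memory_size
        self.memory = []

    def bonus(self, observation):
        """Reward reaching states predicted to be far from everything in memory."""
        if not self.memory:
            self.memory.append(observation)
            return self.bonus_scale
        # Similarity to the closest observation stored in episodic memory.
        similarity = max(self.reachability(observation, m) for m in self.memory)
        if similarity < self.novelty_threshold:
            # Predicted to be far from memory: store it and give a bonus.
            if len(self.memory) < self.memory_size:
                self.memory.append(observation)
            return self.bonus_scale * (1.0 - similarity)
        return 0.0
```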

Rohin’s opinion: This is easier to read than the paper and more informative than our summaries, so I’d recommend it if you’re interested in the paper.

Successor Uncertainties: exploration and uncertainty in temporal difference learning (David Janz et al)

Deep learning

Relational inductive biases, deep learning, and graph networks (Peter W. Battaglia et al): Summarized in the highlights!

Relational recurrent neural networks (Adam Santoro, Ryan Faulkner, David Raposo et al) (summarized by Richard): This paper introduces the Relational Memory Core, which allows interactions between memories stored in memory-based neural networks. It does so using a “self-attention mechanism”: each memory updates its contents by attending to all other memories via several “attention heads” which focus on different features. This leads to particularly good performance on the nth-farthest task, which requires the ranking of pairwise distances between a set of vectors (91% accuracy, compared with baseline 30%), and the Mini-Pacman task.
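
The core operation is multi-head self-attention applied across the memory slots, roughly as sketched below; the projection matrices are random stand-ins for learned parameters, and the residual connection, MLP, and gating that follow in the paper are omitted. Shapes are illustrative:

```python
import numpy as np

# Sketch of multi-head self-attention across memory slots, the core
# operation of the Relational Memory Core. Projections are random here
# where the paper learns them; later residual/MLP/gating steps omitted.

def attend_over_memories(memory, num_heads=4, seed=0):
    num_slots, dim = memory.shape
    head_dim = dim // num_heads
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(num_heads):
        # Per-head query/key/value projections (learned in practice).
        Wq, Wk, Wv = (rng.normal(scale=dim ** -0.5, size=(dim, head_dim))
                      for _ in range(3))
        Q, K, V = memory @ Wq, memory @ Wk, memory @ Wv
        # Every memory slot attends to every other slot (and itself).
        logits = Q @ K.T / np.sqrt(head_dim)
        weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1)   # [num_slots, dim]

updated_memory = attend_over_memories(np.random.default_rng(1).normal(size=(8, 64)))
```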

Richard’s opinion: While performance is good on small problems, comparing every memory to every other doesn’t scale well (a concern the authors also mention in their discussion). It remains to be seen how pruning older memories affects performance.

Relational Deep Reinforcement Learning (Vinicius Zambaldi, David Raposo, Adam Santoro et al) (summarized by Richard): This paper uses the self-attention mechanism discussed in ‘Relational recurrent neural networks’ above to compute relationships between entities extracted from the input data. The system was tested on the Box-World environment, in which an agent needs to use keys to open boxes in a certain order. It generalised very well to test environments that required much longer sequences of actions than any training example, and improved slightly over a baseline on StarCraft mini-games.

Richard’s opinion: Getting neural networks to generalise to longer versions of training problems is often surprisingly difficult, so I’m impressed by the Box-World results; I would have liked to see what happened on even longer problems.

Relational inductive bias for physical construction in humans and machines (Jessica B. Hamrick, Kelsey R. Allen et al)

Applications

Applying Deep Learning To Airbnb Search (Malay Haldar)

Machine learning

Fluid Annotation: An Exploratory Machine Learning–Powered Interface for Faster Image Annotation (Jasper Uijlings and Vittorio Ferrari): This post describes a system that can be used to help humans label images to generate labels for segmentation. The post summarizes it well: “Fluid Annotation starts from the output of a strong semantic segmentation model, which a human annotator can modify through machine-assisted edit operations using a natural user interface. Our interface empowers annotators to choose what to correct and in which order, allowing them to effectively focus their efforts on what the machine does not already know.”

Rohin’s opinion: I’m excited about techniques like this that allow us to scale up AI systems with less human effort, by focusing human effort on the aspects of the problem that AI cannot yet solve, while using existing AI systems to do the low-level work (generating a shortlist of potential segmentations, in this case). This is an example of the paradigm of using AI to help humans more effectively create better AI, which is one of the key ideas underlying iterated amplification. (Though iterated amplification focuses on how to use existing AI systems to allow the human to provide a training signal for tasks that humans cannot perform or evaluate themselves.)