Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.
Audio version here (may not be up yet).
Highlights
AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence (Jeff Clune) (summarized by Yuxi Liu and Rohin): Historically, the bitter lesson (AN #49) has been that approaches that leverage increasing computation for learning outperform ones that build in a lot of knowledge. The current ethos towards AGI seems to be that we will come up with a bunch of building blocks (e.g. convolutions, transformers, trust regions, GANs, active learning, curricula) that we will somehow manually combine into one complex powerful AI system. Rather than require this manual approach, we could instead apply learning once more, giving the paradigm of AI-generating algorithms, or AI-GA.
AI-GA has three pillars. The first is to learn architectures: this is analogous to a superpowered neural architecture search that can discover convolutions, recurrence and attention without any hardcoding. The second is to learn the learning algorithms, i.e. meta-learning. The third and most underexplored pillar is to learn to generate complex and diverse environments within which to train our agents. This is a natural extension of meta-learning: with meta-learning, you have to specify the distribution of tasks the agent should perform well on; AI-GA simply says to learn this distribution as well. POET (AN #41) is an example of recent work in this area.
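To make the third pillar concrete, here is a minimal, purely illustrative sketch of a POET-style outer loop that co-evolves environments and agents. The "environments" and "agents" are toy stand-ins (a target vector and a parameter vector), not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: an "environment" is a target vector, an "agent" is a parameter
# vector, and reward is the (negated) distance between the two.
def reward(agent, env):
    return -np.linalg.norm(agent - env)

def train(agent, env, steps=20, lr=0.1):
    # Hill-climb the agent's parameters on this environment.
    for _ in range(steps):
        candidate = agent + lr * rng.normal(size=agent.shape)
        if reward(candidate, env) > reward(agent, env):
            agent = candidate
    return agent

def mutate(env, scale=0.5):
    # Propose a new, slightly different environment.
    return env + scale * rng.normal(size=env.shape)

# POET-style outer loop: co-evolve a population of (environment, agent) pairs.
population = [(np.zeros(3), np.zeros(3))]
for generation in range(10):
    new_envs = [mutate(env) for env, _ in population]                      # generate environments
    population = [(env, train(agent, env)) for env, agent in population]   # train agents
    for env in new_envs:
        # Transfer: seed each new environment with the best existing agent for it.
        best = max((agent for _, agent in population), key=lambda a: reward(a, env))
        population.append((env, train(best.copy(), env)))
    population = population[-8:]                                           # keep the population bounded

print("final rewards:", [round(reward(agent, env), 3) for env, agent in population])
```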
A strong reason for optimism about the AI-GA paradigm is that it mimics the way that humans arose: natural selection was a very simple algorithm that, with a lot of compute and a very complex and diverse environment, was able to produce a general intelligence: us. Since the approach aims to learn everything and so needs fewer hand-designed building blocks, it could succeed faster than the manual approach, at least if the required amount of compute is not too high. It is also much more neglected than the “manual” approach.
However, there are safety concerns. Any powerful AI that comes from an AI-GA will be harder to understand, since it’s produced by this vast computation where everything is learned, and so it would be hard to get an AI that is aligned with our values. In addition, with such a process it seems more likely that a powerful AI system “catches us by surprise”—at some point the stars align, the giant computation makes one good random choice, and it suddenly outputs a very powerful and sample efficient learning algorithm (aka an AGI, at least by some definitions). There is also the ethical concern that since we’d end up mimicking evolution, we might accidentally instantiate large numbers of simulated beings that can suffer (especially if the environment is competitive, as was the case with evolution).
Rohin’s opinion: Especially given the growth of compute (AN #7), this agenda seems like a natural one to pursue to get AGI. Unfortunately, it also mirrors very closely the phenomenon of mesa optimization (AN #58), with the only difference being that it is intended that the method produces a powerful inner optimizer. As the paper acknowledges, this introduces several risks, and so it calls for deep engagement with AI safety researchers (but sadly it does not propose ideas on how to mitigate the risks).
Due to the vast data requirements, most of the environments would have to be simulated. I suspect that this will make the agenda harder than it may seem at first glance—I think that the complexity of the real world was quite crucial, and that simulating environments that reach the appropriate level of complexity will be a very difficult task. (My intuition is that something like Neural MMO (AN #48) is nowhere near enough complexity.)
Technical AI alignment
Problems
The “Commitment Races” problem (Daniel Kokotajlo) (summarized by Rohin): When two agents are in a competitive game, it is often to each agent’s advantage to quickly make a credible commitment before the other can. For example, in Chicken (both players drive a car straight towards the other and the first to swerve out of the way loses), an agent could rip out their steering wheel, thus credibly committing to driving straight. The first agent to do so would likely win the game. Thus, agents have an incentive to make commitments as quickly as possible, before their competitors can make commitments themselves. This trades off against the incentive to think carefully about commitments, and may result in arbitrarily bad outcomes.
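To see why the first mover wins, consider a toy version of Chicken with made-up payoffs (the numbers are illustrative, not from the post). Once one player has visibly committed to driving straight, the other player's best response is to swerve:

```python
# Chicken payoff matrix: (row player's payoff, column player's payoff).
# Payoffs are illustrative: crashing is much worse than losing face.
PAYOFFS = {
    ("swerve", "swerve"): (0, 0),
    ("swerve", "straight"): (-1, 1),
    ("straight", "swerve"): (1, -1),
    ("straight", "straight"): (-10, -10),  # crash
}
ACTIONS = ["swerve", "straight"]

def best_response(their_action, player):
    # Choose the action maximizing this player's payoff, given the opponent's action.
    def payoff(my_action):
        pair = (my_action, their_action) if player == 0 else (their_action, my_action)
        return PAYOFFS[pair][player]
    return max(ACTIONS, key=payoff)

# Player 0 commits first (rips out the steering wheel, so must drive straight).
commitment = "straight"
reply = best_response(commitment, player=1)
print("player 1's best response:", reply)                  # swerve
print("resulting payoffs:", PAYOFFS[(commitment, reply)])  # (1, -1): the committer wins
```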
Iterated amplification
Towards a mechanistic understanding of corrigibility (Evan Hubinger) (summarized by Rohin): One general approach to align AI is to train and verify that an AI system performs acceptably on all inputs. However, we can’t do this by simply trying out all inputs, and so for verification we need to have an acceptability criterion that is a function of the “structure” of the computation, as opposed to just input-output behavior. This post investigates what this might look like if the acceptability criterion is some flavor of corrigibility, for an AI trained via amplification.
Agent foundations
Troll Bridge (Abram Demski) (summarized by Rohin): This is a particularly clean exposition of the Troll Bridge problem in decision theory. In this problem, an agent is determining whether to cross a bridge guarded by a troll who will blow up the agent if its reasoning is inconsistent. It turns out that an agent with consistent reasoning can prove that if it crosses, it will be detected as inconsistent and blown up, and so it decides not to cross. This is rather strange reasoning about counterfactuals—we’d expect perhaps that the agent is uncertain about whether its reasoning is consistent or not.
Two senses of “optimizer” (Joar Skalse) (summarized by Rohin): The first sense of “optimizer” is an optimization algorithm that, given some formally specified problem, computes the solution to that problem, e.g. a SAT solver or linear program solver. The second sense is an algorithm that acts upon its environment to change it. Joar believes that people often conflate the two in AI safety.
Rohin’s opinion: I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.
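As a toy illustration (with an invented "world dynamics" formula and brute-force search standing in for a real SAT solver), once the formula encodes how actions affect the world, solving for a satisfying assignment already looks like planning:

```python
from itertools import product

# Invented "world dynamics" encoded as a boolean formula over action variables:
# the agent must pick up a key, unlock a door, and open it, in that fixed order.
def goal_reached(pick_up_key, unlock_door, open_door):
    has_key = pick_up_key
    unlocked = unlock_door and has_key
    door_open = open_door and unlocked
    return door_open

def brute_force_solve(formula, n_vars):
    # A stand-in "SAT solver": try every assignment of the action variables.
    for assignment in product([False, True], repeat=n_vars):
        if formula(*assignment):
            return assignment
    return None

print(brute_force_solve(goal_reached, n_vars=3))  # (True, True, True): a plan of actions
```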
Adversarial examples
Testing Robustness Against Unforeseen Adversaries (Daniel Kang et al) (summarized by Cody): This paper demonstrates that adversarially training on just one type or family of adversarial distortions fails to provide general robustness against different kinds of possible distortions. In particular, they show that adversarial training against L-p norm ball distortions transfers reasonably well to other L-p norm ball attacks, but provides little value, and can in fact reduce robustness, when evaluated on other families of attacks, such as adversarially-chosen Gabor noise, “snow” noise, or JPEG compression. In addition to proposing these new perturbation types beyond the typical L-p norm ball, the paper also provides a “calibration table” with epsilon sizes they judge to be comparable between attack types, by evaluating them according to how much they reduce accuracy on either a defended or undefended model. (Because attacks are so different in approach, a given numerical value of epsilon won’t correspond to the same “strength” of attack across methods.)
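For reference, here is a minimal, illustrative sketch of the kind of single-family adversarial training the paper evaluates: a PGD attack constrained to an L-infinity ball, used to generate training examples. The model, data, and hyperparameters are placeholders, not the paper's setup:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Projected gradient descent within an L-infinity ball of radius eps around x.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascend the loss
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project back into the ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # Train on adversarial examples generated from this single distortion family.
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The paper's point is that robustness gained this way tends not to extend beyond the distortion family used during training.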
Cody’s opinion: I didn’t personally find this paper hugely surprising, given that the past pattern of whack-a-mole between attack and defense suggests that defenses tend to be limited in their scope and don’t confer general robustness. That said, I appreciate how centrally the authors frame this lack of transfer as a problem, and the effort they put into generating new attack types and calibrating them so they can be meaningfully compared to existing L-p norm ball ones.
Rohin’s opinion: I see this paper as calling for adversarial examples researchers to stop focusing just on the L-p norm ball, in line with one of the responses (AN #62) to the last newsletter’s highlight, Adversarial Examples Are Not Bugs, They Are Features (AN #62).
Read more: Testing Robustness Against Unforeseen Adversaries
Robustness
An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods (Sanghyuk Chun et al) (summarized by Dan H): There are several small tricks to improve classification performance such as label smoothing, dropout-like regularization, mixup, and so on. However, this paper shows that many of these techniques have mixed and often negative effects on various notions of robustness and uncertainty estimates.
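For readers unfamiliar with the tricks mentioned, here are generic, illustrative implementations of two of them (label smoothing and mixup); they are not the exact versions evaluated in the paper:

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, num_classes, smoothing=0.1):
    # Replace the one-hot target with a mixture of the true label and uniform mass.
    with torch.no_grad():
        smooth = torch.full_like(logits, smoothing / (num_classes - 1))
        smooth.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return torch.mean(torch.sum(-smooth * F.log_softmax(logits, dim=1), dim=1))

def mixup(x, y_onehot, alpha=0.2):
    # Train on convex combinations of pairs of examples and of their labels.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```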
Critiques (Alignment)
Conversation with Ernie Davis (Robert Long and Ernie Davis)
Miscellaneous (Alignment)
Distance Functions are Hard (Grue_Slinky) (summarized by Rohin): Many ideas in AI alignment require some sort of distance function. For example, in Functional Decision Theory, we’d like to know how “similar” two algorithms are (which can influence whether or not we think we have “logical control” over them). This post argues that defining such distance functions is hard, because they rely on human concepts that are not easily formalizable, and the intuitive mathematical formalizations usually have some flaw.
Rohin’s opinion: I certainly agree that defining “conceptual” distance functions is hard. It has similar problems to saying “write down a utility function that captures human values”—it’s possible in theory but in practice we’re not going to think of all the edge cases. However, it seems possible to learn distance functions rather than defining them; this is already done in perception and state estimation.
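As a sketch of what “learning a distance function” can look like in practice, here is a minimal siamese-embedding setup trained with a contrastive loss; the architecture and dimensions are purely illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    # Maps inputs into an embedding space; distance is then measured in that space.
    def __init__(self, in_dim=32, emb_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    # same = 1 for pairs that should be close, 0 for pairs that should be far apart.
    d = F.pairwise_distance(z1, z2)
    return torch.mean(same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2))

def learned_distance(model, x1, x2):
    # The learned distance between two inputs is the distance between their embeddings.
    return F.pairwise_distance(model(x1), model(x2))
```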
AI Alignment Podcast: On Consciousness, Qualia, and Meaning (Lucas Perry, Mike Johnson and Andrés Gómez Emilsson)
AI strategy and policy
Soft takeoff can still lead to decisive strategic advantage (Daniel Kokotajlo) (summarized by Rohin): Since there will be an improved version of this post soon, I will summarize it then.
FLI Podcast: Beyond the Arms Race Narrative: AI & China (Ariel Conn, Helen Toner and Elsa Kania)
Reducing malicious use of synthetic media research: Considerations and potential release practices for machine learning (Aviv Ovadya et al)
Other progress in AI
Reinforcement learning
Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? (Andrew Ilyas et al) (summarized by Cody) (H/T Lawrence Chan): This paper investigates whether and to what extent the stated conceptual justifications for common Policy Gradient algorithms are actually the things driving their success. The paper has two primary strains of empirical investigation.
In the first, they examine a few of the more rigorously theorized aspects of policy gradient methods: learned value functions as baselines for advantage calculations, surrogate rewards, and enforcement of a “trust region” where the KL divergence between old and updated policy is bounded in some way. For value functions and surrogate rewards, the authors find that both of these approximations are weak and perform poorly relative to the true value function and reward landscape respectively.
Basically, it turns out that we lose a lot by approximating in this context. When it comes to enforcing a trust region, they show that TRPO is able to enforce a bound on mean KL, but that it’s much looser than the (more theoretically justified) bound on max KL that would be ideal but is hard to calculate. PPO is even stranger: they find that it enforces a mean KL bound, but only because of optimizations that are present in the canonical implementation but not in the core definition of the algorithm. These optimizations include a custom weight initialization scheme, learning rate annealing on Adam, and reward values that are normalized according to a rolling sum. All of these optimizations contribute to non-trivial increases in performance over the base algorithm, in addition to apparently being central to how PPO maintains its trust region.
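To make the mean-versus-max distinction concrete, here is a small illustrative computation of per-state KL divergences between an “old” and an “updated” categorical policy (the policies here are random placeholders, not outputs of any of the algorithms discussed):

```python
import torch
import torch.nn.functional as F

def kl_per_state(logits_old, logits_new):
    # KL(old || new) for categorical policies, one value per sampled state.
    p_old = F.softmax(logits_old, dim=-1)
    return torch.sum(p_old * (F.log_softmax(logits_old, dim=-1)
                              - F.log_softmax(logits_new, dim=-1)), dim=-1)

logits_old = torch.randn(1000, 4)                      # old policy on 1000 sampled states
logits_new = logits_old + 0.1 * torch.randn(1000, 4)   # slightly updated policy
kl = kl_per_state(logits_old, logits_new)

# A bound on the mean KL over sampled states can be satisfied even when the max KL
# (what the theory would prefer to control) is much larger.
print("mean KL:", kl.mean().item())
print("max KL: ", kl.max().item())
```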
Cody’s opinion: This paper seems like one that will make RL researchers usefully uncomfortable, by pointing out that the complexity of our implementations means that just having a theoretical story of your algorithm’s performance and empirical validation of that improved performance isn’t enough to confirm that the theory is actually the thing driving the performance. I do think the authors were a bit overly critical at points: I don’t think anyone working in RL would have expected that the learned value function was perfect, or that gradient updates were noise-free. But, it’s a good reminder that saying things like “value functions as a baseline decrease variance” should be grounded in an empirical examination of how good they are at it, rather than just a theoretical argument that they should.
Learning to Learn with Probabilistic Task Embeddings (Kate Rakelly, Aurick Zhou et al) (summarized by Cody): This paper proposes a solution to off-policy meta-reinforcement learning, an appealing problem because on-policy RL is so sample-intensive, and meta-RL is even worse because it needs to solve a distribution over RL problems. The authors’ approach divides the problem into two subproblems: infer an embedding, z, of the current task given context, and learn an optimal policy and Q-function conditioned on that task embedding. At the beginning of each task, z is sampled from the (Gaussian) prior, and as the agent gains more samples of that particular task, it updates its posterior over z, which can be thought of as refining its guess as to which task it’s been dropped into this time. The trick here is that this subdividing of the problem allows it to be done mostly off-policy, because you only need to use on-policy learning for the task inference component (predicting z given current task transitions), and can learn the actor-critic model conditioned on z with off-policy data. The method works by alternating between these two learning modes.
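Here is a structural sketch of the two components described: a context encoder that produces a Gaussian posterior over the task embedding z, and a Q-function conditioned on z. The networks, dimensions, and names are placeholders, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    # Maps transitions from the current task to a Gaussian posterior over z.
    def __init__(self, transition_dim, z_dim=5):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(nn.Linear(transition_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * z_dim))

    def forward(self, context):                  # context: (num_transitions, transition_dim)
        params = self.net(context).mean(dim=0)   # aggregate over observed transitions
        mu, log_std = params[:self.z_dim], params[self.z_dim:]
        return torch.distributions.Normal(mu, log_std.exp())

class QFunction(nn.Module):
    # Q(s, a, z): a critic conditioned on the inferred task embedding.
    def __init__(self, state_dim, action_dim, z_dim=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim + z_dim, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, action, z):
        return self.net(torch.cat([state, action, z], dim=-1))

# At the start of a task, z is drawn from the prior N(0, I); as transitions accumulate,
# z is resampled from the encoder's posterior, refining the guess about which task this is.
encoder = ContextEncoder(transition_dim=10)
q = QFunction(state_dim=4, action_dim=2)
context = torch.randn(32, 10)                    # transitions observed so far
z = encoder(context).rsample()
value = q(torch.randn(1, 4), torch.randn(1, 2), z.unsqueeze(0))
```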
Cody’s opinion: I enjoyed this; it’s a well-written paper that uses a few core interesting ideas (posterior sampling over a task distribution, representation of a task distribution as a distribution of embedding vectors passed in to condition Q functions), and builds them up to make a method that achieves some impressive empirical results.
Read more: Efficient Off-Policy Meta-RL via Probabilistic Context Variables
Paul made some arguments that contradict this on the 80k podcast:
Yup, I am aware of these arguments and disagree with them, though I haven’t written up the reasons anywhere.
Would be cool to hear at some point :)
As far as I can tell, AI-GA doesn’t fit into any of the current AI safety success stories, and it seems hard to imagine what kind of success story it might fit into. I’m curious if anyone is more optimistic about this.
As phrased in the paper I’m pretty pessimistic, mostly because the paper presents a story with a discontinuity, where you throw a huge amount of computation at the problem and then at some point AGI emerges abruptly.
I think it’s more likely that there won’t be discontinuities—the giant blob of computation keeps spitting out better and better learning algorithms, and we develop better ways of adapting them to tasks in the real world.
At some point one of these algorithms tries and fails to deceive us, we notice the problem and either fix it or stop using the AI-GA approach / limit ourselves to not-too-capable AI systems.
It seems plausible that you could get something like the Interim Quality-of-Life Improver out of such an approach. You’d have to deal with the problem that by default these AI systems are going to have weird alien drives that would likely make them misaligned with us, but you probably do get examples of systems that would deceive us that you can study and fix.
The Ernie Davis interview was pretty interesting, as a good delve into what people are thinking when they don’t see AI alignment work as important.
The disagreement on how impactful superintelligent AI would be seems important, but not critically important. As long as you agree the impact of AIs that make plans about the real world will be “big enough,” you’re probably on board with wanting them to make plans that are aligned with human values.
The “common sense” disagreement definitely seems more central. The argument goes something like “Any AI that actually makes good plans has to have common sense about the world, it’s common sense that killing is wrong, the AI won’t kill people.”
Put like this, there’s a bit of a package deal fallacy, where common sense is treated as a package deal even though “fire is hot” and “killing is bad” are easy to logically separate.
But we can steelman this by talking about learning methods—if we have a learning method that learns common sense like “fire is hot,” wouldn’t it be easy to also use that method to learn “killing is bad”? Well, maybe not necessarily, because of the is/ought distinction. If the AI represents “is” statements with a world model, and then rates actions in the world using an “ought” model, then it’s possible for a method to do really well at learning “is” statements without being good at learning “ought” statements.
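To make the separation concrete, here is a toy sketch (everything in it is illustrative) of an agent whose “is” component is a world model and whose “ought” component is a separate evaluator of outcomes; learning the first well does not constrain the second:

```python
import numpy as np

class WorldModel:
    # The "is" component: predicts what will happen (facts like "fire is hot" live here).
    def predict(self, state, action):
        return state + action          # placeholder dynamics

class OughtModel:
    # The "ought" component: rates outcomes; nothing in the world model constrains it.
    def __init__(self, weights):
        self.weights = weights

    def rate(self, state):
        return float(np.dot(self.weights, state))

def plan(world, ought, state, candidate_actions):
    # Pick the action whose predicted outcome the ought-model rates highest.
    return max(candidate_actions, key=lambda a: ought.rate(world.predict(state, a)))

world = WorldModel()
values = OughtModel(np.array([1.0, -1.0]))
state = np.zeros(2)
actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(plan(world, values, state, actions))  # -> [1. 0.]
# The same world model paired with different "ought" weights plans very differently.
```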
Thinking in terms of learning methods also opens up a second question—is it really necessary for an AI to have human-like common sense? If you just throw a bunch of compute at the problem, could you get an AI that takes clever actions in the world without ever learning the specific fact “fire is hot”? How likely is this possibility?
What you say about is/ought is basically the alignment problem, right? My take is: I have high confidence that future AIs will know intellectually what it is that humans regard as common-sense morality, since that knowledge is instrumentally useful for any goal involving predicting or interacting with humans. I have less confidence that we’ll figure out how to ensure that those AIs adopt human common-sense morality. Even humans, who probably have an innate drive to follow societal norms, will sometimes violate norms anyway, or do terrible things in a way that works around those constraints.
I’m skeptical of this. I think that it’s well within our capabilities to create a virtual environment with a degree of complexity comparable to the ancestral environment. For instance, the development of Minecraft with all of its complexity can be upper bounded by the cost of paying ~25 developers over the course of 10 years. But the core features of the game, Minecraft alpha, were done by a single person in his spare time over 2 years.
I think a smallish competent team with a 10-100 million dollar budget could easily throw together a virtual environment with ample complexity, possibly including developing FPGAs or ASICs to run it at the required speed.
It seems to me that a SAT solver can be arbitrarily competent at solving SAT problems without being the second kind of optimizer (i.e. without acting upon its environment to change it), even while it solves SAT problems that encode the dynamics of our world. For example, this seems to be the case for a SAT solver that is just a brute force search with arbitrarily large amount of computing power.
[EDIT: When writing this comment, I considered “the environment of a SAT solver” to be the world that contains the computer running the SAT solver. However, this seems to contradict what Joar had in mind in his post.]
simple?
I’m unsure whether you are drawing attention to the word “sample.” If so, sample efficiency refers to the amount of experience an RL agent needs in order to perform well in an environment. See here.
Yup, this is what I meant.