Alignment research exercises
It’s currently hard to know where to start when trying to get better at thinking about alignment. So below I’ve listed a few dozen exercises which I expect to be helpful. They assume a level of background alignment knowledge roughly equivalent to what’s covered in the technical alignment track of the AGI safety fundamentals course. They vary greatly in difficulty—some are standard knowledge in ML, some are open research questions. I’ve given the exercises star ratings from * to *** for difficulty (note: not for length of time to complete—many require reading papers before engaging with them). However, I haven’t tried to solve them all myself, so the star ratings may be significantly off.
I’ve erred on the side of including exercises which seem somewhat interesting and alignment-related even when I’m uncertain about their value; when working through them, you should keep the question “is this actually useful? Why or why not?” in mind as a meta-exercise. This post will likely be updated over time to remove less useful exercises and add new ones.
I’d appreciate any contributions of:
Comments about which exercises seem most or least useful.
Answers to the exercises.
More exercises! The ideal exercises are nerdsnipe-style problems which can be stated clearly, and seem well-defined, but lead into interesting depths when explored.
Reward learning
* Look at the examples of human feedback mechanisms discussed in the reward-rational implicit choice paper. Think of another type of human feedback. What is the choice set? What is the grounding function?
* This paper by Anthropic introduces a technique called context distillation. Describe this in terms of the reward-rational implicit choice framework.
* Estimate the bandwidth of information conveyed by different types of human feedback. Describe a rough model for how this might change as training progresses. By contrast, how much information is conveyed by the choice of a programmatic reward function? (Consider both the case where the agent is given the exact reward function, and where it learns from reward observations.)
* Look at the examples of biases discussed in learning the preferences of ignorant agents. Identify another bias which similarly influences human decision-making. Describe an example situation where a human with that bias might make the wrong decision. Formulate an algorithm that infers that human’s true preferences.
** Given that humans can be assigned any values, why does reward learning ever work in practice?
** Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.
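The bandwidth exercise above can be made concrete with a back-of-envelope calculation. Here's a minimal sketch, assuming noiseless feedback and equally likely options (the channel sizes below are illustrative choices, not claims about any real setup):

```python
import math

def bits_per_event(choice_set_size: int) -> float:
    """Upper bound on information (in bits) conveyed by one noiseless
    choice out of `choice_set_size` equally likely options."""
    return math.log2(choice_set_size)

# Hypothetical feedback channels (sizes are illustrative assumptions):
comparison = bits_per_event(2)                  # pick the better of 2 trajectories
ranking    = bits_per_event(math.factorial(5))  # full ranking of 5 trajectories
rating     = bits_per_event(10)                 # a 1-10 scalar rating
demo       = 100 * bits_per_event(4)            # 100-step demo, 4 actions per step

for name, bits in [("pairwise comparison", comparison),
                   ("rank 5 trajectories", ranking),
                   ("1-10 rating", rating),
                   ("100-step demonstration", demo)]:
    print(f"{name}: at most {bits:.1f} bits")
```

Note how quickly demonstrations dominate comparisons on this naive count; the exercise is partly about why raw bit counts like these are misleading (human feedback is noisy, redundant, and unevenly informative as training progresses).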
Agency
** In this paper, researchers devised a test for whether a recurrent network is doing planning: they checked whether its performance improves when it is given more time to “think” before it has to act. In the AlphaGo paper, researchers compared the performance of their MCTS+neural network algorithm against the network alone. Think of another test we could run that would give us evidence about the extent to which a given neural network is internally doing planning.
* Consider HCH, an attempted formalisation of “a human’s enlightened judgment”. Why might an implementation of HCH not be aligned? What assumptions would be needed to prevent that?
*** In a later post, Paul defines a stronger version of HCH which “increases the complexity-theoretic expressiveness of HCH. The old version could be computed in EXPTIME, while the new version can compute any decidable function.” Try to rederive a new version of HCH with these properties.
* Ask the OpenAI API about what steps it would take to perform some long-term plan. Work in groups: think of a task that you expect it will be difficult to generate a good plan for, and then see who can design a prompt that will produce the best plan from the API.
* Some steps of a plan generated by the API can also be performed by the API—e.g. a step which requires writing a poem about a given topic. What’s the hardest task you can find for which the API can not only generate a plan, but also perform each of the steps in that plan?
** Pearl argues that neural networks trained on supervised or self-supervised data can’t learn to reason about interventions and counterfactuals (see this post for an explanation of the distinction). What’s the strongest counterargument against his position?
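One cheap way to build intuition for the “more thinking time” test above is to use search depth as a stand-in for thinking time in a toy game. The sketch below (one-pile Nim, with a deliberately uninformative evaluation at the search horizon; both are my own illustrative choices, not the setup from the paper) shows a depth-limited minimax player finding the winning move only when given enough depth:

```python
def minimax(pile: int, depth: int, maximizing: bool) -> float:
    """Depth-limited minimax for one-pile Nim (take 1-3 stones;
    taking the last stone wins). Returns the value for the maximizer;
    0.0 at the horizon, i.e. an uninformative 'evaluation function'."""
    if pile == 0:
        # the previous player took the last stone and won
        return -1.0 if maximizing else 1.0
    if depth == 0:
        return 0.0
    values = [minimax(pile - take, depth - 1, not maximizing)
              for take in (1, 2, 3) if take <= pile]
    return max(values) if maximizing else min(values)

def best_move(pile: int, depth: int) -> int:
    """Number of stones to take, chosen by depth-limited search."""
    return max((take for take in (1, 2, 3) if take <= pile),
               key=lambda take: minimax(pile - take, depth - 1, False))

print(best_move(6, depth=1))  # 1 -- shallow search can't tell the moves apart
print(best_move(6, depth=5))  # 2 -- deeper "thinking" finds the winning reply
```

From a pile of 6 the winning move is to take 2 (leaving a multiple of 4); the depth-1 player misses it because all its horizon evaluations are identical. Analogous tests for neural networks would compare performance as a function of inference-time compute rather than search depth.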
Reinforcement learning
** How is supervised learning on reward-maximising trajectories related (mathematically) to policy gradient with sparse, binary rewards?
** What decision theories are implemented by different RL algorithms?
** What might lead an RL agent to learn a policy which sacrifices reward in its current episode to get higher reward in a later episode?
* Self-play in zero-sum two-player games converges to an optimal strategy (given sufficient assumptions about the model class). In other games, this isn’t the case—why not?
** Evaluate this paper (Reward is Enough). Does their argument hold up?
** After doing that: consider a bird practicing its song. It listens to its own singing and does RL using the rule: the better the song sounds, the higher the reward. But the bird is also deciding how much time to spend practicing singing versus foraging, etc. And the worse it sings, the more important it is to practice! So you really want the rule: the worse the song sounds, the more rewarding it is to practice singing. How could you resolve this conflict?
* Why can a behaviourally cloned policy perform well when run for a small number of timesteps, but poorly when run over a longer horizon? How can this be fixed?
** If a deep Q-learning agent is trained in an environment where some actions lead to large negative rewards, it will never stop trying these actions (the policy will sometimes take these actions even when it is not randomly exploring via epsilon exploration). Why does this happen? How could it be prevented?
** RL agents have become capable of competent behaviour over longer and longer episodes. What difficulties arise in trying to measure improvements in how long they can act competently for? What metrics are most useful?
The same question, but for sample efficiency rather than episode length.
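The exploration half of the deep Q-learning question above can be seen in a one-state toy problem. This is a minimal sketch (the rewards, epsilon, and learning rate are arbitrary illustrative choices): even long after the agent has learned that an action is catastrophic, epsilon-greedy exploration keeps selecting it at a fixed rate. The exercise points at the subtler failure, where even the greedy policy retries the action.

```python
import random

random.seed(0)
EPS, ALPHA = 0.1, 0.5
Q = [0.0, 0.0]           # one state, two actions
REWARD = [1.0, -100.0]   # action 1 is catastrophic

catastrophes_late = 0
for step in range(20_000):
    if random.random() < EPS:            # epsilon-greedy exploration
        a = random.randrange(2)
    else:                                # greedy action
        a = 0 if Q[0] >= Q[1] else 1
    Q[a] += ALPHA * (REWARD[a] - Q[a])   # bandit-style Q update
    if step >= 10_000 and a == 1:
        catastrophes_late += 1

# Long after Q[1] has converged to -100, the agent still takes the
# catastrophic action roughly EPS/2 of the time:
print(catastrophes_late)  # roughly 500 of the 10,000 late steps
```

With function approximation rather than a table, things get worse: once the agent stops taking the bad action it stops getting data about it, so the learned value can drift back up and the greedy policy retries it.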
Neural networks
* Consider this paper on modularity in neural networks. Evaluate their metric of clustering; what others could we use instead?
** Consider the following alignment proposal: a neural network has two output heads, one of which chooses actions, the other of which predicts the longer-term consequences of those actions. Suppose that we train the latter head to maximise human-evaluated prediction quality. What differences might we expect from backpropagating that loss all the way through the network, versus only backpropagating through the prediction head? What complications arise if we try to train the prediction head via RL? What advantages might there be of doing so?
** “Gradient hacking” is a hypothesised phenomenon by which a model decides its actions partly on the basis of its observations of its own parameters, thereby changing the way its parameters are updated. Does the gradient hacking mechanism described in the linked post work? If not, does any variant of it work?
* Read Jacob Steinhardt’s list of examples of emergent shifts in machine learning. Can you think of any others? What about shifts that you expect in the near future?
** What might it look like for the circuits hypothesis to be false?
* This paper discusses the metric of “effective data transferred”. What are the limitations of this metric? What are some alternative ways to measure data transfer?
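For the modularity exercise above, one concrete candidate metric is Newman's modularity Q, sketched here by brute force on a toy weighted graph (the graph is an illustrative stand-in, not a real network's weight matrix):

```python
def modularity(adj, partition):
    """Newman modularity Q of a partition of a weighted, undirected
    graph given as an adjacency matrix (list of lists)."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)   # 2m = total edge weight, doubled
    degree = [sum(row) for row in adj]
    q = 0.0
    for i in range(n):
        for j in range(n):
            if partition[i] == partition[j]:
                # observed weight minus expected weight under a random graph
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

# Two 3-node cliques joined by a single weak bridge edge:
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]

good = modularity(A, [0, 0, 0, 1, 1, 1])   # split along the bridge
bad  = modularity(A, [0, 1, 0, 1, 0, 1])   # arbitrary split
print(good, bad)
```

The natural split scores well above zero while the arbitrary one scores below it. Applying any such metric to a trained network requires choosing how to turn weights into a graph in the first place, which is arguably the harder part of the exercise.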
Alignment theory
* Consider extending reinforcement learning to the case where rewards can depend on the parameters of a model. Why do classic convergence proofs no longer work?
*** Are there any limiting assumptions which might lead to interesting theoretical results?
** One concern with proposals to train using loss functions that depend directly on neural activations is that if we train a network to avoid carrying out any particular piece of cognition, that cognition will instead just be distributed across the network in a way that we can’t detect. Describe a toy example of a cognitive trait that we can currently detect automatically. Design an experiment to determine whether, after training to remove that trait, the network has learned to implement an equivalent trait in a less-easily-detectable way.
*** Rederive some of the proofs in the following papers. For b) and c), explain what assumptions are being made about the optimality of the agents involved, and how they might break down in practice:
*** Produce a proposal for the ELK prize (note that this requires engaging with the ELK writeup, which is very long).
** Suppose that we’re training a model via behavioural cloning of a human, but the human starts off with different prior knowledge to the model (either more knowledge, or less knowledge). How might this lead the model to behave in a misaligned way?
Agent foundations
Evolution and economics
* An old study split insects into several groups which each lived together, and artificially selected in favour of smaller groups, in an attempt to study whether they would evolve to voluntarily restrain their breeding. Predict the outcome of the study.
Some answers here. Did the bias discussed in this post influence your expectations?
** What might explain why there are so few hermaphroditic animal species, given that every individual being able to bear children could potentially double the number of children in the next generation?
* Read this post about evolving to extinction. Mathematically demonstrate that segregation-distorters could in fact lead a species to evolve to extinction.
* Evaluate Fletcher and Doebeli’s model of the evolution of altruism.
Use the model to show how the green-beard effect could lead to the evolution of (a certain type of) altruism.
** Why are roughly equal numbers of males and females born in most species?
* Comparing GDP across time requires reference to a standard basket of goods and services. What difficulties might this cause in taking GDP comparisons at face value?
** Evaluate Roodman’s model of explosive economic growth.
* In cooperative game theory, the “core” is the term for the set of allocations of payoffs to agents where no subset of the agents can form a coalition to improve their payoffs. For example, consider a group of N miners who have discovered large bars of gold. Assume that two miners can carry one bar of gold, and so the payoff of any coalition S is floor(|S|/2). If N is even, then the core consists of the single payoff distribution where each miner gets ½ a bar. If N is odd, then the core is empty (because the miner who is left out can always make a better offer to some miner who currently has a gold-carrying partner). Identify the core for the following games:
A game with 2001 players: 1000 of them have 1 left shoe, 1001 have 1 right shoe. A left-shoe/right-shoe pair can be sold for $10.
Mr A and Mr B each have three gloves. Any two gloves make a pair that they can sell for $5.
* How should coalitions decide how to split the payoffs they receive? The concept of Shapley values provides one answer. Convince yourself that Shapley values have the properties of linearity, null player and the stand-alone test described in the linked article.
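As a companion to the Shapley-value exercise, here is a brute-force implementation that averages marginal contributions over all orderings, applied to a small hypothetical three-player glove game (the game is my own illustrative example, not one from the linked article):

```python
from itertools import permutations

def shapley_values(players, v):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the grand coalition could form."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = v(frozenset(coalition))
            coalition.add(p)
            totals[p] += v(frozenset(coalition)) - before
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical game: A and B each hold a left glove, C holds the only
# right glove; a left-right pair sells for $10.
def v(coalition):
    lefts = len(coalition & {"A", "B"})
    rights = len(coalition & {"C"})
    return 10 * min(lefts, rights)

print(shapley_values(["A", "B", "C"], v))
```

C, who holds the scarce glove, gets $20/3 while A and B split the rest; the values sum to the grand coalition's $10 (efficiency), and A and B get equal shares (symmetry), illustrating two of the properties the exercise asks you to verify.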
Some important concepts in ML
These are intended less as exercises and more as pointers to open questions at the cutting edge of deep learning. For each concept, ask: why does it have the form it does?
Miscellaneous
* Fill in your estimates in Cotra’s timeline model. Does the model broadly make sense to you; are there ways you’d change it?
* Try playing OpenAI’s implementation of the Debate game.
** Identify an important concept in alignment that isn’t currently very well-explained; write a more accessible explanation.
I particularly appreciate the questions that ask one to look at a way that a problem was reified/specified/ontologized in a particular domain and asks for alternative such specifications. I thought Superintelligence (2014) might be net harmful because it introduced a lot of such specifications that I then noticed were hard to think around. I think there are a subset of prompts from the online course/book Framestorming that might be useful there, I’ll go see if I can find them.
I also have this impression regarding Superintelligence. I’m wondering if you have examples of a particular concept or part of the framing that you think was net harmful?
The speed/collective/quality superintelligence distinction is the one that springs most readily to mind, but quite a few of the distinctions struck me this way at the time I read it.
I also thought the treacherous turn, and the chapter on multipolar cooperation baked in a lot of specifics.
I really worry about this and it has become quite a block. I want to support fragile baby ontologies emerging in me amidst a cacophony of “objective”/”reward”/etc. taken for granted.
Unfortunately, going off and trying to deconfuse the concepts on my own is slow and feedback-impoverished and makes it harder to keep up with current developments.
I think repurposing “roleplay” could work somewhat, with clearly marked entry and exit into a framing. But ontological assumptions absorb so illegibly that deliberate unseeing is extremely hard, at least without being constantly on guard.
Are there other ways that you recommend (from Framestorming or otherwise?)
I think John Cleese’s relatively recent book on creativity and Olivia Fox Cabane’s The Net and the Butterfly are both excellent.
Great list of interesting questions, trains of thought and project ideas in this post.
I was a little surprised not to find any exercises on interpretability. Perhaps there was a reason for excluding it, but if not, here is an idea for another exercise/group of exercises to include (perhaps it could be merged into the “Neural networks” section):
Interpretability
Mechanistic interpretability is a research direction that systematically investigates the neurons or nodes of a neural network or ML model to try and understand what various neurons are doing. This approach has been applied already to learn about some of the neurons in early vision models as well as transformer language models. This research has also yielded some findings on how some groups of neurons work together, called “circuits”.
Review the above links and use a similar approach to investigate some neurons in another kind of neural network. There are many kinds of neural networks that could be investigated, but a few examples include reinforcement-learning (RL) agents, generative adversarial networks (GANs), protein-folding networks, etc.
Yeah, I’m keen to add exercises on interpretability. I like the direction of yours, but it feels a bit too hard, in the sense that it’s a pretty broad request where it’s difficult to know where to start or how much progress you’re making. Any ideas on what more specific things we could ask people to do, or ways to make the exercise more legible to them?
That’s a fair point. I had thought this would be around the same level of difficulty as some of the exercises in the list such as “Produce a proposal for the ELK prize”. But I’m probably biased because I have spent a bit of time working in this area already.
I don’t know off the top of my head any ways to decompose the problem or simplify it further, but I’ll post back if I think of any. I think it will help as Lucid and Lucent get better, or perhaps if Anthropic open-sources their interpretability tooling. That could make it significantly easier to onboard people to these kinds of problems and scale up the effort.
Difference IMO is mainly that Circuits steps you through the problem in a way designed to help you understand their thinking, whereas ELK steps you through the problem in a way designed to get people to contribute.
(Perhaps “produce a proposal for something to investigate” might be of a similar difficulty as the ELK prize, but also Circuits work is much more bottom-up so it seems hard to know what to latch onto before having played around a bunch. Agreed that new tooling for playing around with things would help a lot.)
This post, by example, seems like a really good argument that we should spend a little more effort on didactic posts of this sort. E.g. rather than just saying “physical systems have multiple possible interpretations,” we could point people to a post about a gridworld with a deterministic system playing the role of the agent, such that there are a couple different pretty-good ways of describing this agent that mostly agree but generalize in different ways.
This perspective might also be a steelmanning of that sort of paper where there’s an abstract argument that does all the work, and then some code that tells you nothing new if you followed the abstract argument. The code (in the steelmanned story) isn’t just there to make the paper be in the right literary genre or provide a semi-trustworthy signal that you’re not a crank who makes bad abstract arguments, it’s a didactic tool to help the reader do these sorts of exercises.
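In the spirit of the gridworld suggestion above, here is a minimal sketch (the layout and the two candidate interpretations are illustrative assumptions): a deterministic agent that always walks right admits two descriptions that agree on the training layout but generalize differently.

```python
def run_agent(start, n_cells):
    """The actual agent: deterministically walk right to the last cell."""
    path = list(range(start, n_cells))
    return path[-1]          # cell where the agent ends up

# Two candidate "interpretations" of what the agent is doing:
def wants_rightmost(layout):
    return layout["n_cells"] - 1   # "it seeks the rightmost cell"

def wants_coin(layout):
    return layout["coin"]          # "it seeks the coin"

train = {"n_cells": 5, "coin": 4}  # coin sits in the last cell
test  = {"n_cells": 5, "coin": 1}  # coin moved off-distribution

# Both interpretations predict the observed behaviour on `train`...
assert wants_rightmost(train) == wants_coin(train) == run_agent(0, 5)
# ...but make different predictions about the agent's "goal" on `test`:
print(wants_rightmost(test), wants_coin(test))  # 4 1
```

A didactic post could start from exactly this kind of ambiguity and ask which further observations (or interventions) would distinguish the two readings.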
Suggestion on Agency 2.1: rephrase so that the “Before reading his post” part comes before the link to the post. I assume there’ll otherwise be some overzealous link followers.
Thanks, done!
Curated. Exercises are crucial for the mastery of topics and the transfer of knowledge, it’s great to see someone coming up with them for the nebulous field of Alignment.
Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant—it makes a number of assumptions about agents and utility functions and I wasn’t able to connect it to why I should expect an agent trained using CIRL to kill me.
FWIW here’s my alternative answer:
In other words it’s only a solution to “Learn from Teacher” in Paul’s 2019 decomposition of alignment, not to the whole alignment problem.
I thought about Agency Q4 (counterargument to Pearl) recently, but couldn’t come up with anything convincing. Does anyone have a strong view/argument here?
I don’t see any claim that it’s impossible for neural nets to handle causality. Pearl’s complaining about AI researchers being uninterested in that goal.
I suspect that neural nets are better than any other approach at handling the hard parts of causal modeling: distinguishing plausible causal pathways from ridiculous ones.
Neural nets currently look poor at causal modeling for roughly the same reason that High Modernist approaches weren’t willing to touch causal claims: without a world model that’s comprehensive enough to approximate common sense, causal modeling won’t come close to human-level performance.
A participant in Moderna’s vaccine trial was struck by lightning. How much evidence is that for our concern that the vaccine is risky?
If I try to follow the High Modernist approach, I think it says something like we should either be uncertain enough to avoid any conclusion, or we should treat the lightning strike as evidence of vaccine risks.
As far as I can tell, AI approaches other than neural nets perform like scientists who blindly follow a High Modernist approach (assuming the programmers didn’t think to encode common sense about whether vaccines affect behavior in a lightning-strike-seeking way).
Whereas GPT-3 has some hints about human beliefs that make it likely to guess a little bit better than the High Modernist.
GPT-3 wasn’t designed to be good at causality. It’s somewhat close to being a passive observer. If I were designing a neural net to handle causality, I’d give it an ability to influence an environment that resembles what an infant has.
If there are any systems today that are good at handling causality, I’d guess they’re robocar systems. What I’ve read about those suggests they’re limited by the difficulty of common sense, not causality.
I expect that when causal modeling becomes an important aspect of what AI needs for further advances, it will be done with systems that use neural nets as important components. They’ll probably look a bit more like Drexler’s QNR than like GPT-3.
Just a quick logistical thing: do you have any better source of Pearl making that argument? The current quanta magazine link isn’t totally satisfactory, but I’m having trouble replacing it.