Jan Leike leads the safety team at OpenAI. This post is from a Substack series titled “Musings on the Alignment Problem”. I’ve found his posts helpful for understanding how the OpenAI safety team is thinking about alignment & for developing my own models of alignment/AI x-risk. I encourage more people to engage with Jan’s blog posts.
In the post distinguishing three alignment taxes, Jan identifies three types of alignment tax. I appreciate how the post (a) offers concrete definitions of each type, (b) discusses how they might affect AI labs in a competitive market, and (c) discusses how they might affect AI-assisted alignment research.
I’m including some quotes below (headings are added by me):
The three types of alignment taxes
In the general sense an alignment tax is any additional cost that is incurred in the process of aligning an AI system. Let’s distinguish three different types of alignment taxes:
Performance taxes: Performance regressions caused via alignment compared to an unaligned baseline.
Development taxes: Effort or expenses incurred for aligning the model: researcher time, compute costs, compensation for human feedback, etc.
Time-to-deployment taxes: Wall-clock time taken to produce a sufficiently aligned model from a pretrained model.
Past work on performance taxes
In the past this performance tax has been measured by how much the model’s score is reduced on standard benchmarks after fine-tuning. While training the first version of InstructGPT OpenAI observed performance regressions on some standard benchmarks on question answering and translation. Those were mostly but not entirely mitigated by mixing pretraining data into the fine-tuning process. Anthropic, DeepMind, and Google have also studied alignment taxes as part of their alignment efforts, and sometimes alignment fine-tuning can even increase performance on several benchmarks, corresponding to a negative performance tax.
One way to operationally define a performance tax
However, there is a more natural way to quantify this tax that translates it more directly into monetary terms: measuring how much extra compute we need to spend at inference time to compensate for the performance regression. If our more aligned model needs to spend T% more inference-time compute to get from performance Z’ back to performance Z on capability X, then we say there is a T% alignment tax. For example, if we always need to run best-of-2, this corresponds to a 100% alignment tax. If we need to run best-of-4 for 10% of all tasks, this corresponds to a 4*10% = 40% alignment tax.
The size of development taxes might be fairly unrelated to the size of the model
Today’s development taxes include building an RLHF codebase, hiring and managing human labelers, compute, and researcher effort. My (pretty rough) guess is that the total development costs of InstructGPT probably sum up to about 5-20% of GPT-3’s development cost. However, most of this development cost is independent of the size of the model, and improving the alignment of a 10x smaller or larger language model would have taken a similar amount of effort. In fact, in reality it is probably the other way around: the higher development cost of larger language models justifies a larger effort on making them more aligned, such as having a larger team working on it.
Time-to-deployment taxes are difficult to predict
For GPT-3 this pipeline took us about 9 months, while today we have enough infrastructure to produce a pretty good model within 3 months since we can reuse a lot of the existing data and code.
This calculus is flawed for an important reason: at some point more capable models can’t be aligned with the same techniques. Therefore simply optimizing our existing training loop won’t help reduce the time-to-deployment of future models. In particular, once models get capable enough to do hard tasks that humans struggle to evaluate, we’d want to use AI-assisted evaluation to train them. However, the infrastructure to do this well is still being developed.
Performance taxes may not matter much for automated alignment research
When using AI systems to do automated alignment research, these AI systems will also be subject to alignment taxes. However, in this case our AI system is not directly competing with other AI systems in a market, and thus the performance taxes won’t matter as much. Yet the time-to-deployment tax still matters: if alignment progress can’t keep up with AI capabilities, we’d have to slow down or pause AI progress, which would be a very difficult coordination problem.
The performance tax that could be sustained for automated alignment research depends a lot on the total amount of work the system needs to do. In these cases, the development taxes will be the dominating factor.
I had missed this earlier. Seems like a useful crystallization.
Something seems wrong here. The examples are:
2x as much compute on 100% of tasks --> 100% alignment tax
4x as much compute on 10% of tasks (and 1x as much compute on 90% of tasks) --> 40% alignment tax
In (1), we spend 2*100% = 200% of the compute we spent before, which is 100% more. But in (2), we spend (4*10% + 1*90%) = 130% of the compute we spent before, which is 30% more. So I think the 2nd example is a 30% alignment tax, not 40%?
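To sanity-check this arithmetic, here is a small Python sketch. The `alignment_tax_pct` helper is my own illustration, not something from the post: it computes the expected inference-compute multiplier across task fractions, then expresses the extra compute as a percentage of the baseline.

```python
def alignment_tax_pct(cases):
    """Percent extra inference compute vs. an unaligned baseline.

    cases: list of (fraction_of_tasks, compute_multiplier) pairs
    whose fractions sum to 1.
    """
    assert abs(sum(f for f, _ in cases) - 1.0) < 1e-9, "fractions must cover all tasks"
    expected = sum(f * m for f, m in cases)  # expected compute multiplier
    return round((expected - 1.0) * 100, 6)

# Example 1: best-of-2 on every task -> 100% extra compute
print(alignment_tax_pct([(1.0, 2)]))            # 100.0

# Example 2: best-of-4 on 10% of tasks, baseline compute on the other 90%
print(alignment_tax_pct([(0.1, 4), (0.9, 1)]))  # 30.0, not 40
```

Running this reproduces the 100% figure for the first example and 30% for the second, consistent with the correction above.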