[AN #137]: Quantifying the benefits of pretraining on downstream task performance
Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
HIGHLIGHTS
Scaling Laws for Transfer (Danny Hernandez et al) (summarized by Asya): This paper studies empirical scaling laws for transfer learning in language models. The authors use Transformer-based models to predict Python code by training on three different dataset curricula:
- Training from-scratch on Python code
- Pre-training on natural language, then fine-tuning on Python code
- Pre-training on natural language and non-Python code, then fine-tuning on Python code
The authors then measure the “effective data transferred” from pre-training—if we wanted to replace all the pre-training steps with from-scratch training, maintaining the same loss, how much additional from-scratch data would we need?
They find that when the fine-tuning dataset is small, the effective data transferred (call it D_T) is well described by a simple power-law function of D_F, the amount of data used for fine-tuning, and N, the number of parameters: D_T = k (D_F)^α (N)^β, for constants k, α, and β.
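The functional form is simple enough to write down directly; here is a minimal sketch in code, with k, α, and β left as inputs rather than filled in with the paper's fitted values:

```python
def effective_data_transferred(d_f, n, k, alpha, beta):
    """Effective data transferred D_T = k * (D_F)^alpha * (N)^beta.

    d_f:            fine-tuning dataset size
    n:              number of model parameters
    k, alpha, beta: fitted constants (left as inputs here, not the paper's fits)
    """
    return k * (d_f ** alpha) * (n ** beta)
```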
In their experiments, β doesn’t change between pre-training on natural language and pre-training on a mixture of natural language and non-Python code. They hypothesize that β measures how the model architecture generalizes on the target distribution, and doesn’t depend on the contents of the pre-training data.
The authors think that α is a measure of the directed proximity of the pre-training and from-scratch distributions, with smaller α indicating closer proximity. Measuring α can be done cheaply by changing the fine-tuning dataset size while holding the pre-trained model constant, making it useful for deciding between collecting more fine-tuning data and increasing model size. For pre-training on natural language and fine-tuning on Python, β is about 2α, so for decreasing loss, increasing the fine-tuning dataset size by a factor of C (e.g., 100x) would be worth approximately the same as increasing the model size by √C (e.g., 10x).
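To see why that tradeoff falls out of the functional form, here is a quick check using the sketch above in the β = 2α regime (the constants are illustrative only, not the paper's fits):

```python
k, alpha = 1.0, 0.2                 # illustrative constants only
beta = 2 * alpha                    # the "beta is about 2*alpha" regime

d_f, n = 1e6, 1e8                   # some baseline fine-tuning data and model size
base = effective_data_transferred(d_f, n, k, alpha, beta)

more_data    = effective_data_transferred(100 * d_f, n, k, alpha, beta)
bigger_model = effective_data_transferred(d_f, 10 * n, k, alpha, beta)

# Both interventions scale D_T by the same factor, since 100**alpha == 10**(2*alpha):
print(more_data / base)             # 100**0.2 ≈ 2.51
print(bigger_model / base)          # 10**0.4  ≈ 2.51
```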
The authors find that pre-training on a mixture of natural language and non-Python code has a higher k but lower α than pre-training on natural language alone. The higher k indicates that the mixture model has better transfer performance when trained in the low data regime, while the lower α value means that benefits of the mixture model diminish as more data is used.
The authors also observe that:
- Not counting pre-training compute, pre-trained models are generally more compute efficient than from-scratch models when trained in the low data regime, approximately as compute efficient in the medium data regime, and less compute efficient in the high data regime (close to convergence).
- Small pre-trained models perform worse than small from-scratch models in the high data regime. The authors call this phenomenon “ossification”—a term used to suggest that small pre-trained models may have a hard time moving away from bad initializations.
- In general, pre-trained models of a given size are compute efficient (on the frontier of loss given compute) for a large portion of their fine-tuning. From-scratch models, by contrast, are only compute efficient for a narrow window of training—using too little compute for a given model dramatically increases loss and suggests that you should instead be using a smaller model. This makes pre-trained models in some sense “easier” to train.
Read more: Twitter thread
Asya’s opinion: It’s extremely cool to have a mathematical characterization of the power of pre-training. I would love to see similar analyses measuring effective data transferred for tasks other than predicting the next token—if it turns out that modest increases in model size compensate for small datasets in a wide variety of tasks, that makes a strong case that unsupervised learning will be most of the work towards transformative abilities.
Reading these scaling papers really makes me think that there’s some deeper theoretical understanding of distributions and predictive models that these results are accessing, maybe encapsulated by this theory paper that I still haven’t read...
Rohin’s opinion: Like Asya, I really like the simplicity of the functional form of the scaling law, and the fits to the data seem quite good. I was quite surprised that β seemed to be independent of the pretraining distribution; I would not have predicted that in advance.
Note that despite the form of the scaling law, the effective data transferred isn't independent of the pretraining dataset size—that dependence effectively comes through the model size N. This is because the authors use compute-optimal pretraining, and so a given model size N corresponds to a specific amount of pretraining compute, which is probably almost the same as having a specific amount of pretraining data.
I am confused by the claim that distributions that are “closer” have lower α. It does make sense that for identical distributions, we should have α = 0. However, if closer distributions have lower α, and β is independent of the distribution, then as your finetuning dataset gets sufficiently large, you actually prefer the further-away distribution! For example, once your finetuning dataset hits 10 trillion points, the scaling laws predict that you prefer to have pretrained on text, rather than 50% text and 50% code, which seems really bizarre. Possibly the scaling laws break down before that point, and in any case a finetuning dataset of 10 trillion data points would be ridiculously massive, but it still seems like something needs to be explained here.
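To make the crossover concrete, here is a back-of-the-envelope sketch with made-up constants (chosen only to illustrate the shape of the argument, not taken from the paper's fits): since β is shared, model size cancels out, and the text-only curve overtakes the mixture curve once D_F exceeds (k_mix / k_text)^(1 / (alpha_text - alpha_mix)).

```python
# Made-up constants for illustration only (not the paper's fitted values).
# The mixture pre-training has a higher k but a lower alpha; beta is shared.
k_text, alpha_text = 1.0e4, 0.18
k_mix,  alpha_mix  = 1.0e5, 0.10

# Fine-tuning dataset size at which text-only pre-training starts to give
# more effective data transferred than the text+code mixture:
crossover = (k_mix / k_text) ** (1 / (alpha_text - alpha_mix))
print(f"{crossover:.1e}")   # ~3.2e12 with these made-up constants
```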
Could we use this to improve timelines estimates? I take a stab at a calculation here; the very rough and uncertain conclusion is that while transfer does seem to be a useful way to reduce compute requirements, the overall effect is not large.
TECHNICAL AI ALIGNMENT
PROBLEMS
Challenges of Aligning Artificial Intelligence with Human Values (Margit Sutrop) (summarized by Rohin): This paper argues that since immoral humans could use AI systems to do harm, we must build ethical rules into AI systems. For this purpose, the traditional notion of “value alignment” is not enough, as it only requires that the AI system do what the user wants, which might not be ethical. But we also cannot embed a single theory of ethics into an AI system, as there is no agreement on such a theory. Instead, we should focus on what we don’t want an AI system to do, and rule out that behavior, while remaining uncertain or agnostic on what should be done.
Rohin’s opinion: I agree that successful AI alignment does not rule out the possibility of malicious use of AI systems. This paper is proposing putting rules inside the AI system that handle this problem. But as the paper itself notes, it seems quite challenging to even figure out what rules we would want.
I personally am more optimistic about the alternate route, where we put rules outside the AI system to handle the problem: that is, we create laws and regulations around the use of AI, so that malicious uses can be blamed on the human operator, along with enforcement mechanisms that ensure we actually catch and punish such malicious uses.
PREVENTING BAD BEHAVIOR
Challenges for Using Impact Regularizers to Avoid Negative Side Effects (David Lindner, Kyle Matoba, and Alexander Meulemans) (summarized by Rohin): I’m not summarizing this literature review on impact regularization because we’ve covered almost all of the ideas previously in this newsletter (e.g. this blog post (AN #49)). However, I do recommend it for its short, high-level introduction to existing ideas in impact regularization, as well as its ideas for future work.
HANDLING GROUPS OF AGENTS
Norms for beneficial A.I.: A computational analysis of the societal value alignment problem (Pedro Fernandes et al) (summarized by Rohin): This paper presents a simple quantitative model to argue for the following two observations (see the toy sketch after the list):
1. Unless they are willing to “fall behind” others, individual actors will need to use AI systems to stay competitive.
2. Those AI systems will optimize for their owner’s goals, even though a better outcome could be achieved if all AI systems optimized for the average welfare across all actors.
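As a toy illustration of these two observations (my own sketch of the underlying social dilemma, not the paper's actual model), suppose each of two actors chooses whether their AI optimizes for the owner's own goals or for average welfare:

```python
# Toy illustration only -- NOT the model from the paper. Each of two actors
# chooses whether their AI optimizes for the owner's own goals ("selfish")
# or for average welfare across all actors ("aligned").
PAYOFFS = {                           # (actor_1, actor_2) payoffs
    ("selfish", "selfish"): (2, 2),
    ("selfish", "aligned"): (4, 1),   # the selfish owner pulls ahead
    ("aligned", "selfish"): (1, 4),   # the aligned owner falls behind
    ("aligned", "aligned"): (3, 3),   # best average welfare
}

def best_response(other):
    """Actor 1's individually rational choice, given actor 2's choice."""
    return max(("selfish", "aligned"),
               key=lambda c: PAYOFFS[(c, other)][0])

# "selfish" is the best response either way (echoing observation 1: keep up
# or fall behind), yet mutual selfishness gives average welfare 2, below the
# 3 achievable if all AIs optimized for average welfare (observation 2).
print(best_response("selfish"), best_response("aligned"))  # selfish selfish
```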
MISCELLANEOUS (ALIGNMENT)
Distinguishing claims about training vs deployment (Richard Ngo) (summarized by Rohin): One story for AGI is that we train an AI system on some objective function, such as an objective that rewards the agent for following commands given to it by humans using natural language. We then deploy the system without any function that produces reward values; we instead give the trained agent commands in natural language. Many key claims in AI alignment benefit from more precisely stating whether they apply during training or during deployment.
For example, consider the instrumental convergence argument. The author proposes that we instead think of the training convergence thesis: a wide range of environments in which we could train an AGI will lead to the development of goal-directed behavior aimed towards certain convergent goals (such as self-preservation). This could happen either via the AGI internalizing them directly as final goals, or by the AGI learning final goals for which these goals are instrumental.
The author similarly clarifies goal specification, the orthogonality thesis, fragility of value, and Goodhart’s Law.
Putting the humanity into inhuman systems: How human factors and ergonomics can be used to manage the risks associated with artificial general intelligence (Paul M. Salmon et al) (summarized by Rohin): This paper argues that the methods of Human Factors and Ergonomics (HFE) should be applied to AGI safety. They list fifteen different methods from the field, typically used to analyze the performance of humans in systems, which could be applied to AGI instead (on the assumption that AGI will be more like humans than like machines in today’s systems). They then give examples of how these might be applied to the Prometheus story in the prologue of Life 3.0.
Rohin’s opinion: I’m not very familiar with this field, but among other techniques the paper mentions STAMP and STPA which we’ve previously seen in Engineering a Safer World (AN #112). It does seem to me like these techniques would be useful to apply to the entire sociotechnical system, of which an AGI system is just one part (and this is what the paper’s examples do). It is less clear to me whether it makes sense to take techniques designed for humans and apply them to AGI: perhaps we’ll have enough understanding of the differences between humans and AGI that we could do this in a reasonable way, but I think there is a real risk that the methods give incorrect conclusions simply because they make incorrect assumptions about how AGI works (given that they were designed for humans). Nonetheless, I do agree with the core claim of this paper that HFE is worth exploring.
The Challenge of Value Alignment: from Fairer Algorithms to AI Safety (Iason Gabriel et al) (summarized by Rohin): This book chapter provides an introduction to AI alignment from a philosophical lens.
NEAR-TERM CONCERNS
RECOMMENDER SYSTEMS
Beyond Engagement: Aligning Algorithmic Recommendations With Prosocial Goals (Jonathan Stray) (summarized by Rohin): To decide what item to show a user, a recommender system needs to have some metric by which to rank items. Since this metric must usually be automated, it is typically based in large part on some operationalization of “engagement”. Unfortunately, such a metric may not be able to differentiate between clickbait or extremist content on the one hand, and actually valuable posts on the other. In a workshop on the topic, participants brainstormed five main approaches for improvement:
1. Build better controls: Offer users more and better ways to control their feed.
2. Develop standardized survey-based metrics: Surveys should be able to get a significantly higher quality signal to optimize than engagement.
3. Pay users for better data, such as survey data.
4. Recommend feeds, not items: If we rank items individually, it is quite likely that all the posts of the same type (e.g. controversial posts) will get high scores. By ranking entire feeds, we can also optimize for the diversity of items within the feed (see the sketch after this list).
5. Incentivize the creation of different feeds, so that users can choose which ones they prefer all things considered.
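As a rough sketch of point 4 (my own illustration of the idea, with hypothetical item scores and topics, not an algorithm from the post), ranking whole candidate feeds lets a diversity term enter the score, whereas item-by-item ranking cannot notice that the top items are all of one type:

```python
from itertools import combinations

# Hypothetical per-item (engagement score, topic) pairs -- illustration only.
items = {
    "hot_take_1": (0.90, "controversy"),
    "hot_take_2": (0.85, "controversy"),
    "explainer":  (0.70, "science"),
    "local_news": (0.60, "news"),
}

def item_ranking(k=2):
    """Rank items independently: the top k may all be the same type."""
    return sorted(items, key=lambda i: items[i][0], reverse=True)[:k]

def feed_ranking(k=2, diversity_weight=0.5):
    """Score whole candidate feeds: total engagement plus a diversity bonus."""
    def score(feed):
        engagement = sum(items[i][0] for i in feed)
        diversity = len({items[i][1] for i in feed})  # number of distinct topics
        return engagement + diversity_weight * diversity
    return max(combinations(items, k), key=score)

print(item_ranking())   # ['hot_take_1', 'hot_take_2'] -- both controversial
print(feed_ranking())   # ('hot_take_1', 'explainer') -- a more diverse feed
```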
What are you optimizing for? Aligning Recommender Systems with Human Values (Jonathan Stray et al) (summarized by Rohin): While the previous blog post focused on societal-level approaches to recommender systems, this paper looks at what can be done at a technical level. By analyzing existing case studies of improvements to recommender systems (some of which we’ve seen before (AN #96)), the authors identify a typical approach taken in industry today.
First, engineers identify a problem with an already-deployed recommendation engine, perhaps from user feedback, or through monitoring. Second, they develop a concrete procedure to identify instances of this problem in the recommendations—a typical approach is to curate a dataset and train an ML classifier to identify these instances, though it is also possible to use manual review. Finally, the recommender system is adjusted to avoid the problem, for example by adding a term to the objective when training the recommender system, or by filtering its outputs based on the classifier’s output.
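A minimal sketch of the last step of that pipeline (hypothetical names and a stand-in classifier, not code from the paper): given a classifier that flags problem instances, the recommender's candidates can either be filtered outright or penalized in their ranking score:

```python
def filter_candidates(candidates, problem_classifier, threshold=0.5):
    """Hard filter: drop candidates the classifier flags as problematic."""
    return [c for c in candidates if problem_classifier(c) < threshold]

def penalized_score(candidate, base_score, problem_classifier, weight=1.0):
    """Soft alternative: subtract a penalty term from the ranking score,
    analogous to adding a term to the training objective."""
    return base_score - weight * problem_classifier(candidate)

# Stand-in classifier for the sketch; in practice this would be an ML model
# trained on a curated dataset of problem instances (e.g. clickbait).
def toy_classifier(candidate):
    return 0.9 if "you won't believe" in candidate.lower() else 0.1

posts = ["You won't believe this one trick!", "A careful explainer on batteries"]
print(filter_candidates(posts, toy_classifier))                 # keeps the explainer
print([penalized_score(p, 0.8, toy_classifier) for p in posts])
```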
The authors then propose four high-level technical approaches to recommender alignment:
1. Develop better measures of what we want out of a recommendation engine, for example, an operationalization of “well-being” rather than “engagement”.
2. Allow users to collaboratively design the recommendation engine (called participatory design). Rather than have a company decide on how to trade off between different objectives, allow the users to settle upon the appropriate tradeoffs themselves.
3. Interactively learn about the user’s values. While this could look like building better controls as suggested in the previous post, it could also involve e.g. using Inverse Reward Design (AN #69) to maintain appropriate uncertainty over what the user cares about.
4. Design around “time well spent”, as evaluated by users on reflection or after consideration, rather than revealed preferences or immediate judgments. For example, we could show users a summary of their activity over the past month and ask how happy they are about it.
Read more: Aligning AI Optimization to Community Well-Being (an expanded version of Aligning AI to Human Values means Picking the Right Metrics (AN #96))
Rohin’s opinion: Both this paper and the previous post seem like meaningful progress in ideas for making recommendation engines better (though as a caveat, I don’t follow this space and so don’t know to what extent this has been said before). I’m glad that we’re getting to the point of actually proposing technical solutions; I hope to see more papers implementing such solutions (we’ve seen one from Twitter recently (AN #123)).
FEEDBACK
I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
PODCAST
An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.