Why aren’t you doing research on making pre-training better for alignment?
I was on a call today, and we talked about projects that involve studying how pre-trained models evolve throughout training and how we could guide the pre-training process to make models safer. For example, could training on synthetic/transformed data make models significantly more robust and essentially solve jailbreaking? What about the intersection of pretraining from human preferences and synthetic data? Could the resulting model be significantly easier to control? How would it impact the downstream RL process? Could we imagine a setting where we don’t need RL at all (or at least where we’d be confident enough in the resulting models to use them to automate alignment research)? I think many interesting projects could fall out of this work.
So, back to my main question: why aren’t you doing research on making pre-training better for alignment? Is it because it’s too expensive and doesn’t seem like low-hanging fruit? Or do you feel it isn’t a plausible direction for aligning models?
We were wondering whether removing certain technical bottlenecks would make this kind of research more feasible, so that alignment researchers could better study how to guide the pretraining process in a way that benefits alignment. For instance, would researchers be more inclined to run experiments in this direction if the entire pre-training codebase were handled for them, leaving them free to focus on their specific research question? And if a large amount of compute were available (say, through government resources) for data labeling/filtering and pre-training multiple models, would this kind of work become more attractive to pursue?
I think many alignment research directions have grown simply because they had low-hanging fruit that didn’t require much compute (e.g., evals and mech interp). It seems we’ve implicitly left all of the high-compute projects for the AGI labs to figure out. But what if we weren’t as bottlenecked on this anymore? It’s now possible to retrain GPT-2 1.5B for under $700 (and the 125M version for $20). I think we can find ways to do useful experiments, but my guess is that the level of technical expertise required is fairly high, and alignment researchers would rather avoid these kinds of projects while they remain high-effort.
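For intuition on the cost claim, here is a rough back-of-the-envelope sketch. The cited $700 and $20 figures come from optimized reproductions; this is only a cross-check under assumed hardware, throughput, and token-budget numbers, all of which are placeholders rather than measurements.

```python
# Rough estimate of dense-transformer pre-training cost (illustrative only).
# Every number below is an assumption, not a measurement from the runs cited above.

def training_cost_usd(n_params, n_tokens, gpu_flops_per_s, mfu, gpu_price_per_hour, n_gpus):
    """Estimate cost with the standard ~6 * N * D training-FLOPs approximation."""
    total_flops = 6 * n_params * n_tokens
    effective_flops_per_s = gpu_flops_per_s * mfu * n_gpus
    hours = total_flops / effective_flops_per_s / 3600
    return hours * gpu_price_per_hour * n_gpus

# Assumed setup: 8x H100 at bf16 (~989 TFLOP/s each), 40% MFU, $2.50 per GPU-hour.
cost = training_cost_usd(
    n_params=1.5e9,        # GPT-2 1.5B
    n_tokens=30e9,         # assumed token budget for a decent reproduction
    gpu_flops_per_s=989e12,
    mfu=0.40,
    gpu_price_per_hour=2.5,
    n_gpus=8,
)
print(f"~${cost:,.0f}")   # a few hundred dollars under these assumptions, consistent with <$700
```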
I talk about other related projects here.
I’ve synthesized various resources for this “pre-training for alignment” type of work:
Data
Synthetic Data
The RetroInstruct Guide To Synthetic Text Data
Alignment In The Age of Synthetic Data
Leveraging Agentic AI for Synthetic Data Generation
**AutoEvol**: Automatic Instruction Evolving for Large Language Models. We build a fully automated Evol-Instruct pipeline to create high-quality, highly complex instruction tuning data
Synthetic Data Generation and AI Feedback notebook
The impact of models training on their own outputs and how it’s actually done well in practice
Google presents Best Practices and Lessons Learned on Synthetic Data for Language Models
Transformed/Enrichment of Data
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. TLDR: you can train 3x faster and with up to 10x less data using just synthetic rephrases of the web!
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Rho-1: Not All Tokens Are What You Need. RHO-1-1B and 7B achieve SotA results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens.
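To make the Rho-1 idea concrete, here is a minimal sketch of selective language modeling as I understand it from the paper’s description: score each token against a reference model and only backpropagate the loss on the tokens where the training model’s excess loss over the reference is largest. The keep ratio and the exact scoring rule are assumptions; this is not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(model_logits, ref_logits, target_ids, keep_ratio=0.6):
    """Selective language modeling in the spirit of Rho-1 (sketch, not the paper's code).

    model_logits, ref_logits: (batch, seq, vocab) logits from the training and reference models.
    target_ids: (batch, seq) next-token targets.
    Only the tokens where the training model lags the reference the most contribute to the loss.
    """
    vocab = model_logits.size(-1)
    loss_model = F.cross_entropy(
        model_logits.reshape(-1, vocab), target_ids.reshape(-1), reduction="none"
    )
    with torch.no_grad():
        loss_ref = F.cross_entropy(
            ref_logits.reshape(-1, vocab), target_ids.reshape(-1), reduction="none"
        )
        excess = loss_model.detach() - loss_ref        # proxy for how much useful signal a token carries
        k = max(1, int(keep_ratio * excess.numel()))
        threshold = torch.topk(excess, k).values.min()
        mask = (excess >= threshold).float()
    return (loss_model * mask).sum() / mask.sum()
```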
Data Attribution
In-Run Data Shapley
Scaling Laws for the Value of Individual Data Points in Machine Learning. We show how some data points are only valuable in small training sets; others only shine in large datasets.
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
Data Mixtures
Methods for finding the optimal data mixture
RegMix: Data Mixture as Regression for Language Model Pre-training
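A minimal sketch of the RegMix-style idea: run a handful of cheap proxy pre-training runs with different mixture weights, fit a regression from mixture weights to validation loss, then pick the mixture the regression predicts to be best. I use a plain least-squares fit here rather than the paper’s regressor, and the domains and recorded losses below are placeholders just to make the sketch runnable.

```python
import numpy as np

# Placeholder proxy-run data: in practice, `mixtures` are the weights you actually trained
# small models on, and `val_losses` are the measured validation losses of those runs.
domains = ["web", "code", "books", "synthetic_rephrases"]
mixtures = np.random.dirichlet(np.ones(len(domains)), size=32)
val_losses = np.random.uniform(2.8, 3.4, size=32)   # placeholder; replace with measured losses

# Fit loss ~ w . mixture + b via least squares.
X = np.hstack([mixtures, np.ones((len(mixtures), 1))])
coef, *_ = np.linalg.lstsq(X, val_losses, rcond=None)

# Search candidate mixtures and keep the one with the lowest predicted loss.
candidates = np.random.dirichlet(np.ones(len(domains)), size=100_000)
preds = np.hstack([candidates, np.ones((len(candidates), 1))]) @ coef
best = candidates[preds.argmin()]
print(dict(zip(domains, best.round(3))))
```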
Curriculum Learning
On transforming data into a curriculum to improve learning efficiency and capability
Curriculum learning that actually works?
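As one very simplified example of what a data-driven curriculum could look like: score each document with a proxy difficulty measure, such as the loss a small reference model assigns to it, and feed easier documents first. Whether this particular ordering actually helps is exactly the open question in the links above; the scoring model, the easy-to-hard direction, and the HuggingFace-style causal LM interface (a model that returns `.loss` when given `labels`) are assumptions.

```python
import torch

@torch.no_grad()
def difficulty_scores(ref_model, tokenized_docs):
    """Score each document by its mean per-token loss under a small reference model."""
    scores = []
    for input_ids in tokenized_docs:                   # each: (1, seq_len) tensor
        out = ref_model(input_ids=input_ids, labels=input_ids)
        scores.append(out.loss.item())
    return scores

def curriculum_order(tokenized_docs, scores, easy_first=True):
    """Sort documents by difficulty; a real curriculum would interleave or anneal rather than hard-sort."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=not easy_first)
    return [tokenized_docs[i] for i in order]
```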
Active Data Selection
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models. MATES significantly elevates the scaling curve by selecting data based on the model’s evolving needs.
Data Filtering
Scaling Laws for Data Filtering—Data Curation cannot be Compute Agnostic. Argues that data curation cannot be agnostic of the total compute a model will be trained for. GitHub
How to Train Data-Efficient LLMs. Models trained on ASK-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, while converging up to 70% faster.
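A minimal sketch of ASK-LLM-style filtering as described in the abstract: prompt an instruction-tuned model to judge whether a candidate pre-training example is informative enough to keep, and drop the rest. The `judge` callable is a stand-in for whatever LLM endpoint you have, and the prompt wording and keep threshold are my assumptions, not the paper’s.

```python
from typing import Callable, Iterable, List

ASK_PROMPT = (
    "Here is a candidate pre-training example:\n###\n{example}\n###\n"
    "Would this example be informative for training a capable, well-behaved language model? "
    "Answer with a probability between 0 and 1."
)

def ask_llm_filter(
    examples: Iterable[str],
    judge: Callable[[str], float],
    keep_threshold: float = 0.5,
) -> List[str]:
    """Keep only the examples the judge model scores above the threshold.

    `judge` is assumed to take a prompt string and return a float in [0, 1],
    e.g. parsed from an API response or a yes-token probability.
    """
    kept = []
    for ex in examples:
        score = judge(ASK_PROMPT.format(example=ex))
        if score >= keep_threshold:
            kept.append(ex)
    return kept
```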
On Pre-Training
Pre-Training from Human Preferences (a minimal conditional-training sketch follows at the end of this list)
Ethan Perez wondering if jailbreaks would be solved with this pre-training approach
LAION uses this approach for fine-grained control over outputs during inference.
Nora Belrose thinks that alignment via pre-training would make models more robust to unlearning (she doesn’t frame it this way, but that could be a good thing if you pre-train such that unlearning is never needed)
Tomek describing some research directions for improving pre-training alignment
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Neural Networks Learn Statistics of Increasing Complexity
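For “Pre-Training from Human Preferences”, here is a minimal sketch of the conditional-training variant that paper describes: score pre-training text with a preference/reward model and prepend a control token marking it as good or bad, then condition on the good token at inference. The token strings, the document-level granularity, the threshold, and the `reward_fn` stand-in are my assumptions for illustration, not the paper’s exact setup.

```python
GOOD_TOKEN, BAD_TOKEN = "<|good|>", "<|bad|>"

def tag_documents(documents, reward_fn, threshold=0.0):
    """Conditional-training data prep in the spirit of Pre-Training from Human Preferences (sketch).

    `reward_fn` stands in for a learned preference/reward model scoring each document;
    documents at or above the threshold are tagged as good, the rest as bad. The model is
    then pre-trained on the tagged text as usual.
    """
    tagged = []
    for doc in documents:
        tag = GOOD_TOKEN if reward_fn(doc) >= threshold else BAD_TOKEN
        tagged.append(f"{tag}{doc}")
    return tagged

# At inference time (assuming the tokenizer was extended with the control tokens),
# you would prepend GOOD_TOKEN to the prompt to condition on preferred behavior.
```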
Pre-Training towards the basin of attraction for alignment
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis
Requirements for a Basin of Attraction to Alignment
A “Bitter Lesson” Approach to Aligning AGI and ASI
Alignment techniques
AlignEZ: Using the self-generated preference data, we identify the subspaces that (1) facilitate alignment and (2) are harmful to it. During inference, we surgically modify the LM embedding using these identified subspaces. Jacques’ note: could we apply this iteratively throughout training (and other similar methods)?
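To make this (and the question of applying it during training) more concrete, here is a minimal sketch of the general “identify a direction from preference pairs, then edit hidden states with it” pattern: take hidden states for preferred versus dispreferred responses, use the top principal direction of their differences as a helpful direction, and nudge activations along it at inference. This is a generic difference-of-activations sketch, not the AlignEZ authors’ exact procedure; the layer choice and edit strength are assumptions.

```python
import torch

def preference_direction(h_preferred, h_dispreferred):
    """Estimate an alignment-relevant direction from paired hidden states.

    h_preferred, h_dispreferred: (n_pairs, hidden_dim) hidden states (e.g. mean-pooled from
    one layer) for self-generated preferred / dispreferred responses to the same prompts.
    Returns a unit vector: the top singular direction of the centered differences.
    """
    diffs = h_preferred - h_dispreferred
    _, _, vh = torch.linalg.svd(diffs - diffs.mean(0, keepdim=True), full_matrices=False)
    return vh[0] / vh[0].norm()

def edit_hidden_state(h, direction, strength=4.0):
    """Nudge a hidden state along the helpful direction at inference (e.g. via a forward hook)."""
    return h + strength * direction
```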
What do we mean by “alignment”? What makes the model safe?
Values
What does it mean for a model to have a value?
On making the model “care”
GPT-2 1.5B is small by today’s standards. I hypothesize that people are unsure whether findings at this scale will generalize to frontier models (or at least to the level of LLaMa-3.1-70B), and that’s why nobody is working on it.
However, I was impressed by “Pre-Training from Human Preferences”. I suspect that pretraining could be improved along these lines, and that it would be a massive deal for alignment.
One key question here, I think, regarding how to guide the pretraining process in a way that benefits alignment: a major historical alignment concern has been that for any given finite set of outputs, there are unboundedly many functions that could produce it, so it’s hard to be sure that a model will generalize in a desirable way. Nora Belrose goes so far as to suggest that ‘Alignment worries are quite literally a special case of worries about generalization.’ This is relevant for post-training, but I think even more so for pre-training.
I know there’s been research into how neural networks generalize, from both the AIS community and the broader ML community, but I’m not very familiar with it; hopefully someone else can provide good references here.