Recent paper I thought was cool:
In-Run Data Shapley: a data attribution method efficient enough to use on pre-training.
Essentially, it can track how individual data points (or clusters) impact model performance across pre-training. You just need to develop a set of validation examples and continually check the model’s performance on them during pre-training. Amazingly, you can do this over the course of a single training run; no need for the multiple pre-training runs that other data attribution methods have required.
Other methods, like influence functions, are too computationally expensive to run during pre-training and can only be run post-training.
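To make the mechanics concrete, here is a minimal (and deliberately naive) PyTorch sketch of the first-order idea as I understand it: at each step, an example’s contribution is approximated by the dot product of its gradient with the validation-loss gradient, accumulated over the run. The function name, the per-example loop, and the assumed `model`/`loss_fn` interface are my own simplifications, not the paper’s implementation; the actual method recovers these dot products far more cheaply within a single backward pass.

```python
import torch

def attribution_step(model, loss_fn, xs, ys, example_ids, xv, yv, lr, scores):
    """Accumulate first-order attribution scores for one training step.

    scores: dict mapping example id -> running estimate of how much that
    example's updates have reduced the validation loss so far (positive = helpful).
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the validation loss at the current parameters.
    val_loss = loss_fn(model(xv), yv)
    val_grads = torch.autograd.grad(val_loss, params)

    # Per-example training gradients. This naive loop costs one backward pass
    # per example; the paper gets the same dot products from a single pass.
    for i, ex_id in enumerate(example_ids):
        loss_i = loss_fn(model(xs[i:i + 1]), ys[i:i + 1])
        grads_i = torch.autograd.grad(loss_i, params)
        # First-order estimate: an SGD step on this example alone moves the
        # parameters by -lr * grad_i, changing the validation loss by roughly
        # -lr * <val_grad, grad_i>. We store the implied reduction in val loss.
        contrib = lr * sum((gv * gi).sum() for gv, gi in zip(val_grads, grads_i))
        scores[ex_id] = scores.get(ex_id, 0.0) + contrib.item()
    return scores
```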
So, here’s why this might be interesting from an alignment perspective:
You might be able to set up a bunch of validation examples to test specific behaviour in the models so that we are hyper-aware of which data points contribute the most to that behaviour. For example, self-awareness or self-preservation.
Given that this is possible to run during pre-training, you might understand model behaviour at such a granular level that you can construct data mixtures/curriculums that push the model towards internalizing ‘human values’ much sooner than it develops behaviours or capabilities we wouldn’t want. Or you could delay self-awareness and similar capabilities until much later in the training process.
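As a toy illustration of what acting on those scores might look like (everything here, including the threshold and the assumption that `examples` is a list of (id, example) pairs, is a made-up placeholder rather than anything from the paper): once you have accumulated per-example scores against a behaviour-specific validation set, you could defer or drop the examples that contribute most to that behaviour when assembling the next training shard.

```python
def rebalance_next_shard(examples, scores, behaviour_threshold=0.0):
    """Toy curriculum step: split the candidate pool for the next shard into
    examples to keep now and examples to defer (or drop), based on their
    accumulated attribution toward an unwanted behaviour's validation set."""
    keep, defer = [], []
    for ex_id, example in examples:
        if scores.get(ex_id, 0.0) > behaviour_threshold:
            defer.append(example)  # train on this later, or not at all
        else:
            keep.append(example)
    return keep, defer
```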
In this @RogerDearnaley post, A “Bitter Lesson” Approach to Aligning AGI and ASI, Roger proposes training an AI on a synthetic dataset where all intelligences are motivated by the collective well-being of humanity. You are trying to bias the model to be as close to the basin of attraction for alignment as possible. In-Run Data Shapley could be used to construct such a dataset and guide the training process so that the training data best exemplifies the desired aligned behaviour.
If you are interested in this kind of research, let me know! I’d love to brainstorm some potential projects and then apply for funding if there is something promising there.
I sent some related project ideas to @RogerDearnaley via DMs, but figured I should share them here too, in case someone would like to give feedback or collaborate on one of them.
I think data is underrated among the alignment community (synthetic/transformed data even more). I have been thinking about it from the perspective of pre-training and post-training. My initial look into synthetic data was related to online learning and essentially controlling model behaviour. I was interested in papers like this one by Google, where they significantly reduce sycophancy in an LLM via 1k synthetically generated examples. Data shapes behaviour, and I think many people do not acknowledge this enough (which sometimes leads them to make confused conclusions about model behaviour).
In terms of specific research projects, my current ideas fall into these kinds of buckets:
Pre-training close to the basin of attraction for alignment
How much can we improve “Pretraining Language Models with Human Preferences”? I’d like to transform the training data in various ways (as mentioned in your posts). For example, I could take fineweb and pre-train a GPT-2 sized model on the original dataset and on a transformed version. It’s unclear so far which things I’d like to measure the most at that model size, though. A downstream experiment: is one model more likely to reward hack than the other? Does shard theory help us come up with useful experiments (pre-training with human feedback is almost like reinforcing behaviour and leveraging some form of shard theory)? Note that Google used a similar pre-training scheme for PaLM 2. (A minimal sketch of the conditional-training idea is below, after this list.)
How can the “basin of attraction for alignment” be mathematically formalized?
Trying to understand the impact of systematic errors:
Studying reward misspecification: do the reward labels have a systematic effect and bias in pushing the model? How much of the model’s behaviour is determined by the data itself vs. the reward model’s misspecification? My current reading of the literature on this is a bit unclear. However, there’s a paper saying: “We present a novel observation about the behaviour of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with “wrong” reward labels, such as those that are zero everywhere or are negatives of the true rewards.”
How do we design the training curriculum to significantly bias the model’s pre-training close to the basin of attraction for alignment?
Studying some form of iterative training where we have a synthetically trained model vs a normally trained model and then measure things like model drift. For example, is the model more likely to drift (in an online setting) in ways we wouldn’t want it to if it is pre-trained on normal text, but the process is more safely guided through synthetic pre-training?
Part of the alignment challenge (for example, the concern of scheming AIs) is that the order in which the model learns things might matter. For example, you’d want the model to internalize a solid world model of human values before it gains the situational awareness required to manipulate its training process (scheme). So, can we design a training curriculum for specific capabilities s.t. the model learns capabilities in an ideal sequence?
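Since a couple of the items above lean on “Pretraining Language Models with Human Preferences”, here is a minimal sketch of its best-performing objective, conditional training, as I understand it: each pretraining document is prefixed with a control token chosen by a scorer, and the model is trained with the ordinary next-token objective on the tagged text; at sampling time you condition on the “good” token. The `preference_score` callable below is a placeholder for whatever classifier defines the preferred behaviour (the original paper used task-specific scorers such as toxicity and PEP8 compliance).

```python
# Minimal sketch of conditional training (Korbak et al., "Pretraining Language
# Models with Human Preferences"): tag each document with a control token and
# pretrain with the usual next-token-prediction loss on the tagged text.
GOOD, BAD = "<|good|>", "<|bad|>"

def tag_document(text: str, preference_score, threshold: float = 0.5) -> str:
    """Prefix a document with <|good|> or <|bad|> based on a scorer.

    preference_score: callable mapping text -> [0, 1]; a placeholder for the
    classifier that defines the preferred behaviour (toxicity, sycophancy, ...).
    """
    tag = GOOD if preference_score(text) >= threshold else BAD
    return tag + text

def tag_corpus(docs, preference_score):
    # The tagged corpus then goes through ordinary pretraining; at inference,
    # prompts are prefixed with <|good|> to elicit the preferred distribution.
    return [tag_document(doc, preference_score) for doc in docs]
```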
Data attribution project ideas
How to make this approach work in tandem with unlearning?
Use data attribution methods to understand how specific data shapes model behaviour and use that information to reconstruct pre-training to shape model behaviour in the way we want. For example, can we side-step the need for unlearning? Can these data attribution methods augment unlearning to work better?
As Roger said in his comment, we can try to manage the dataset to prevent WMD-dangerous capabilities and things like self-replication. It’s quite possible that unlearning will not be enough.
Another project would be to fine-tune on datasets with and without the dangerous capabilities we don’t want and use that as a benchmark for unlearning methods (and for how easy it is to fine-tune the capability back into the model); see the elicitation sketch after this list.
Including other methods beyond data attribution (e.g. SAEs) to measure model evolution through training.
Is it possible to better understand and predict emergence via data attribution?
Studying model generalization via data attribution (doing similar things to the influence functions paper, but through time). Though the most interesting behaviour may only come at scales I wouldn’t have the compute for.
Would there be value in using an early checkpoint in training and then training on the synthetic data from that point forward? At which point in training does this make sense to do?
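For the unlearning-benchmark idea above, here is a rough sketch of the evaluation I have in mind (the `finetune` and `evaluate_capability` callables are placeholders for whatever training and eval harness gets used): measure how quickly an unwanted capability can be fine-tuned back in, and compare a model whose pretraining data was filtered against one that was unlearned after the fact.

```python
def elicitation_curve(model, dangerous_ft_set, capability_eval_set,
                      finetune, evaluate_capability, steps=(0, 100, 500, 1000)):
    """Capability score after increasing amounts of adversarial fine-tuning.

    A model where the capability is truly absent (e.g. via data filtering)
    should stay low on this curve much longer than one that was only
    "unlearned" after training.
    """
    curve = []
    for n in steps:
        attacked = finetune(model, dangerous_ft_set, num_steps=n)
        curve.append((n, evaluate_capability(attacked, capability_eval_set)))
    return curve

# Usage sketch (all names hypothetical):
# filtered_curve  = elicitation_curve(filtered_model,  danger_ft, danger_eval, finetune, evaluate_capability)
# unlearned_curve = elicitation_curve(unlearned_model, danger_ft, danger_eval, finetune, evaluate_capability)
```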
I love this idea! Thanks for suggesting it. (It is, of course, not a Bitter Lesson approach, but may well still be a great idea.)
Another area where being able to do this efficiently at scale is going to be really important is once models start showing dangerous levels of capability on WMD-dangerous chem/bio/radiological/nuclear (CBRN) and self-replication skills. The best way to deal with this is to make sure these skills aren’t in the model at all, so the model can’t be fine-tuned back to these capabilities (as is required to produce a model of this level where one could at least discuss open-sourcing it, rather than that being just flagrantly crazy and arguably perhaps already illegal); that means omitting key knowledge from the training set entirely. That inevitably isn’t going to succeed on the first pass, but applying this technique to the first pass gives us a way to find (hopefully) everything we need to remove from the training set, so we can do a second training run that has specific, focused, narrow gaps in its capabilities.
And yes, I’m interested in work in this area (my AI day-job allowing).
Hey, we’ve been brainstorming ideas about better training strategies for base models and what types of experiments we can run at a small scale (e.g. training GPT-2) to get initial information. I think this idea is really promising and would love to chat about it.
It’s cool that you point to @Tomek Korbak, because I was wondering if we could extend his Pretraining Language Models with Human Preferences paper along the lines Roger mentions in his post.
Happy to chat!
This might be relatively straightforward to operationalize using (subsets of) the dataset from Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs.
Another related idea (besides / on top of e.g. delaying the learning of dangerous capabilities / prerequisites to scheming) could be to incentivize them to e.g. be retrieved in-context, rather than be learned in-weights (to the degree they’re important for performance), for (differential) transparency reasons.
Also, similarly to recent unlearning papers, it might be useful to also have a validation dataset as a proxy for which capabilities should be preserved; and potentially try (cheap) synthetic data to compensate for any capabilities losses on that one.
Yeah, I was thinking about using SAD. The main issue is that for non-AGI-lab-sized models, you’ll have a tough time eliciting SA. However, we could potentially focus on precursor capabilities and such.
If you are concerned about capabilities like SA, then you might think, “it seems highly unlikely that you can figure out which data points impact SA the most, because it will likely be a mix of many things, and each data point will play some role in accumulating to SA.” My guess is that you can break SA down into enough precursor capabilities that this approach can still be highly predictive, even if it isn’t 100% accurate.
I think forcing them to retrieve these capabilities in-context sounds good, but I’m not sure labs would want this; they’ll want to train things into the model eventually, as with many CoT things.
Agreed on having a validation set for reducing the alignment tax.
Here’s Claude-3.5 (though I had to push it a bit in the direction of explicitly considering combining SAD and Data Shapley):
“Combining the Situational Awareness Dataset (SAD) benchmark with Shapley values, particularly the In-Run Data Shapley approach described in the other paper, could yield some interesting insights. Here are some potential ways to integrate these two approaches:
Attribute situational awareness to training data: Use In-Run Data Shapley to determine which training data contributes most to performance on SAD tasks. This could help identify what types of data are most important for developing situational awareness in AI models.
Analyze task-specific contributions: Calculate Shapley values for each category or individual task within SAD. This could reveal which parts of the training data are most influential for different aspects of situational awareness.
Track situational awareness development: Apply In-Run Data Shapley at different stages of training to see how the importance of different data points for situational awareness changes over time.
Identify potential deception enablers: Look for training data with high Shapley values for both SAD performance and other capabilities that might enable deception. This could help pinpoint data that contributes to potentially risky combinations of abilities.
Curate training data: Use the Shapley values to guide the curation of training datasets, potentially removing or de-emphasizing data that contributes disproportionately to unwanted levels of situational awareness.
Comparative analysis across models: Compare Shapley values for SAD performance across different model architectures or training regimes to understand how different approaches affect the development of situational awareness.
Investigate prompt influence: Apply In-Run Data Shapley to analyze how much the “situating prompt” contributes to SAD performance compared to other parts of the input.
Correlation studies: Examine correlations between Shapley values for SAD performance and other metrics like general knowledge or reasoning abilities.
Targeted intervention experiments: Use Shapley values to identify high-impact training examples for situational awareness, then experiment with modifying or removing these examples to see how it affects model behavior.
Robustness analysis: Assess how stable the Shapley values are for SAD performance across different runs or slight variations in the training process. This could provide insights into how consistently situational awareness develops.
Transfer learning insights: If fine-tuning models on SAD-like tasks, use Shapley to understand which pre-training data contributes most to quick adaptation.
Bias detection: Look for any demographic biases in the training data that have high Shapley values for SAD performance, which could indicate skewed development of situational awareness.
By combining these approaches, researchers could gain a more nuanced understanding of how situational awareness develops in AI models and what factors contribute most to this development. This could inform strategies for developing AI systems with appropriate levels of situational awareness while mitigating risks associated with excessive or misaligned awareness.”