Here’s Claude-3.5 (though I had to push it a bit in the direction of explicitly considering combining SAD and Data Shapley): ‘Combining the Situational Awareness Dataset (SAD) benchmark with Shapley values, particularly the In-Run Data Shapley approach described in the other paper, could yield some interesting insights. Here are some potential ways to integrate these two approaches:
Attribute situational awareness to training data: Use In-Run Data Shapley to determine which training data contributes most to performance on SAD tasks. This could help identify what types of data are most important for developing situational awareness in AI models.
Analyze task-specific contributions: Calculate Shapley values for each category or individual task within SAD. This could reveal which parts of the training data are most influential for different aspects of situational awareness.
Track situational awareness development: Apply In-Run Data Shapley at different stages of training to see how the importance of different data points for situational awareness changes over time.
Identify potential deception enablers: Look for training data with high Shapley values for both SAD performance and other capabilities that might enable deception. This could help pinpoint data that contributes to potentially risky combinations of abilities.
Curate training data: Use the Shapley values to guide the curation of training datasets, potentially removing or de-emphasizing data that contributes disproportionately to unwanted levels of situational awareness.
Comparative analysis across models: Compare Shapley values for SAD performance across different model architectures or training regimes to understand how different approaches affect the development of situational awareness.
Investigate prompt influence: Apply In-Run Data Shapley to analyze how much the “situating prompt” contributes to SAD performance compared to other parts of the input.
Correlation studies: Examine correlations between Shapley values for SAD performance and other metrics like general knowledge or reasoning abilities.
Targeted intervention experiments: Use Shapley values to identify high-impact training examples for situational awareness, then experiment with modifying or removing these examples to see how it affects model behavior.
Robustness analysis: Assess how stable the Shapley values are for SAD performance across different runs or slight variations in the training process. This could provide insights into how consistently situational awareness develops.
Transfer learning insights: If fine-tuning models on SAD-like tasks, use Shapley to understand which pre-training data contributes most to quick adaptation.
Bias detection: Look for any demographic biases in the training data that have high Shapley values for SAD performance, which could indicate skewed development of situational awareness.
By combining these approaches, researchers could gain a more nuanced understanding of how situational awareness develops in AI models and what factors contribute most to this development. This could inform strategies for developing AI systems with appropriate levels of situational awareness while mitigating risks associated with excessive or misaligned awareness.’
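To make the first idea above concrete: the core of In-Run Data Shapley is a first-order approximation where each training example's contribution at a step is the dot product of its gradient with the gradient of a validation (here, SAD-like) loss, accumulated over the run. The following is a toy sketch of that accumulation on a linear regression stand-in, not the paper's implementation; the data, the "benchmark" split, and all variable names are illustrative.

```python
# Toy sketch of first-order In-Run Data Shapley attribution.
# A held-out split (Xv, yv) stands in for a SAD-style benchmark;
# the real method applies the same gradient dot-product idea
# inside an LLM training run.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# held-out "benchmark" set (stand-in for SAD tasks)
Xv = rng.normal(size=(8, d))
yv = Xv @ w_true + 0.1 * rng.normal(size=8)

w = np.zeros(d)
lr = 0.05
scores = np.zeros(n)  # accumulated per-example contribution estimates
init_val_loss = np.mean((Xv @ w - yv) ** 2)

for step in range(200):
    # per-example gradients of squared loss: g_i = 2 * (x_i.w - y_i) * x_i
    resid = X @ w - y
    per_example_grads = 2 * resid[:, None] * X           # shape (n, d)
    g_val = 2 * (Xv @ w - yv) @ Xv / len(yv)             # mean benchmark gradient

    # first-order approximation: example i's marginal effect on the
    # benchmark loss at this step is lr * <g_i, g_val>
    scores += lr * per_example_grads @ g_val

    # the actual SGD update uses the mean training gradient
    w -= lr * per_example_grads.mean(axis=0)

final_val_loss = np.mean((Xv @ w - yv) ** 2)
# examples with the largest scores contributed most to reducing benchmark loss
top_examples = np.argsort(scores)[::-1][:5]
```

Ranking `scores` then directly supports several of the listed ideas: curating or down-weighting high-score data, tracking how scores shift over training, or intervening on the top-ranked examples and re-measuring benchmark behavior.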