I think it’s interesting to see how much improvement on different types of safety benchmarks correlates with advancement in model capabilities. I also agree that designing decorrelated benchmarks is important, simply because decorrelation indicates they won’t be saturated as easily. However, I have some doubts regarding the methodology and would appreciate clarifications if I misinterpreted something:
Using model-performance-based correlations: If I’m not mistaken, the correlation between capability and safety is measured using the performance of various models on benchmarks. This seems more like a measure of how AI progress has unfolded in the past than a statement about the benchmark itself. It’s quite possible that more capable models also have more safety interventions (since they came out later, when presumably more safety research was available), and that’s why there is a correlation. On the flip side, if future model releases apply weapon-risk-reduction techniques like unlearning, then those benchmarks will also start showing a positive correlation. Thus, I’m not sure this methodology provides robust insights for judging benchmarks. Further, it can also be gamed (artificially lowering the correlation) by strategically picking more models with safety interventions applied.
Projection on principal component: Why is comparing the projection on the first principal component of the Capability/Safety_benchmark x Model matrix preferable to just comparing average accuracies across capabilities and safety benchmarks?
I’ll begin by saying more about our approach for measuring “capabilities scores” for a set of models (given their scores on a set of b standard capabilities benchmarks). We’ll assume that all benchmark scores have been normalized. We need some way of converting each model’s b benchmark scores into one capabilities score per model. Averaging involves taking a weighted combination of the benchmarks, where the weights are equal. Our method is similarly a weighted combination of the benchmarks, but where the weights are higher for benchmarks that better discriminate model performance. This can be done by choosing the weights according to the first principal component of the benchmark scores matrix. I think this is a more principled way to choose the weights, but I also suspect an analysis involving averaged benchmark scores as the “capabilities score” would produce very similar qualitative results.
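For concreteness, here is a minimal sketch (in Python, with entirely made-up scores; the matrix shape and values are illustrative, not the paper’s actual data) of the two weighting schemes being compared, equal-weight averaging versus weights taken from the first principal component:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical example: scores[i, j] is model i's normalized score on
# capabilities benchmark j (an n_models x b matrix). All values are made up.
rng = np.random.default_rng(0)
n_models, b = 20, 5
scores = rng.random((n_models, b))

# Equal weights: the capabilities score is just the average benchmark score.
avg_capabilities = scores.mean(axis=1)

# PCA weights: project onto the first principal component of the (centered)
# benchmark-scores matrix, so benchmarks that better discriminate between
# models receive larger weights.
# Note: the sign of a principal component is arbitrary; in practice one would
# flip it so that higher projections correspond to higher benchmark scores.
pca = PCA(n_components=1)
pc1_capabilities = pca.fit_transform(scores).ravel()
benchmark_weights = pca.components_[0]

# When benchmark scores strongly co-vary, the two capabilities scores are
# typically highly correlated.
print(np.corrcoef(avg_capabilities, pc1_capabilities)[0, 1])
```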
To your first point, I agree that correlations should not be interpreted as a statement of an intrinsic or causal connection between a model’s general capabilities and a particular safety benchmark (or the underlying safety property it operationalizes). After all, a safety benchmark’s correlations depend both on the choice of capabilities benchmarks used to produce capabilities scores and on the set of models. We don’t claim that correlations observed for one set of models will persist for more capable models, or for models which incorporate new safety techniques.
Instead, I see correlations as a practical starting point for answering the question “Which safety issues will persist with scale?” This is ultimately a key question that safety researchers and funders need to answer when allocating their resources toward differential safety progress. I believe that an empirical “science of benchmarking” is necessary to improve this effort, and correlations are a good first step.
How do our recommendations account for the limitations of capabilities correlations? From the Discussion section:
As new training techniques (e.g., base, chat fine-tuning, refusal training, adversarial training, circuit breakers) are integrated into new models, the relevance and adequacy of existing safety benchmarks—and their entanglement with capabilities benchmarks—should also be regularly reassessed.
[This figure was cut from the paper, but I think it’s helpful here.]
An earlier draft of the paper placed more emphasis on the idea of a model class, referring to the mix of techniques used to produce a model (RLHF, adversarial training, unlearning, etc.). Different model classes may have different profiles of entanglement between safety properties and capabilities. For example, RLHF might be expected to entangle truthfulness and reduced toxicity with model capabilities, and indeed we see different correlations for base and chat models on various alignment, truthfulness, and toxicity evals (although the difference was smaller and less consistent than we expected). I think our methodology makes the most sense when correlations are calculated using only models from the same class (or even models from the same family). Future safety interventions can and should attempt to entangle a wider range of safety properties with capabilities; ideally, we’ll identify training recipes which entangle all safety properties with capabilities.
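To make the per-class suggestion concrete, here is a minimal sketch (again with made-up labels and scores) of computing a safety benchmark’s capabilities correlation separately for each model class; I use Spearman correlation purely for illustration:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical example: compute a safety benchmark's capabilities correlation
# separately for each model class. All labels and scores are made up.
rng = np.random.default_rng(1)
model_class = np.array(["base"] * 10 + ["chat"] * 10)
capabilities_score = rng.random(20)   # e.g., PC1 projection per model
safety_score = rng.random(20)         # normalized score on one safety benchmark

for cls in np.unique(model_class):
    mask = model_class == cls
    rho, _ = spearmanr(capabilities_score[mask], safety_score[mask])
    print(f"{cls}: Spearman correlation = {rho:.2f}")
```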
Thanks, the rationale for using PCA was quite interesting. I also quite like the idea of separating different model classes for this evaluation.