You’re mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness.
For misuse, xAI has benchmarks and thresholds—or rather examples of benchmarks thresholds to appear in the real future framework—and based on the right column they seem very reasonably low.
Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So it seems the primary concern is probably not the thresholds are too high but rather xAI’s mitigations won’t be robust to jailbreaks and xAI won’t elicit performance on post-mitigation models well. E.g. it would be inadequate to just run a benchmark with a refusal-trained model, note that it almost always refuses, and call it a success. You need something like: a capable red-team tries to break the mitigations and use the model for harm, and either the red-team fails or it’s so costly that the model doesn’t make doing harm cheaper.
(For “Loss of Control,” one of the two cited benchmarks was published today—I’m dubious that it measures what we care about but I’ve only spent ~3 mins engaging—and one has not yet been published. [Edit: and, like, on priors, I’m very skeptical of alignment evals/metrics, given the prospect of deceptive alignment, how we care about worst-case in addition to average-case behavior, etc.])
xAI’s thresholds are entirely concrete and not extremely high.
They are specified and as high-quality as you can get. (If there are better datasets let me know.)
I’m not saying it’s perfect, but I wouldn’t but them all in the same bucket. Meta’s is very different from DeepMind’s or xAI’s.
xAI Risk Management Framework (Draft)
You’re mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness.
For misuse, xAI has benchmarks and thresholds—or rather examples of benchmarks thresholds to appear in the real future framework—and based on the right column they seem very reasonably low.
Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So it seems the primary concern is probably not the thresholds are too high but rather xAI’s mitigations won’t be robust to jailbreaks and xAI won’t elicit performance on post-mitigation models well. E.g. it would be inadequate to just run a benchmark with a refusal-trained model, note that it almost always refuses, and call it a success. You need something like: a capable red-team tries to break the mitigations and use the model for harm, and either the red-team fails or it’s so costly that the model doesn’t make doing harm cheaper.
(For “Loss of Control,” one of the two cited benchmarks was published today—I’m dubious that it measures what we care about but I’ve only spent ~3 mins engaging—and one has not yet been published. [Edit: and, like, on priors, I’m very skeptical of alignment evals/metrics, given the prospect of deceptive alignment, how we care about worst-case in addition to average-case behavior, etc.])