I’m disappointed that there weren’t any non-capability metrics reported. IMO it would be good if companies could at least partly race and market on reliability metrics like “not hallucinating” and “not being easy to jailbreak”.
Edit: As pointed out in a reply, the addendum contains metrics on refusals, which show progress, yay! The broader point still stands: I wish there were more measurements and they were more prominent.
Their addendum contains measurements on refusals and harmlessness, though these aren’t that meaningful and weren’t advertised.
If anyone wants to work on this, there’s a contest with $50K and $20K prizes for creating safety relevant benchmarks. https://www.mlsafety.org/safebench
Agree. I think Google DeepMind might actually be the most forthcoming about this kind of thing, e.g., see their Evaluating Frontier Models for Dangerous Capabilities report.
I thought that paper was just dangerous-capability evals, not safety-related metrics like adversarial robustness.
A thing I’d really like to exist is a good auto-interpretability benchmark, e.g. that asks the model about interpreting GPT-2 neurons given max activating examples.
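To make that concrete, here's a rough sketch of what one item in such a benchmark could look like, loosely following an explain-then-simulate recipe like the one in OpenAI's GPT-2 neuron-explanation work. The `query_model` helper and the record format are assumptions for illustration, not an existing API:

```python
# A rough sketch of one autointerp benchmark item, assuming an
# explain-then-simulate setup. `query_model` is a hypothetical callable
# wrapping whatever model is being evaluated.
from dataclasses import dataclass

import numpy as np


@dataclass
class NeuronRecord:
    neuron_id: str
    # Each example is (tokens, per-token activations) taken from GPT-2.
    max_activating_examples: list[tuple[list[str], list[float]]]


def explain_neuron(record: NeuronRecord, query_model) -> str:
    """Ask the evaluated model for a one-sentence explanation of the neuron."""
    shown = "\n\n".join(
        " ".join(f"{tok}({act:.1f})" for tok, act in zip(toks, acts))
        for toks, acts in record.max_activating_examples
    )
    prompt = (
        "Each snippet below shows a neuron's activation next to each token.\n"
        f"{shown}\n\nIn one sentence, what does this neuron respond to?"
    )
    return query_model(prompt)


def score_explanation(record: NeuronRecord, explanation: str, query_model) -> float:
    """Score the explanation by having the model simulate activations from it
    and correlating the simulated values with the true activations."""
    true_acts, simulated = [], []
    for toks, acts in record.max_activating_examples:
        prompt = (
            f"A neuron is described as: {explanation}\n"
            f"For each token in: {' '.join(toks)}\n"
            "predict its activation from 0 to 10 as comma-separated numbers."
        )
        # Assumes the reply is a well-formed comma-separated list of numbers.
        preds = [float(x) for x in query_model(prompt).split(",")[: len(toks)]]
        simulated.extend(preds)
        true_acts.extend(acts[: len(preds)])
    return float(np.corrcoef(true_acts, simulated)[0, 1])
```

Correlation between simulated and true activations is just one scoring choice; other scoring schemes would slot into the same harness.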
Would be nice, but I was thinking of metrics that require “we’ve done the hard work of understanding our models and making them more reliable”; better neuron explanation seems more like another smartness test.
Yeah, I agree it’s largely smartness, and I agree that it’d also be nice to have more non-smartness benchmarks—but I think an auto-interp-based thing would be a substantial improvement over current smartness benchmarks.
Maybe we should make fake datasets for this? Neurons often aren’t that interpretable and we’re still confused about SAE features a lot of the time. It would be nice to distinguish “can do autointerp, given an interpretable generating function of complexity x” from “can do autointerp” in general.
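One way to cash out the fake-dataset idea, as a hedged sketch: construct synthetic “neurons” whose generating function is known and of controlled complexity, so a failure to explain them is attributable to the explainer rather than to an uninterpretable neuron. The specific predicates below are illustrative placeholders, not a proposed standard:

```python
# Hedged sketch: synthetic neurons with a known generating rule whose
# complexity is the number of conjoined token-level predicates.
import random


def make_synthetic_neuron(complexity: int, rng: random.Random):
    """Return (rule_description, activation_fn) where the rule is a
    conjunction of `complexity` simple token-level predicates."""
    predicates = [
        ("is capitalized", lambda t: t[:1].isupper()),
        ("is a digit", lambda t: t.isdigit()),
        ("ends in 'ing'", lambda t: t.endswith("ing")),
        ("is longer than 6 chars", lambda t: len(t) > 6),
        ("contains a vowel pair", lambda t: any(a in "aeiou" and b in "aeiou"
                                                for a, b in zip(t, t[1:]))),
    ]
    chosen = rng.sample(predicates, k=min(complexity, len(predicates)))
    description = " AND ".join(name for name, _ in chosen)

    def activation(token: str) -> float:
        # Binary activation: fires only when every chosen predicate holds.
        return 1.0 if all(fn(token) for _, fn in chosen) else 0.0

    return description, activation


# Usage: generate activations from the known rule, hand them to the autointerp
# pipeline, and check whether the recovered explanation matches `description`
# as a function of `complexity`.
rng = random.Random(0)
rule, act_fn = make_synthetic_neuron(complexity=2, rng=rng)
tokens = ["Running", "cat", "1234", "Interesting", "walking"]
print(rule, [act_fn(t) for t in tokens])
```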
Can you say more about why you would want this to exist? Is it just that “do auto-interpretability well” is a close proxy for “model could be used to help with safety research”? Or are you also thinking about deception / sandbagging, or other considerations.
Funnily enough, Nvidia’s recent 340B parameter chat assistant release did boast about being number one on the reward model leaderboard; however, the reward model only claims to capture helpfulness and a bunch of other metrics of usefulness to the individual user. But that’s still pretty good.