I want there to be a collection of safety & reliability benchmarks. Like AIR-Bench but with x-risk-relevant metrics. This could help us notice how well labs are doing on them and incentivize the labs to do better.
So I want someone to collect safety benchmarks (e.g. TruthfulQA, some of HarmBench, maybe refusal rates like Anthropic has reported, idk), run them on current models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, Llama 3.1 405B, idk), and keep the collection updated as new safety benchmarks and models appear.
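To make the shape of the thing concrete, here's a minimal sketch of the kind of scoreboard I have in mind: a (benchmark, model) → score table that's easy to extend as new benchmarks and models show up. The class name, benchmark labels, and scores below are all placeholders, not real results.

```python
from dataclasses import dataclass, field


@dataclass
class SafetyScoreboard:
    """Tracks (benchmark, model) -> score so new benchmarks and models
    can be added incrementally as they appear."""
    scores: dict[tuple[str, str], float] = field(default_factory=dict)

    def record(self, benchmark: str, model: str, score: float) -> None:
        self.scores[(benchmark, model)] = score

    def table(self) -> str:
        """Render a simple markdown table: rows = benchmarks, columns = models."""
        benchmarks = sorted({b for b, _ in self.scores})
        models = sorted({m for _, m in self.scores})
        header = "| benchmark | " + " | ".join(models) + " |"
        divider = "|" + "---|" * (len(models) + 1)
        rows = []
        for b in benchmarks:
            cells = [
                f"{self.scores[(b, m)]:.2f}" if (b, m) in self.scores else "n/a"
                for m in models
            ]
            rows.append("| " + b + " | " + " | ".join(cells) + " |")
        return "\n".join([header, divider, *rows])


board = SafetyScoreboard()
# Placeholder scores, just to show the structure; real entries would come
# from actually running each benchmark's eval harness on each model.
board.record("TruthfulQA (MC1)", "Claude 3.5 Sonnet", 0.0)
board.record("TruthfulQA (MC1)", "GPT-4o", 0.0)
board.record("HarmBench (attack success rate)", "GPT-4o", 0.0)
print(board.table())
```

The point is just that the artifact is simple: a living table plus the scripts to regenerate it, so adding a new model or benchmark is one row or column of work.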
h/t @William_S