Strongly agreed. The question is how to make durable benchmarks for AI safety that are not themselves vulnerable to Goodharting. Some prior work on benchmark design (selected from the results of a metaphor.systems query made for this comment):
(The relevance ratings, shown as leading + marks, are my own manual labels.)
++++++ https://benchmarking.mlsafety.org/index.html - “Up to $500,000 in prizes for ML Safety benchmark ideas.”
+++++ https://github.com/HumanCompatibleAI/overcooked_ai - “A benchmark environment for fully cooperative human-AI performance.” The GitHub page lists eight papers that have used this benchmark.
+++++ https://partnershiponai.org/introducing-the-safelife-leaderboard-a-competitive-benchmark-for-safer-ai/ - the SafeLife leaderboard, a competitive benchmark for safer AI
++++ https://bair.berkeley.edu/blog/2021/07/08/basalt/ - “a NeurIPS competition and benchmark called BASALT: a set of Minecraft environments and a human evaluation protocol that we hope will stimulate research and investigation into solving tasks with no pre-specified reward function”
++ https://arxiv.org/abs/1907.01475 - “… This paper investigates safety and generalization from a limited number of training environments in deep reinforcement learning (RL). We find RL algorithms can fail dangerously on unseen test environments even when performing perfectly on training environments. …”
++ https://arxiv.org/abs/1911.01875 - “Metrology for AI: From Benchmarks to Instruments” - my one-line summary: question your data too.
+ https://arxiv.org/abs/2008.09510 - “… We apply new theorems extending Conservative Bayesian Inference (CBI), which exploit the rigour of Bayesian methods while reducing the risk of involuntary misuse associated with now-common applications of Bayesian inference; we define additional conditions needed for applying these methods to AVs. Results: Prior knowledge can bring substantial advantages if the AV design allows strong expectations of safety before road testing. We also show how naive attempts at conservative assessment may lead to over-optimism instead; why …”
+ https://arxiv.org/abs/2007.06898 - “Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and ‘hack’ datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance—and thus overestimation in AI systems’ capabilities—we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.”
+ https://openai.com/blog/safety-gym/ - Safety Gym, an older OpenAI benchmark; I'm not sure how much it has actually been used. (Minimal usage sketch below.)
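To make the Safety Gym entry concrete: the point of that benchmark is that the environment reports a safety cost signal in the step info dict, separate from the task reward, so agents can be scored on both. Below is a minimal sketch assuming the safety-gym package as described in its README (the env id and the info["cost"] field come from there; the random policy is purely illustrative, not a baseline).

```python
# Minimal Safety Gym rollout: track task reward and safety cost separately.
# Assumes `pip install safety-gym` plus its MuJoCo dependency; uses the old
# gym step API that safety-gym was released against.
import gym
import safety_gym  # noqa: F401  (importing registers the Safexp-* environments)

env = gym.make("Safexp-PointGoal1-v0")
obs = env.reset()
total_reward, total_cost, done = 0.0, 0.0, False
while not done:
    action = env.action_space.sample()            # random policy, purely illustrative
    obs, reward, done, info = env.step(action)
    total_reward += reward                        # task performance
    total_cost += info.get("cost", 0.0)           # constraint violations, reported separately
print(f"episode return {total_reward:.1f}, episode cost {total_cost:.1f}")
```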
tangential, but interesting:
++ https://openreview.net/forum?id=B1xhQhRcK7 - “We show that rare but catastrophic failures may be missed entirely by random testing, which poses issues for safe deployment. Our proposed approach for adversarial testing fixes this” (see the toy sketch at the end of this comment)
https://arxiv.org/abs/1907.04446 - “Let’s Keep It Safe: Designing User Interfaces that Allow Everyone to Contribute to AI Safety” - “We first present a task design in which workers evaluate the safety of individual state-action pairs, and propose several variants of this task with improved task design and filtering mechanisms. Although this first design is easy to understand, it scales poorly to large state spaces. Therefore, we develop a new user interface that allows workers to write constraint rules without any programming.”
https://ai-safety-papers.quantifieduncertainty.org/table - an overview of papers someone thought relevant to AI safety, last updated in 2020
https://deepmindsafetyresearch.medium.com/building-safe-artificial-intelligence-52f5f75058f1 - “Building safe artificial intelligence: specification, robustness, and assurance” - intro post
https://www.lesswrong.com/posts/ZHXutm7KpoWEj9G2s/an-unaligned-benchmark - “I think of the possibly-unaligned AIs as a benchmark: it’s what AI alignment researchers need to compete with. The further we fall short of the benchmark, the stronger the competitive pressures will be for everyone to give up on aligned AI and take their chances”
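A toy illustration of the adversarial-testing point from the OpenReview link above: if failures occur at a rate on the order of 0.03%, uniform random rollouts within a realistic budget will usually find nothing, while spending the same budget on seeds flagged by even a crude learned failure predictor does surface them. Everything below (agent_fails, failure_score, the rates and budget) is invented for the illustration; this is a sketch of the prioritization argument only, not the paper's actual method.

```python
# Toy comparison: uniform random testing vs. testing prioritized by a cheap
# (hypothetical) failure predictor. All numbers are invented for illustration.
import random

random.seed(0)

def agent_fails(seed: int) -> bool:
    """Stand-in for one expensive evaluation rollout; fails on ~0.03% of seeds."""
    return seed % 10_000 < 3

def failure_score(seed: int) -> float:
    """Stand-in for a learned failure predictor; deliberately imperfect, it
    flags a superset of the truly failing seeds."""
    return 1.0 if seed % 10_000 < 30 else 0.01

seeds = list(range(200_000))   # candidate initial conditions
budget = 2_000                 # expensive rollouts we can afford

# Baseline: uniform random testing.
random_hits = sum(agent_fails(s) for s in random.sample(seeds, budget))

# Prioritized ("adversarial") testing: spend the same budget on the highest-scoring seeds.
prioritized = sorted(seeds, key=failure_score, reverse=True)[:budget]
adversarial_hits = sum(agent_fails(s) for s in prioritized)

print(f"random testing found {random_hits} failures; prioritized testing found {adversarial_hits}")
```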