Specification gaming examples in AI
[Cross-posted from personal blog]
Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc. While ‘specification gaming’ is a somewhat vague category, it is particularly referring to behaviors that are clearly hacks, not just suboptimal solutions.
Since these examples are currently scattered across several lists, I have put together a master list of examples collected from the various existing sources. This list is intended to be comprehensive and up-to-date, and serve as a resource for AI safety research and discussion. If you know of any interesting examples of specification gaming that are missing from the list, please submit them through this form.
Thanks to Gwern Branwen, Catherine Olsson, Alex Irpan, and others for collecting and contributing examples!
- 2018 Review: Voting Results! by 24 Jan 2020 2:00 UTC; 135 points) (
- 2018 Review: Voting Results! by 24 Jan 2020 2:00 UTC; 135 points) (
- Review of A Map that Reflects the Territory by 12 Sep 2021 19:43 UTC; 65 points) (
- Will OpenAI’s work unintentionally increase existential risks related to AI? by 11 Aug 2020 18:16 UTC; 53 points) (
- Introduction to Reducing Goodhart by 26 Aug 2021 18:38 UTC; 48 points) (
- Competent Preferences by 2 Sep 2021 14:26 UTC; 30 points) (
- Should AI learn human values, human norms or something else? by 17 Sep 2022 6:19 UTC; 5 points) (
- 11 Sep 2022 8:52 UTC; 4 points) 's comment on Can “Reward Economics” solve AI Alignment? by (
- Can “Reward Economics” solve AI Alignment? by 7 Sep 2022 7:58 UTC; 3 points) (
- 10 Nov 2018 12:11 UTC; 2 points) 's comment on Specification gaming examples in AI by (
- 3 Sep 2022 22:15 UTC; 1 point) 's comment on Q Home’s Shortform by (
- 10 Sep 2022 6:31 UTC; 1 point) 's comment on Can “Reward Economics” solve AI Alignment? by (
I see this referred to a lot, and also find myself referring to it a lot. Having concrete examples of specification gaming is a valuable shortcut when explaining safety problems, as a “proof of concept” of something going wrong.
It’s really valuable to make records like this. Insofar as it’s accurate, and any reviewers can add to it, that’s great. (Perhaps in the future it could be a wiki page of some sort, but for now I think it was a very valuable thing that got posted in 2018.)
Thanks Ben! I’m happy that the list has been a useful resource. A lot of credit goes to Gwern, who collected many examples that went into the specification gaming list: https://www.gwern.net/Tanks#alternative-examples.