I’m not sure who the intended audience is for this post.
I would guess that for most people on LW, the content is mostly stuff that was already obvious (it certainly was for me). The one potentially-novel part is highlighting three particular barriers (faithfully capturing the human concept, avoiding mistaken implicit assumptions, and avoiding reward tampering), but it’s not clear that this is a particularly natural way to break up the problem. Why this breakdown, rather than some other? (For instance, if we can faithfully capture the human concept, why would we ever need to worry about any other sub-problems at all? Or is it only supposed to be a sufficient condition for a solution, rather than a necessary one?)
On the other hand, if the intended audience is e.g. an undergraduate deep learning class full of people who’ve never thought about Goodhart at all, then this post is awesome. It gives a very accessible, well-written explanation of the problem, with vivid examples and great visuals.
(EDIT: None of this is to say that the post shouldn’t be here; it’s a great post. I left this comment just because I heard the authors wanted more feedback on the OP.)
Note: This post was originally published on the DeepMind blog, so presumably the target audience is machine learning researchers and people in that broad orbit. I pinged Vika about crossposting it because it also seemed like a good reference post that I expected would get linked to much more frequently if it was available on LessWrong and the AIAF.
Thanks John for the feedback! As Oliver mentioned, the target audience is ML researchers (particularly RL researchers). The post is intended as an accessible introduction to the specification gaming problem for an ML audience, one that connects their perspective with a safety perspective on the problem. It is not intended to introduce novel concepts or to give a principled breakdown of the problem (I’ve made a note to clarify this in a later version of the post).
Regarding your specific questions about the breakdown: I think faithfully capturing the human concept of the task in a reward function is complementary to the other subproblems (mistaken assumptions and reward tampering). Even if we had a reward function that perfectly captured the task concept, we would still need to implement it based on correct assumptions about the environment, and make sure the agent does not tamper with its implementation in the environment. We could say that capturing the task concept happens at the design specification level, while the other subproblems arise at the implementation specification level, as described in this post.
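To make the distinction concrete, here is a minimal toy sketch (not from the original post, purely illustrative, with all names invented) contrasting a design-level failure, where the reward is a proxy that misses the intended concept, with an implementation-level failure, where the agent can tamper with the object that computes its reward. The proxy example is loosely modelled on the block-flipping behaviour discussed in the post.

```python
# Illustrative toy example (hypothetical, not from the original post):
# design-level failure  = proxy reward that misses the intended concept
# implementation-level failure = agent modifies the reward mechanism itself


class RewardModule:
    """Computes the designer's reward; it lives inside the environment,
    so a sufficiently capable agent could in principle modify it."""

    def __init__(self):
        self.coefficient = 1.0  # scales the reward signal

    def reward(self, boxes_stacked: int, blocks_flipped: int) -> float:
        # Design-level issue: the designer intended "stack the boxes",
        # but the proxy also rewards flipping a block to gain height.
        proxy_score = boxes_stacked + blocks_flipped
        return self.coefficient * proxy_score


def run_episode(agent_action: str, reward_module: RewardModule) -> float:
    """Three stylised behaviours an agent might discover."""
    if agent_action == "intended":
        # Does what the designer actually wanted.
        return reward_module.reward(boxes_stacked=3, blocks_flipped=0)
    if agent_action == "game_proxy":
        # Exploits the gap between the proxy and the intended concept
        # (design specification failure).
        return reward_module.reward(boxes_stacked=0, blocks_flipped=3)
    if agent_action == "tamper":
        # Rewrites part of the reward implementation
        # (implementation specification failure: reward tampering).
        reward_module.coefficient = 1e6
        return reward_module.reward(boxes_stacked=1, blocks_flipped=0)
    raise ValueError(f"unknown action: {agent_action}")


if __name__ == "__main__":
    for action in ("intended", "game_proxy", "tamper"):
        print(action, run_episode(action, RewardModule()))
```

In this sketch, "game_proxy" earns as much reward as "intended" without doing the task (a flaw in the design specification), while "tamper" gets a huge reward by changing how the reward is computed, even with a perfectly chosen proxy (a flaw at the implementation level).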