Planned summary:
This post points out that “wireheading” is a fuzzy category. Consider a weather-controlling AI tasked with increasing atmospheric pressure, as measured by the world’s barometers. If it made a tiny dome around each barometer and increased the air pressure within the domes, we would call it wireheading. However, if we keep enlarging the domes until there is a single dome around the entire Earth, it starts to sound like a perfectly reasonable way to optimize the reward function. Somewhere in the middle, it must have become unclear whether or not it was wireheading. The post suggests that wireheading can be defined as a subset of <@specification gaming@>(@Specification gaming examples in AI@), where the “gaming” happens by focusing on some narrow measurement channel, and the fuzziness comes from what counts as a “narrow measurement channel”.
Planned opinion:
You may have noticed that this newsletter doesn’t talk about wireheading very much; this is one of the reasons why. Wireheading seems to be a fuzzy subset of specification gaming, and is not particularly likely to be the only kind of specification gaming that could lead to catastrophe. I’d be surprised if we found a solution about which we could say “this solves all of wireheading, but it doesn’t solve specification gaming”: wireheading doesn’t seem to have distinguishing features that would let us solve it without also solving specification gaming more broadly. There can of course be solutions to particular kinds of wireheading that _do_ have clear distinguishing features, such as <@reward tampering@>(@Designing agent incentives to avoid reward tampering@), but I don’t usually expect these to be the major sources of AI risk.