Quite. “Wirehead” is shorthand for measurement-proxy divergence, i.e. Goodhart’s law: doing something for a measurement or reward rather than to achieve a real goal.
Could we say that wireheading is directly accessing one’s own reward function via self-modification and setting it to its maximum level, which makes the function insensitive to any changes in the outside world? I think such a definition is stronger than just Goodharting.
Maybe we can define wireheading as a subset of goodharting, in a way similar to what you’re defining.
However, we need the extra assumption that setting the reward to its maximum level is not what we actually desire; the reward function is part of the world, just as the AI is.
Yes, that is what I meant.
We could say whatever we like. Stuart’s main point is in the first line: it’s not a natural category.
I’d argue that your wording is a fine example of wireheading, but not a definition. There are many other behaviors that I’d also categorize as wireheading. The original usage (Larry Niven around 1970, as far as I can tell) wasn’t about self-modification or changing reward functions; it was direct brain stimulation as an addictive pleasure.