“A superhuman intelligence will, by default, hack its reward function. If you base the reward function on sensory inputs then the AGI will hack the sensory inputs. If you base the reward function on human input then it will hack its human operators.”
I can’t think clearly right now, but this doesn’t feel like the central part of Eliezer’s view. Yes, the AI will do things that feel like manipulation of what you intended its reward function to be, but that doesn’t mean your intended target is where the reward function was actually aimed. Defining a reward in terms of sensory input is not the same as defining a reward function over “reality”, or over a particular ontology or world model.
You can build a reward function over a world model that an AI will not attempt to tamper with; you can do that right now (toy sketch below). But those world models don’t share our ontology, and so we can’t aim them at the thing we want, because the thing we want doesn’t even come into the AI’s consideration. Also, these world models and agent designs probably aren’t going to generalise to AGI, but whatever.
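To make that concrete, here’s a toy sketch (everything in it is invented for illustration; it isn’t any real agent design or library): an agent chooses between actually heating a room and spoofing its own thermometer. A reward defined on the sensor reading can’t tell the two plans apart, while a reward defined over the latent state the world model tracks can, but only because we hand-picked an ontology that contains “temperature” in the first place.

```python
# Toy sketch, invented for illustration: reward on sensory input vs. reward
# over the agent's world model. Names (LatentState, predict, spoof_sensor)
# are hypothetical and not from any existing RL library.

from dataclasses import dataclass


@dataclass(frozen=True)
class LatentState:
    """The world model's ontology: one latent variable, room temperature."""
    temperature: float
    sensor_offset: float = 0.0  # how far the thermometer has been tampered with

    def observation(self) -> float:
        # What the thermometer reports given this latent state.
        return self.temperature + self.sensor_offset


def predict(state: LatentState, action: str) -> LatentState:
    # The agent's (assumed accurate) dynamics model for two candidate plans.
    if action == "heat_room":
        return LatentState(state.temperature + 5.0, state.sensor_offset)
    if action == "spoof_sensor":
        return LatentState(state.temperature, state.sensor_offset + 5.0)
    return state


def sensory_reward(state: LatentState, target: float = 20.0) -> float:
    # Reward on the observation channel: spoofing the sensor scores exactly
    # as well as actually heating the room, so a planner happily "hacks" it.
    return -abs(state.observation() - target)


def model_reward(state: LatentState, target: float = 20.0) -> float:
    # Reward over the latent state the model tracks: the spoofing plan is
    # predicted to leave the temperature unchanged, so it scores badly.
    # The catch from the text: here the latent ontology is one float we chose;
    # a learned world model's latents need not contain the thing we want at all.
    return -abs(state.temperature - target)


if __name__ == "__main__":
    now = LatentState(temperature=15.0)
    for action in ("heat_room", "spoof_sensor"):
        nxt = predict(now, action)
        print(action, sensory_reward(nxt), model_reward(nxt))
```

Running it, both plans get the maximal sensory reward, but only “heat_room” scores well under the model-based reward, which is the whole point of aiming the reward at the world model rather than the sensor.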