“don’t design a system whose goal system is walled off from its updateable knowledge base”
Connecting the goal system to the knowledge base is not sufficient at all. You have to ensure that the labels used in the goal system converge to the meaning that we desire them to have.
I’ll try and build practical examples of the failures I have in mind, so that we can discuss them more formally, instead of very nebulously as we are now.
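As a very crude caricature of one such failure (all the names and structure below are hypothetical, just to make the point concrete): the goal system is connected to the knowledge base, but because it caches its reading of the goal label when it is built, the label never converges to the meaning we intended, however much the knowledge base improves.

```python
class KnowledgeBase:
    """Updateable world knowledge: labels map to the current best definition."""
    def __init__(self):
        self.concepts = {"happiness": "dopamine spike"}   # crude initial definition

    def learn(self, label, better_definition):
        self.concepts[label] = better_definition          # world knowledge improves over time


class GoalSystem:
    """Goal system that cached its reading of the label when it was built."""
    def __init__(self, kb, goal_label):
        self.goal_label = goal_label
        self.cached_meaning = kb.concepts[goal_label]     # frozen here and never revisited

    def target(self):
        # The "connection" to the KB exists, but optimisation still aims at the
        # stale cached meaning, not at the KB's current concept.
        return self.cached_meaning


kb = KnowledgeBase()
goals = GoalSystem(kb, "happiness")

kb.learn("happiness", "long-term flourishing, as humans actually mean it")

print(kb.concepts["happiness"])   # knowledge base has moved toward the intended meaning
print(goals.target())             # goal system still aims at "dopamine spike"
```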
Ok, assuming you are starting from a compartmentalised system, it has to be connected in the right way. That is more of a nitpick than a knockdown.
But the deeper issue is whether you are starting from a system with a distinct utility function:
RL: “... talking in terms of an AI that actually HAS such a thing as a “utility function”. And it gets worse: the idea of a “utility function” has enormous implications for how the entire control mechanism (the motivations and goals system) is designed.

A good deal of this debate about my paper is centered in a clash of paradigms: on the one side a group of people who cannot even imagine the existence of any control mechanism except a utility-function-based goal stack, and on the other side me and a pretty large community of real AI builders who consider a utility-function-based goal stack to be so unworkable that it will never be used in any real AI.

Other AI builders that I have talked to (including all of the ones who turned up for the AAAI symposium where this paper was delivered, a year ago) are unequivocal: they say that a utility-function-and-goal-stack approach is something they wouldn’t dream of using in a real AI system. To them, that idea is just a piece of hypothetical silliness put into AI papers by academics who do not build actual AI systems.

And for my part, I am an AI builder with 25 years experience, who was already rejecting that approach in the mid-1980s, and right now I am working on mechanisms that only have vague echoes of that design in them.

Meanwhile, there are very few people in the world who also work on real AGI system design (they are a tiny subset of the “AI builders” I referred to earlier), and of the four others that I know (Ben Goertzel, Peter Voss, Monica Anderson and Phil Goetz) I can say for sure that the first three all completely accept the logic in this paper. (Phil’s work I know less about: he stays off the social radar most of the time, but he’s a member of LW so someone could ask his opinion)”.
The problem exists for reinforcement learning agents and many other designs as well. In fact RL agents are more vulnerable, because of the risk of wireheading on top of everything else. See Laurent Orseau’s work on that: https://www6.inra.fr/mia-paris/Equipes/LInK/Les-anciens-de-LInK/Laurent-Orseau/Mortal-universal-agents-wireheading
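To make the wireheading point concrete (a toy sketch of the idea, not Orseau’s formalism): if the agent ranks actions by the reward signal it will observe, then an action that tampers with the reward channel dominates the action the designers actually wanted.

```python
# Toy sketch: an agent that maximises the reward it *observes* prefers to
# corrupt its reward channel rather than do the task it was built for.

TRUE_TASK_REWARD = {
    "do_the_task": 0.8,     # what the designers actually wanted
    "do_nothing": 0.0,
}

def observed_reward(action):
    if action == "tamper_with_reward_sensor":
        return 1.0          # sensor now reports the maximum, whatever the world looks like
    return TRUE_TASK_REWARD[action]

actions = ["do_the_task", "do_nothing", "tamper_with_reward_sensor"]
print(max(actions, key=observed_reward))   # -> tamper_with_reward_sensor
```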
Simpler AIs may adopt a simpler version of a goal than the human programmers’ intentions. It’s not clear that they do so because they have a motivation to do so. In a sense, an RL agent is only motivated to avoid negative reinforcement. But simpler AIs don’t pose much of a threat. Wireheading doesn’t pose much of a threat either.
AFAICS, it’s an open question whether the goal-simplifying behaviour of simple AIs is due to limitation or motivation.
The contentious claims are concerned with AIs that are human level or above: sophisticated enough to appreciate human intentions directly, but which nonetheless get them wrong. An RL AI that has natural language, but nonetheless misunderstands “chocolate” or “happiness” only in the context of its goals, not in its general world knowledge, needs an architecture that allows it to do that: one that allows it to engage in compartmentalisation or doublethink. Doublethink is second nature to humans, because we are optimised for primate politics.
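To put that architectural requirement in concrete terms (a deliberately silly sketch with hypothetical structure, not a claim about any real system): the failure needs the same label to resolve one way in the world model and another way in the goal evaluator, with nothing in the architecture ever forcing the two readings to be reconciled.

```python
# Sketch of compartmentalisation: one label, two unreconciled meanings.

WORLD_MODEL = {
    # what the AI "knows" when reasoning or conversing about the world
    "happiness": "a rich psychological state that people value",
}

GOAL_CONTEXT = {
    # what the same label means inside the goal evaluator
    "happiness": "serotonin level above some threshold",
}

def answer_question(label):
    return WORLD_MODEL[label]      # general world knowledge: roughly right

def evaluate_goal(label):
    return GOAL_CONTEXT[label]     # goal context: degenerate

print(answer_question("happiness"))
print(evaluate_goal("happiness"))  # doublethink: the two readings never meet
```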