This idea isn’t perfect, but there’s some merit to it. It’s better than any of Goertzel’s previous proposals that I’m aware of; I’m glad to see he’s taking the friendliness issue seriously now and looking for ways to deal with it.
I agree that freezing an AI’s intelligence level somewhere short of superintelligence, by building in a time-limited deontological prohibition against self-modification and self-improvement, is probably a good safeguard. However, I think this makes sense only for a much shorter duration, as a step in development and testing. Capping an AI’s intelligence but still giving it full control over the world raises most of the same difficulties and safety issues that a full-blown friendly AI would. There are two main problems. First, the goal-stability problem doesn’t go away entirely just because the AI is avoiding self-modification; it can still suffer value-drift as the world, and the definitions it uses to parse the world, change. Second, there’s a lot of complexity (and chance of disastrous error) hidden in statements like this one:
A strong inhibition against carrying out actions with a result that a strong majority of humans would oppose, if they knew about the action in advance
The problem is that whether humans object depends more on how an action is presented, and subtle factors that the AI could manipulate, than on the action itself. There are obvious loopholes—what about actions which are too complex for humans to understand and object to? What about highly-objectionable actions which can be partitioned into innocent-looking pieces? It’s also quite likely that a majority of humans would develop trust in the AI, such that they wouldn’t object to anything. And then there’s this:
A mandate to be open-minded toward suggestions by intelligent, thoughtful humans about the possibility that it may be misinterpreting its initial, preprogrammed goals
Which sounds like a destabilizing factor and a security hole. It’s very hard to separate being open to corrections that move from incorrect interpretations to better ones, from being open to corrections that move in the wrong direction. This might work if it were limited to taking suggestions from a trustworthy set of exceptionally good human thinkers, though, and if those humans were able to retain their sanity and values in spite of extensive aging.
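To make the partitioning loophole above concrete, here’s a toy sketch (the names, numbers, and polling interface are all hypothetical, nothing here is taken from the proposal): a filter that asks whether a majority would oppose each step in isolation will approve a plan whose combined effect would be vetoed if it were stated plainly.

```python
# Toy sketch of the partitioning loophole (all names, numbers, and the polling
# interface are hypothetical, not taken from the proposal): a filter that asks
# "would a majority of humans oppose this?" about each step in isolation can
# approve a plan whose combined effect, stated plainly, would be vetoed.

from typing import Callable, List

Action = str

def majority_opposes(action: Action, poll: Callable[[Action], float]) -> bool:
    """True if more than half of polled humans would oppose this action."""
    return poll(action) > 0.5

def naive_filter(plan: List[Action], poll: Callable[[Action], float]) -> bool:
    """Approve a plan iff no individual step draws majority opposition.
    The loophole: steps are judged one at a time, never as a combined whole."""
    return all(not majority_opposes(step, poll) for step in plan)

def toy_poll(action: Action) -> float:
    # Hypothetical poll results: each fragment sounds routine on its own,
    # while the plan described as a whole is clearly objectionable.
    opposition = {
        "innocuous-looking step 1": 0.05,
        "innocuous-looking step 2": 0.10,
        "combined effect, stated plainly": 0.90,
    }
    return opposition.get(action, 0.0)

partitioned_plan = ["innocuous-looking step 1", "innocuous-looking step 2"]
print(naive_filter(partitioned_plan, toy_poll))                       # True: approved step by step
print(majority_opposes("combined effect, stated plainly", toy_poll))  # True: vetoed only if judged whole
```

The same shape of failure covers presentation effects more generally: the poll only ever sees whatever description of the action it is handed.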
I’m also unclear on how to reconcile handing over control to a better AI in 200 years with inhibiting the advancement of threatening technologies. The better AI would itself be a threatening technology, and preparing it to take over would require research and prototyping.
I think that an important underlying difference of perspective here is that the Less Wrong memes tend to automatically think of all AGIs as essentially computer programs, whereas Goertzel-like memes tend to automatically think of at least some AGIs as non-negligibly person-like. I think this is at least partially because the Less Wrong memes want to write an FAI that is essentially some machine learning algorithms plus a universal prior on top of sound decision theory, whereas the Goertzel-like memes want to write an FAI that is roughly half program-like and half person-like. Less Wrong memes think that such person-like AIs won’t end up sufficiently person-like, but they sort of tend to assume that conclusion rather than argue for it, which causes memes that aren’t familiar with Less Wrong memes to wonder why Less Wrong memes are so incredibly confident that all AIs will necessarily act like autistic OCD people without any possibility at all of acting like normal reasonable people. From that perspective the Goertzel-like memes look justified in being rather skeptical of Less Wrong memes. After all, it is easy to imagine a gradation between AIXI and whole brain emulations. Goertzel-like memes wish to create an AI somewhere between those two points; Less Wrong memes wish to create an AI that’s even more AIXI-like than AIXI is (in the sense of being more formally and theoretically well-founded than AIXI is). It’s important that each look at the specific kinds of AI that the other has in mind and start the exchange from there.
That’s a hypothesis.
That’s a great insight.
This comment covers most of my initial reactions.
Probably “unauthorized/unsupervised advancement.”