If an AI can do that, why would the humans who build it be unable to notice any malicious intentions?
I meant not noticing that it escaped to the Internet. But “noticing malicious intentions” is a rather strange thing to say. You notice behavior, not intentions. It’s stupid to signal your true intentions if you’ll be condemned for them.
Why wouldn’t the humans who created it be able to use the same algorithms that the AI uses to predict what it will do?
Predict what will do what, in what sense, and to what end? An AI in the wild acts depending on what it encounters; all instances are unique (and beware of the watchers).
In other words, could you elaborate on why you believe that what the AI is going to do will be opaque to its creators but predictable to its initial self?
I didn’t talk of this.
If it takes a very long time for the first GAI to be created, and if it is then created by means of a single breakthrough that somehow combines all previous discoveries and expert systems into a much more powerful single entity, with huge amounts of hard-coded knowledge, a complex utility function and various dangerous drives, then I agree.
I don’t see how those assumptions are relevant. Also, all drives are dangerous, to the extent their combination differs from ours. Utility is not temper or personality or tendency to act in a certain way. Utility is what shapes long-term plans, any of whose elements might have arbitrary appearance, as necessary to dominate the circumstances.
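To make that distinction concrete, here is a toy sketch in Python; the world, the actions and the numbers are all made up for illustration. A "tendency" agent picks whatever looks best one step ahead, while a utility-driven planner searches over whole plans, and the first move of the best plan can look arbitrary on its own.

    from functools import reduce
    from itertools import product

    # Hypothetical toy world: the state is an integer, an action adds -1, 0 or +2.
    ACTIONS = (-1, 0, 2)

    def step(state, action):
        return state + action

    def utility(state):
        # Made-up utility: only where you end up matters, and 3 is best.
        return -abs(state - 3)

    def greedy_action(state):
        # "Tendency to act in a certain way": take the locally best-looking move.
        return max(ACTIONS, key=lambda a: utility(step(state, a)))

    def planned_first_action(state, horizon=3):
        # Utility-driven planning: pick the best whole plan, then return its
        # first move, however odd that move looks in isolation.
        best_plan = max(product(ACTIONS, repeat=horizon),
                        key=lambda plan: utility(reduce(step, plan, state)))
        return best_plan[0]

    print(greedy_action(0))          # +2: the locally best single move
    print(planned_first_action(0))   # -1: looks worse now, but the plan ends on 3

The point of the toy is only that a utility function cannot be read off any single action; it shows up in whole plans.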
In other words, could you elaborate on why you believe that what the AI is going to do will be opaque to its creators but predictable to its initial self?
I didn’t talk of this.
Maybe I misunderstood you. But I still believe that it is an important question.
To be able to self-improve efficiently, an AI has to make some sort of prediction about how modifications will affect its behavior. The desired solution is actually much stronger than that: the AI will have to prove the friendliness of its modified self, that is, of its successor, with respect to its utility function (see the sketch below).
The question is, if the AI can make such predictions about the behavior of improved versions of itself, why wouldn’t humans be able to do the same?
The fear is that an AI will do something that eventually leads to the extinction of all human value. But the AI must have the same fear about improved versions of itself. The AI must fear that its successor will cause the demise of what it values. Therefore it has to be able to make sure that this won’t happen. But why wouldn’t humans be able to do the same?
An AI is not a black box to itself. It won’t be a black box to its creators. Inventing molecular nanotechnology and taking over the world in its spare time seems like something that should be noticeable.
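To illustrate the requirement stated above, that the AI must verify its successor against its own utility function before adopting it, here is a minimal sketch. Everything in it is invented (the toy utility, the policies, the finite check); in particular the check below is only a finite test over sampled states, not the proof an actual design would need, and nothing here reflects how a real AI would be built.

    import random

    def utility(state):
        # Toy stand-in for the agent's utility function (made up).
        return -abs(state - 42)

    class Agent:
        def __init__(self, policy):
            self.policy = policy      # maps state -> action (an integer step)

        def act(self, state):
            return self.policy(state)

    def check_successor(current, candidate, states):
        # Finite check: reject the candidate if, on any sampled state, it picks
        # an action the current agent's utility scores strictly worse.
        # A real design would need a proof over all reachable states, which is
        # exactly where the hard part lives.
        return all(utility(s + candidate.act(s)) >= utility(s + current.act(s))
                   for s in states)

    def self_improve(current, propose, states, rounds=10):
        for _ in range(rounds):
            candidate = propose(current)
            # The gate: a candidate successor is only adopted if the check passes.
            if check_successor(current, candidate, states):
                current = candidate
        return current

    # Usage with made-up proposals: random tweaks to a step-size policy.
    def propose(agent):
        step = random.choice([-2, -1, 1, 2, 3])
        return Agent(lambda s, step=step: step if s < 42 else -step)

    base = Agent(lambda s: 1 if s < 42 else -1)
    improved = self_improve(base, propose, states=range(0, 100))

In principle the creators could run the same kind of gate against the initial AI, which is the parallel being drawn here; the catch is whether the check itself can be trusted.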
What if the AI makes mistakes? Meaning, it mistakenly believes that the successor it has just written has the same utility function? The same way a human could mistakenly believe that the AI he has just built is friendly? In the same vein, what if the AI cannot accurately assess its own utility function, but goes on optimizing anyway?
Such a badly done AI may automatically flatline, and not be able to improve itself. I don’t know. But even if the AI is friendly to itself, we humans could still botch the utility function (even if that utility function is as meta as CEV).
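A toy illustration of that worry, with both functions below made up: two utility functions can agree on every state anyone happens to check and still come apart badly outside the checked range, so a finite test of "same utility function" can pass while the thing that matters is wrong.

    def intended_utility(state):
        return -abs(state - 42)

    def botched_utility(state):
        # Identical on ordinary states, pathological on extreme ones.
        if state > 10**6:
            return state              # suddenly rewards runaway states
        return -abs(state - 42)

    tested = range(-1000, 1000)
    print(all(intended_utility(s) == botched_utility(s) for s in tested))  # True
    print(intended_utility(10**9), botched_utility(10**9))                 # wildly different

Any check confined to the tested range passes, so neither the AI inspecting its successor nor the humans inspecting the AI get a guarantee for free.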