The distinction I have in mind is that a self-modifying AI can come up with a new thinking algorithm to use and decide to trust it, whereas a non-self-modifying AI could come up with a new algorithm or whatever, but would be unable to trust the algorithm without sufficient justification.
Likewise, if an AI’s decision-making algorithm is immutably hard-coded as “think about the alternatives and select the one that’s rated the highest”, then the AI would not be able to simply “write a new AI … and then just hand off all its tasks to it”; in order to do that, it would somehow have to make it so that the highest-rated alternative is always the one that the new AI would pick. (Of course, this is no benefit unless the rating system is also immutably hard-coded.)
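As a concrete illustration (a minimal sketch, not anything from the discussion itself; `generate_alternatives` and `rate` are hypothetical placeholders), the hard-coded decision rule might look like this:

```python
def decide(world_state, generate_alternatives, rate):
    """Immutably hard-coded decision rule: pick the highest-rated alternative.

    The AI is free to propose any alternatives it likes, including
    "write a new AI and delegate everything to it", but the selection
    step itself never changes: whatever `rate` scores highest wins.
    """
    alternatives = generate_alternatives(world_state)
    return max(alternatives, key=lambda alt: rate(world_state, alt))
```

On this sketch, the only way a "hand off everything to a new AI" plan gets executed is if `rate` happens to score it highest, which is why the scheme buys nothing unless `rate` is hard-coded as well.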
I guess my idea in a nutshell is that instead of starting with a flexible system and trying to figure out how to make it safe, we should start with a safe system and try to figure out how to make it flexible. My main reason for believing this, I think, is that it's probably going to be much easier to understand a safe but inflexible system than a flexible but unsafe one, so if we take this approach, the development process will be easier to understand and will therefore go better.
You're basically saying that the AI should be unable to learn that a process which was effective in the past can be trusted to be effective in the future. I think that would restrict its intelligence a lot.
Yeah, that's a good point. What I want to say is, "oh, a non-self-modifying AI would still be able to hand off control to a sub-AI, but it would automatically check that the sub-AI is behaving correctly, and it wouldn't be able to turn off those checks". But my idea here is definitely starting to feel more like a pipe dream.
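A minimal sketch of that mandatory-check idea, assuming a hypothetical sub-AI with a `propose_action` method and a fixed `verify_action` predicate (this only illustrates the shape of the proposal, not how to make the check itself adequate):

```python
class CheckedDelegation:
    """Wrapper through which a non-self-modifying AI uses a sub-AI.

    The verifier is fixed at construction time, and there is no code
    path that executes a sub-AI action without running it.
    """

    def __init__(self, sub_ai, verify_action):
        self._sub_ai = sub_ai
        self._verify = verify_action  # no setter; the check can't be swapped out

    def act(self, task):
        action = self._sub_ai.propose_action(task)
        if not self._verify(task, action):
            raise RuntimeError("sub-AI proposed an action the check rejects")
        return action
```

Of course, this just relocates the problem into `verify_action`, which is essentially the objection raised below.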
Hmm, there might still be something to be gleaned from attempting to steelman this, or from working in related directions.
Edit: maybe something about an AI not being able to tolerate things it can't make certain proofs about? The problem is that it would have to be able to make those proofs about humans if they are included in its environment, and if they are not included, it might make a UFAI there (intuition pump: a system that consists of a program it can prove everything about, plus humans that the program asks questions to). Yeah, this doesn't seem very useful.
You can't really tell whether something that is smarter than you is behaving correctly. In the end, a non-self-modifying AI checking whether a self-modifying sub-AI is behaving correctly isn't much different, from a safety perspective, from a human checking whether a self-modifying AI is behaving correctly.
Immutably hard-coding something is a lot easier said than done.