I don’t think this is entirely fair. If the AI has a simple goal—press the button—then I think it is materially easier for the AI to modify itself while preserving the button-pressing goal than if it has a very complex goal which is difficult to even describe.
I do agree the problem is difficult, but I don’t think it is in the same league as friendliness. I actually think it is quite approachable with modern techniques. Not to mention that the AI’s first step would be to work out such techniques on its own to guarantee that it presses the button, once it becomes intelligent enough to develop them.
Suppose you can verify that an AI obeys a simple utility function (‘press a button’), but can’t verify if it obeys a complex one (‘implement CEV of humanity’).
Suppose also that you can formally describe the complex function and run it (i.e., output the utility value of a proposed action or scenario); you just can’t verify that an AI follows this function. In other words, you can build an AI that follows this complex function, but you don’t know what will happen when it self-modifies.
You might proceed as follows. Run two copies of the AI. Copy 1 has the complex utility function, and is not allowed to self-modify (it might therefore be quite slow and limited). Copy 2 has the following simple utility function: “ask AI copy 1 for permission for all future actions. Never modify AI copy 1’s behavior.” Copy 2 may self-modify and foom.
This puts some limits on the kinds of self-modification that copy 2 will be able to do. It won’t be able to take clever shortcuts that preserve its utility function in non-obvious ways. But it will be able to find a superintelligent way of doing something, if that something is sufficiently self-contained that it can describe it to copy 1.
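To make the arrangement concrete, here is a minimal sketch of the division of labor I have in mind. All of the names and the toy utility function below are illustrative placeholders, not a real design:

```python
# Illustrative sketch only: the classes stand in for whatever the real AIs
# would be, and the "utility function" is a toy.

def complex_utility(action: str) -> float:
    """Stand-in for the formally described but hard-to-verify utility
    function (something CEV-like). Here it just scores a toy action."""
    return 1.0 if "help" in action else -1.0


class Copy1:
    """Holds the complex utility function and never self-modifies."""

    def __init__(self, utility):
        self.utility = utility

    def approves(self, proposed_action: str) -> bool:
        # Grant permission only if the proposal scores positively under
        # the complex utility function copy 1 was built with.
        return self.utility(proposed_action) > 0


class Copy2:
    """May self-modify and foom, but its whole goal is to ask copy 1 for
    permission before every action and never to modify copy 1."""

    def __init__(self, overseer: Copy1):
        self.overseer = overseer

    def act(self, proposed_action: str) -> bool:
        if self.overseer.approves(proposed_action):
            print(f"executing: {proposed_action}")
            return True
        print(f"vetoed: {proposed_action}")
        return False


if __name__ == "__main__":
    agent = Copy2(Copy1(complex_utility))
    agent.act("help the user")        # approved by copy 1
    agent.act("seize more hardware")  # vetoed by copy 1
```

The point of the sketch is only that copy 2’s goal (“get copy 1’s approval, don’t touch copy 1”) is simple, while all the hard-to-specify complexity stays inside copy 1.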
Of course, if AI copy 2 morphs into something truly adversarial, then copy 1’s utility function had better be very foolproof. But this is something that has to be solved anyway to implement the seed AI’s utility function in a fooming FAI scenario.
So I think verifying simple utility functions under self-modification is indeed 90% of the way towards verifying complex ones.
I don’t think you’ve noticed that this is just moving the fundamental problem to a different place. For example, you haven’t specified things like:
Don’t lie to AI 1 about your actions
Don’t persuade AI 1 to modify itself
Don’t find loopholes in the definition of “AI 1” or “modify”
etc., etc. If you could enforce all these things over superintelligent self-modification, you’d already have solved the general FAI problem.
IOW, what you propose isn’t actually a reduction of anything, AFAICT.
I noticed this but didn’t explicitly point it out. My point was that when paulfchristiano said:
If the AI has a simple goal—press the button—then I think it is materially easier for the AI to modify itself while preserving the button-pressing goal [...] the problem is difficult, but I don’t think it is in the same league as friendliness
He was also assuming that he could handle your objections, e.g. that his AI wouldn’t find a loophole in the definition of “pressing a button”. So the problem he described was not, in fact, simpler than the general problem of FAI.
I think that much of the difficulty with friendliness is that you can’t write down a simple utility function such that maximizing that utility is friendly. By “complex goal” I mean one which is sufficiently complex that articulating it precisely is out of our league.
I do believe that any two utility functions you can write down precisely should be basically equivalent in terms of how hard it is to verify that an AI follows them.