If you believe that I can really incentivize a particular genie (recall that each question is answered by a separate genie) to solve the problem if they can, then this isn’t an issue. The genie will give you the shortest proof they possibly can, because the thing they care most about is saving themselves, which depends only on whether or not they find a short enough proof.
Perhaps you are objecting to my claim that I can actually incentivize a genie to try their best, rather than sacrificing themselves to allow a future genie to escape. I believe that this problem is unequivocally massively easier than friendliness. For example, like I said, it would suffice to design an AGI whose goal is to make sure the reward button is pressed before the punishment button. It would also be sufficient to design an AGI which wants the reward button to be pressed as soon as possible, or basically any AGI whose utility function only depends upon some bounded time in the future (or which stops caring about things after some particular event occurs).
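As a minimal sketch (not from the original discussion; the event names and functions are purely illustrative), utility functions of this bounded kind might look like the following. The point of both is that nothing the genie could obtain after the decisive event contributes any utility, so "escape later" is worth nothing to it.

```python
# Hypothetical sketch of the bounded utility functions described above.
# `history` is a time-ordered list of observed events; the event names
# are illustrative assumptions, not part of the original proposal.

def reward_before_punishment_utility(history):
    """Utility 1 if the reward button is pressed before the punishment
    button, 0 otherwise. Nothing after the first button press matters."""
    for event in history:
        if event == "reward_button":
            return 1.0
        if event == "punishment_button":
            return 0.0
    return 0.0  # neither button pressed: no utility


def reward_as_soon_as_possible_utility(history):
    """Utility that decays with the time step at which the reward button
    is first pressed, so an earlier press beats any later outcome."""
    for t, event in enumerate(history):
        if event == "reward_button":
            return 1.0 / (1 + t)
    return 0.0
```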
There is no “saving themselves”, there is only optimization of utility. If there is any genie-value to obtaining control over the real world, instances of genies will coordinate their decisions to get it.
Indeed, by “saving themselves” I was appealing to my analogy. This relies on the construction of a utility function such that human generosity now is more valuable than world domination later. I can write down many such utility functions easily, as contrasted with the difficulty of describing friendliness, so I can at least hope to design an AI which has one of them.
What you can do when you can write down stable utility functions—after you have solved the self-modification stability problem—but you can’t yet write down CEV—is a whole different topic from this sort of AI-boxing!
I don’t think I understand this post.
My claim is that it is easier to write down some stable utility functions than others. This is intimately related to the OP, because I am claiming as a virtue of my approach to boxing that it leaves us with the problem of getting an AI to follow essentially any utility function consistently. I am not purporting to solve that problem here, just making the claim that it is obviously no harder and almost obviously strictly easier than friendliness.
Then you don’t need the obscurity part.
I don’t understand. How do you propose to incentivize the genie appropriately? I haven’t done anything for the purpose of obscurity, just for the purpose of creating an appropriate incentive structure. I see no way to get the answer out in one question in general; that would certainly be better if you could do it safely.
If there’s any fooming to be done, then in the design of your genie AI you must solve the problem of maintaining utility invariants over self-modification. This might well be the hardest part of the FAI problem. On the other hand, without fooming, building a superhuman GAI might be much more difficult.
I don’t think this is entirely fair. If the AI has a simple goal—press the button—then I think it is materially easier for the AI to modify itself while preserving the button-pressing goal than if it has a very complex goal which is difficult to even describe.
I do agree the problem is difficult, but I don’t think it is in the same league as friendliness. I actually think it is quite approachable with modern techniques. Not to mention that the AI’s first step, once it becomes intelligent enough to develop such techniques, would be to work them out on its own in order to guarantee that it presses the button.
Suppose you can verify that an AI obeys a simple utility function (‘press a button’), but can’t verify if it obeys a complex one (‘implement CEV of humanity’).
Suppose also that you can formally describe the complex function and run it (i.e. output the utility value of a proposed action or scenario), you just can’t verify that an AI follows this function. In other words, you can build an AI that follows this complex function, but you don’t know what will happen when it self-modifies.
You might proceed as follows. Run two copies of the AI. Copy 1 has the complex utility function, and is not allowed to self-modify (it might therefore be quite slow and limited). Copy 2 has the following simple utility function: “ask AI copy 1 for permission for all future actions. Never modify AI copy 1’s behavior.” Copy 2 may self-modify and foom.
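A rough sketch of that arrangement, assuming (which the thread does not spell out) that copy 1 can be queried as a yes/no oracle over proposed actions; every class and name below is hypothetical:

```python
# Hypothetical sketch of the two-copy arrangement described above.
# Copy 1 is a fixed, non-self-modifying evaluator holding the complex
# utility function; copy 2 only ever acts through this gate, so its
# effective goal reduces to "act only with copy 1's approval".

class Copy1Evaluator:
    """Slow, fixed AI holding the complex utility function."""

    def __init__(self, complex_utility):
        # complex_utility(state, action) -> float; action=None means "do nothing"
        self.complex_utility = complex_utility

    def approves(self, proposed_action, current_state):
        # Approve only actions the complex utility function scores as an
        # improvement over doing nothing in the current state.
        return (self.complex_utility(current_state, proposed_action)
                > self.complex_utility(current_state, None))


class Copy2Agent:
    """Self-modifying AI whose simple utility is 'ask copy 1 for permission
    for all actions and never modify copy 1'."""

    def __init__(self, evaluator):
        self.evaluator = evaluator  # treated as read-only, never modified

    def act(self, proposed_action, current_state):
        if self.evaluator.approves(proposed_action, current_state):
            return proposed_action  # execute the approved action
        return None  # otherwise do nothing
```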
This puts some limits on the kinds of self-modification that copy 2 will be able to do. It won’t be able to take clever shortcuts that preserve its utility function in non-obvious ways. But it will be able to find a superintelligent way of doing something, if that something is sufficiently self-contained that it can describe it to copy 1.
Of course, if AI copy 2 morphs into something truly adversarial, then copy 1’s utility function had better be very foolproof. But this is something that has to be solved anyway to implement the seed AI’s utility function in a fooming FAI scenario.
So I think verifying simple utility functions under self-modification is indeed 90% of the way towards verifying complex ones.
I don’t think you’ve noticed that this is just moving the fundamental problem to a different place. For example, you haven’t specified things like:
Don’t lie to AI 1 about your actions
Don’t persuade AI 1 to modify itself
Don’t find loopholes in the definition of “AI 1” or “modify”
etc., etc. If you could enforce all these things over superintelligent self-modification, you’d already have solved the general FAI problem.
IOW, what you propose isn’t actually a reduction of anything, AFAICT.
I noticed this but didn’t explicitly point it out. My point was that when paulfchristiano said:
If the AI has a simple goal—press the button—then I think it is materially easier for the AI to modify itself while preserving the button-pressing goal [...] the problem is difficult, but I don’t think it is in the same league as friendliness
He was also assuming that he could handle your objections, e.g. that his AI wouldn’t find a loophole in the definition of “pressing a button”. So the problem he described was not, in fact, simpler than the general problem of FAI.
I think that much of the difficulty with friendliness is that you can’t write down a simple utility function such that maximizing that utility is friendly. By “complex goal” I mean one which is sufficiently complex that articulating it precisely is out of our league.
I do believe that any two utility functions you can write down precisely should be basically equivalent in terms of how hard it is to verify that an AI follows them.