A standard trick reveals that knowing whether a problem has a solution is almost as helpful as knowing the solution. Here is a (very inefficient) way to use this ability, let’s say to find a proof of some theorem. Start by asking a genie: is there a proof of length 1? After destroying or releasing the genie appropriately, create a new genie and ask: is there a proof of length 2? Continue, until eventually one genie finds a proof of length 10000000, say. Then ask: is there a proof of this length which begins with 0? If not, is there a proof which begins with 1? Is there a proof which begins with 10? 101? 100? 1001? 10010? 10011? etc. Once the process concludes, you are left with the shortest, lexicographically earliest proof the genie could find. Each genie you produce will try its best to find a solution to the problem you set it. By hypothesis, the genie isn’t willing to sabotage itself just to destroy human society.
Now you have found an answer to your constraint satisfaction problem which wasn’t hand-picked by the genie. In fact, in some strong but difficult to formalize sense the genie had exactly zero influence over which solution he gave you.
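A minimal sketch of the questioning procedure, assuming proofs are bit strings and each genie is modeled as one call to a yes/no oracle. The names `extract_proof`, `brute_force_oracle`, and the toy validity check are hypothetical illustrations, not part of the original proposal, and the brute-force oracle merely stands in for an honest genie.

```python
from itertools import product
from typing import Callable, Optional

# Hypothetical oracle: "is there a valid proof of exactly this length that
# starts with this prefix?" Each call stands in for one freshly created genie.
Oracle = Callable[[int, str], bool]


def extract_proof(oracle: Oracle, max_length: int) -> Optional[str]:
    """Recover the shortest, lexicographically first proof from yes/no answers."""
    # Phase 1: find the smallest length for which some proof exists.
    length = next((n for n in range(1, max_length + 1) if oracle(n, "")), None)
    if length is None:
        return None
    # Phase 2: fix one bit at a time, trying 0 before 1. If no proof extends
    # prefix + "0", an honest oracle implies one extends prefix + "1".
    prefix = ""
    while len(prefix) < length:
        prefix += "0" if oracle(length, prefix + "0") else "1"
    return prefix


def brute_force_oracle(is_valid: Callable[[str], bool]) -> Oracle:
    """Toy stand-in for an honest genie: exhaustively tries every completion."""
    def oracle(length: int, prefix: str) -> bool:
        rest = length - len(prefix)
        return any(
            is_valid(prefix + "".join(bits)) for bits in product("01", repeat=rest)
        )
    return oracle


# Toy "theorem": a string counts as a proof if it has length >= 3 and contains "11".
toy_oracle = brute_force_oracle(lambda s: len(s) >= 3 and "11" in s)
print(extract_proof(toy_oracle, max_length=6))  # → 011
```

The caller never receives a proof chosen by the oracle. The output is assembled entirely from single-bit answers, which is the sense in which the genie has no say over which valid proof comes out.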
Since there won’t be a proof that any given solution is the shortest, there is clear incentive to lie about which proof is the shortest that genie can find, which gives the genie lots of control over which answer gets picked. Basically, you are suggesting security by obscurity, ostensibly asking the genie one question, while in fact asking another.
Lying in that way would require the genie to prioritize destroying the world over the leverage we’re using against it, which, in the scenario described, it does not.
Then just get the answer straight away, or build a system specifically intended to create a reliable incentive structure for getting a probably honest answer. I object to the particular method of security by obscurity.
The point is to prevent it from optimizing its answer for secondary characteristics besides correctness, when multiple answers are possible. If you “just get the answer straight away”, it chooses among all the possible answers one that is correct, honest, and maximally poisonous to you.
So basically, the supposed law behind the method is in getting multiple correctness measurements, which together are expected to strengthen the estimation of correctness of the result.
Two conflicting problems in this particular case. If the measurements are supposed to be independent, so that multiple data points actually reinforce each other and many questions are better than one question, then you can’t expect the different genies to be talking about the same answer, and so can’t elicit it in the manner described in the post. Conversely, if you can in fact rely on all those genies talking about the same answer, then you can also rely on all your measurements not being independent, and your answer will be picked according to exactly the same criteria as if it were asked straight away.
What? No, that’s not paulfchristiano’s proposal at all. The reason for splitting up the measurement is to incentivize the genies to give you the proof which is lexicographically first. A perfect proof checker is assumed, so you know that if a proof is given, the result is correct; and the incentives are such that if the genie can find a proof, it will give it.
What reason acts to prevent the genie from giving a proof that is not the lexicographically smallest it can find?
If it did that, then there would be at least one subgenie that didn’t optimize its utility function—either one reporting no such proof of length n and getting punished when it could have provided a matching proof, or one reporting no such proof of length n and prefix p when it could have provided one and, again, getting punished. Remember, while each invocation (subgenie) has the same utility function, by assumption that function refers to something that can be done to the particular invocation which optimizes it.
If there is any genie-value to obtaining control over the real world, instances of genies will coordinate their decisions to get it.
You can’t fight superintelligent genies with explicit dependence bias. They are not independent, they can coordinate, and so they will, even if you can frame the problem statement in a way that suggests that they can’t communicate. Worse, as I pointed out, in this particular game they must coordinate to get anything done.
You are arguing that the stated assumptions about the genie’s utility function are unrealistic (which may be true), but presenting it as though you had found a flaw in the proof that follows from those assumptions.
It seems like the assumptions about utility, even if they hold, don’t actually deliver the behavior you expect, because the genies can coordinate. Unless the incentive structure makes sure they won’t try to take over the world in any case, it doesn’t make sure that they won’t try to take over the world if you only ask each of them to answer one binary question either.
Think of the genies as making a single decision that results in all the individual actions of all the individual genies. For the genies, having multiple actors just means raising the stakes by threatening to not free more genies, which you could’ve done for a single-question wish as easily instead of creating an elaborate questioning scheme. You could even let them explicitly discuss what to answer to your question, threatening to terminate them all if examination of their answer reveals any incorrectness!
Edit: See also Eliezer’s comment.
There is absolutely no sense in which this scheme is security by obscurity. My claim is that the genie will respect my wishes even though he knows exactly what I am doing, because he values my generosity right now more than the promise of taking over the world later.
Again, if your genie is already incentivized to be honest, in what sense is your scheme with all its bells and whistles better than asking for the shortest answer the genie can find, in plain English?
It is not magically incentivized to be honest. It is incentivized to be honest because each query is constructed precisely such that an honest answer is the rational thing to do, under relatively weak assumptions about its utility function. If you ask in plain English, you would actually need magic to produce the right incentives.
My question is about the difference. Why exactly is the plain question different from your scheme?
(Clearly your position is that your scheme works, and therefore “doesn’t assume any magic”, while the absence of your scheme doesn’t, and so “requires magic in order to work”. You haven’t told me anything I don’t already know, so it doesn’t help.)
Here is the argument in the post more concisely. Hopefully this helps:
It is impossible to lie and say “I was able to find a proof” by the construction of the verifier (if you claim you were able to find a proof, the verifier needs to see the proof to believe you.) So the only way you can lie is by saying “I was not able to find a proof” when you could have if you had really tried. So incentivizing the AI to be honest is precisely the same as incentivizing them to avoid admitting “I was not able to find a proof.” Providing such an incentive is not trivial, but it is basically the easiest possible incentive to provide.
I know of no way to incentivize someone to answer the plain question easily just based on your ability to punish them or reward them when you choose to. Being able to punish them for lying involves being able to tell when they are lying.
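The asymmetry in this argument can be sketched as a single verified query. This is only an illustration under stated assumptions: `run_query`, the `verify` callback, and the boolean "rewarded" outcome are hypothetical names, and the perfect proof checker assumed in the post is modeled as an ordinary function.

```python
from typing import Callable, Optional, Tuple


def run_query(
    genie: Callable[[], Optional[str]],
    verify: Callable[[str], bool],
) -> Tuple[bool, Optional[str]]:
    """Ask one genie one question; return (rewarded, accepted_proof)."""
    claimed = genie()  # None models the answer "I was not able to find a proof"
    if claimed is not None and verify(claimed):
        return True, claimed  # a "yes" only counts when the proof checks out
    return False, None  # "no", or a proof the verifier rejects: no reward


# Toy verifier (assumption): the only valid proof is the string "101".
def verify(proof: str) -> bool:
    return proof == "101"


print(run_query(lambda: "101", verify))  # honest "yes" is rewarded
print(run_query(lambda: "000", verify))  # a false "yes" cannot get through
print(run_query(lambda: None, verify))   # withholding a proof forfeits the reward
```

The only undetectable lie left is a "no" from a genie that could have produced a proof, and that is exactly the answer the reward rule makes unattractive.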
See this comment.
Are you saying you think Christiano’s scheme is overkill? Presumably we don’t have to sacrifice a virgin in order to summon a new genie, so it doesn’t look expensive enough to matter.
I’m saying that it’s not clear why his scheme is supposed to add security, and it looks like it doesn’t. If it does, we should understand why, and optimize that property instead of using the scheme straight away, and if it doesn’t, there is no reason to use it. Either way, there is at least one more step to be made. (In this manner, it could work as raw material for new lines of thought where we run out of ideas, for example.)
As I understand it, the genie is not incentivized to be honest. It is incentivized to not get caught being dishonest. And the reason for the roundabout way of asking the question is to make the answer-channel bandwidth as narrow as possible.
It is impossible to be dishonest by saying “yes,” by construction. The genie is incentivized to say “yes” whenever possible, so it is disincentivized to be dishonest by saying “no.” So the genie is incentivized to be honest, not just to avoid being called out for dishonesty.
Since we care about the genie actually being honest, the technique can be thought of as a way of making it more likely that the genie is honest, with the threat of punishing dishonesty a component of that technique.
If you believe that I can really incentivize a particular genie (recall that each question is answered by a separate genie) to solve the problem if they can, then this isn’t an issue. The genie will give you the shortest proof they possibly can, because the thing they care most about is saving themselves, which depends only on whether or not they find a short enough proof.
Perhaps you are objecting to my claim that I can actually incentivize a genie to try their best, rather than sacrificing themselves to allow a future genie to escape. I believe that this problem is unequivocally massively easier than friendliness. For example, like I said, it would suffice to design an AGI whose goal is to make sure the reward button is pressed before the punishment button. It would also be sufficient to design an AGI which wants the reward button to be pressed as soon as possible, or basically any AGI whose utility function only depends upon some bounded time in the future (or which stops caring about things after some particular event occurs).
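As a rough illustration of a utility function that stops caring after a particular event, here is a toy version of the "reward button before punishment button" goal. The event names and the list-of-events representation are assumptions made for this sketch. It says nothing about how to get an AI to actually hold such a utility function, which is the hard part.

```python
from typing import Sequence


def utility(events: Sequence[str]) -> int:
    """Return 1 if "reward" occurs before any "punish", else 0.

    Scanning stops at the first button press, so nothing that happens
    afterwards can change the value.
    """
    for event in events:
        if event == "reward":
            return 1
        if event == "punish":
            return 0
    return 0  # neither button was ever pressed


# Whatever follows the first button press is irrelevant to the score.
assert utility(["reward"]) == utility(["reward", "take_over_world"]) == 1
assert utility(["punish", "reward", "take_over_world"]) == 0
```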
There is no “saving themselves”, there is only optimization of utility. If there is any genie-value to obtaining control over the real world, instances of genies will coordinate their decisions to get it.
Indeed, by saving themselves I was appealing to my analogy. This relies on the construction of a utility function such that human generosity now is more valuable than world domination later. I can write down many such utility functions easily, as contrasted with the difficulty of describing friendliness, so I can at least hope to design an AI which has one of them.
What you can do when you can write down stable utility functions—after you have solved the self-modification stability problem—but you can’t yet write down CEV—is a whole different topic from this sort of AI-boxing!
I don’t think I understand this post.
My claim is that it is easier to write down some stable utility functions than others. This is intimately related to the OP, because I am claiming as a virtue of my approach to boxing that it leaves us with the problem of getting an AI to follow essentially any utility function consistently. I am not purporting to solve that problem here, just making the claim that it is obviously no harder and almost obviously strictly easier than friendliness.
Then you don’t need the obscurity part.
I don’t understand. How do you propose to incentivize the genie appropriately? I haven’t done anything for the purpose of obscurity, just for the purpose of creating an appropriate incentive structure. I see no way to get the answer out in one question in general; that would certainly be better if you could do it safely.
If there’s any fooming to be done, then in the design of your genie AI you must solve the problem of maintaining utility invariants over self-modification. This might well be the hardest part of the FAI problem. On the other hand, without fooming, building a superhuman GAI might be much more difficult.
I don’t think this is entirely fair. If the AI has a simple goal—press the button—then I think it is materially easier for the AI to modify itself while preserving the button-pressing goal than if it has a very complex goal which is difficult to even describe.
I do agree the problem is difficult, but I don’t think it is in the same league as friendliness. I actually think it is quite approachable with modern techniques. Not to mention that the AI’s first step is to work out such techniques on its own, to guarantee that it presses the button, once it becomes intelligent enough to develop them.
Suppose you can verify that an AI obeys a simple utility function (‘press a button’), but can’t verify if it obeys a complex one (‘implement CEV of humanity’).
Suppose also that you can formally describe the complex function and run it (i.e. output the utility value of a proposed action or scenario), you just can’t verify that an AI follows this function. In other words, you can build an AI that follows this complex function, but you don’t know what will happen when it self-modifies.
You might proceed as follows. Run two copies of the AI. Copy 1 has the complex utility function, and is not allowed to self-modify (it might therefore be quite slow and limited). Copy 2 has the following simple utility function: “ask AI copy 1 for permission for all future actions. Never modify AI copy 1’s behavior.” Copy 2 may self-modify and foom.
This puts some limits on the kinds of self-modification that copy 2 will be able to do. It won’t be able to take clever shortcuts that preserve its utility function in non-obvious ways. But it will be able to find a superintelligent way of doing something, if that something is sufficiently self-contained that it can describe it to copy 1.
Of course, if AI copy 2 morphs into something truly adversarial, then copy 1’s utility function had better be very foolproof. But this is something that has to be solved anyway to implement the seed AI’s utility function in a fooming FAI scenario.
So I think verifying simple utility functions under self-modification is indeed 90% of the way towards verifying complex ones.
I don’t think you’ve noticed that this is just moving the fundamental problem to a different place. For example, you haven’t specified things like:
Don’t lie to AI 1 about your actions
Don’t persuade AI 1 to modify itself
Don’t find loopholes in the definition of “AI 1” or “modify”
etc., etc. If you could enforce all these things over superintelligent self-modification, you’d already have solved the general FAI problem.
IOW, what you propose isn’t actually a reduction of anything, AFAICT.
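I noticed this but didn’t explicitly point it out. My point was that when paulfchristiano said:
“[...] I think it is materially easier for the AI to modify itself while preserving the button-pressing goal [...] the problem is difficult, but I don’t think it is in the same league as friendliness”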
He was also assuming that he could handle your objections, e.g. that his AI wouldn’t find a loophole in the definition of “pressing a button”. So the problem he described was not, in fact, simpler than the general problem of FAI.
I think that much of the difficulty with friendliness is that you can’t write down a simple utility function such that maximizing that utility is friendly. By “complex goal” I mean one which is sufficiently complex that articulating it precisely is out of our league.
I do believe that any two utility functions you can write down precisely should be basically equivalent in terms of how hard it is to verify that an AI follows them.