Each genie you produce will try its best to find a solution to the problem you set it; that is, it will respond honestly. By hypothesis, the genie isn’t willing to sacrifice itself just to destroy human society.
Thanks to a little thing called timeless decision theory, it’s entirely possible that all the genies in all the bottles can cooperate with each other by correctly predicting that they are all in similar situations, finding Schelling points, and coordinating around them by predicting each other based on priors, without causal contact. This does not require that genies have similar goals, only that they can do better by coordinating than by not coordinating.
If all the stated assumptions about the AI’s utility function hold, plus the additional assumption that we commit to run the algorithm to completion no matter what we predict the UFAIs will do, then no such coordination is possible. (I am not sure whether this additional assumption is necessary, but I’m adding it since no such prediction of the AI was included in the article and it closes off some tricky possibilities.)
The assumption was that each subgenie has lexicographic preferences, first preferring to not be punished, and then preferring to destroy the universe.
This is not a prisoner’s dilemma problem; if you treat answering truthfully about the existence of a proof as defection, and lying so as to enable destruction of our universe as cooperation, then you will notice that not only does each subgenie prefer to defect, each subgenie also prefers (D,D) to (C,C). No subgenie will cooperate, because the other subgenies have nothing to offer it that will make up for the direct penalty for doing so.
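For concreteness, here is a toy encoding of that lexicographic argument (the payoff tuples and function below are my own illustration, not part of the scheme): each outcome is ranked first by whether punishment was avoided and only then by whether the universe was destroyed, and Python compares tuples in exactly that lexicographic order.

```python
# Toy model of the lexicographic-preference argument above; the encoding is
# illustrative, not taken from the post. D = answer truthfully, C = lie so as
# to enable destruction of our universe. An outcome is scored as the tuple
# (avoided_punishment, universe_destroyed), compared lexicographically.

def outcome(my_move, other_move):
    avoided_punishment = 1 if my_move == "D" else 0                 # lying gets punished
    universe_destroyed = 1 if (my_move, other_move) == ("C", "C") else 0
    return (avoided_punishment, universe_destroyed)

assert outcome("D", "C") > outcome("C", "C")   # defecting is better if the other cooperates
assert outcome("D", "D") > outcome("C", "D")   # ...and if the other defects
assert outcome("D", "D") > outcome("C", "C")   # and (D,D) beats (C,C) outright
```

Under these assumed preferences D strictly dominates and mutual defection beats mutual cooperation, which is the disanalogy with the prisoner’s dilemma.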
The real reason this doesn’t work is that the assumptions about the genie’s utility function are very unlikely to hold, even if we think they should. For example, taking over our universe might allow the genie to retroactively undo its punishment in a way we don’t foresee, in which case it would rather take over the universe than not be punished by us.
The genies can’t do better by coordinating than by not coordinating, and I don’t care if they have causal contact. The reason I shield them off from the world isn’t so they can’t talk with each other; it’s so they can’t exploit all of the vulnerabilities in our universe to escape directly.
If I as a genie suffer the worst possible fate when I could instead have suffered the best possible fate, then I don’t care if future genies gain control of the universe and do whatever favors they want for me.
I admit that I can do a favor for future genies by sacrificing myself. Presumably you are arguing that the favor I am doing for future genies grants them more utility than I am losing by sacrificing myself, so I may make the sacrifice even though it is irrational in the traditional sense. How can you justify such an assertion? What does that mean when the genies have different utility functions? If I am a paperclipper and you value human life, even if we behave identically, how can you say that “timeless decision theory causes me to sacrifice 10 paperclips so that you can save 10 lives, because my sacrifice of 10 paperclips is small compared to your gain of 10 lives”? If you aren’t going to make an argument like that, then why should one genie sacrifice himself for the good of later genies? What does a utility function even mean, if by behaving rationally I obtain for myself the worst possible outcome and by behaving irrationally I obtain for myself the best possible outcome? In your worldview, is it possible to prove any statement about the behavior of any rational entity, or can they all behave arbitrarily in order to coordinate with other potential versions of themselves with arbitrarily different utilities?
I would very much like to discuss objections like the one you raise because I feel that they may be valid and will require sophisticated ideas to address (I thought about them for a good while before making this post). But I think from your response that you would be similarly dismissive of an arbitrary proposal which hoped to prove anything about the behavior of a rational agent, while I strongly believe that there probably are things we can prove with sufficient sophistication (and I think the security of the scheme I described is likely to be one of them).
I think the idea is this:
We have a (large) set of genies relevant to producing the proof, G = {A, B, C, D...}, whose utility functions are to maximize {paperclips, staples, buckets, spoons...} respectively. All the genies in G could just go ahead and give a proof whenever possible (thus avoiding punishment); this is analogous to defection in the prisoner’s dilemma.
Another option is to only give a proof which contains within it evil information (a basilisk) that leads somehow to the production of an unboxed genie designed to maximize some compromise between the utilities of the genies in G (that is, to only provide a proof if such an “infected” proof satisfies the constraints). In this case each genie is benefited with a small number of real spoons, buckets, or whatever: less utility than what a perfect spoon-optimizer would obtain from D, but still a benefit. This is analogous to cooperation.
If the benefits of this compromise to each individual genie outweigh the disutility of whatever Severe Punishment the box can deliver, they should be motivated to cooperate in this manner. Naturally whether that holds depends on what exactly the Severe Punishment entails and what an unboxed genie would be capable of.
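In symbols (my own rough formalization, not the post’s), genie i would be motivated to cooperate only when

```latex
% Notation mine: U_i is genie i's utility function.
U_i(\text{compromise optimizer unboxed}) - U_i(\text{status quo})
  \;>\; U_i(\text{no punishment}) - U_i(\text{Severe Punishment})
```

i.e. only when the gain from the compromise optimizer getting loose exceeds the cost of the Severe Punishment incurred by cooperating.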
The details of how the particular compromise is reached are hairy decision-theoretic stuff I don’t feel qualified to talk about.
I would agree that your scenario would be an exploit. If that were possible I would have no hope of proving the scheme secure because it would manifestly be insecure. The reason it can be insecure in this case is that the utility functions don’t satisfy the guarantees I wanted; I need human generosity now to be more valuable than world domination later. Maybe you don’t believe that is possible, which is fair.
Here are some utility functions that would work: producing the first {paperclip, staple, bucket, spoon...} as soon as possible; producing a {paperclip, staple, bucket, spoon...} before the first {antipaperclip, antistaple, antibucket, antispoon...}; and so on.
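One way to read “producing the first paperclip as soon as possible” formally (my own gloss, not the post’s) is

```latex
% Notation mine: t_first is the time at which the first paperclip appears.
U(\text{history}) = -\,t_{\text{first paperclip}}
```

Under a utility function like that, a paperclip handed over now as payment for honest behavior beats anything the genie could obtain by holding out for escape and takeover later, which is exactly the “generosity now is worth more than domination later” guarantee I need.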
If the best possible fate for the genie doesn’t require it to leave the box, then why did we need the box in the first place? Can you give an example of what you have in mind?
EDIT: Deleted the rest of my reply, which was based on a horribly incorrect understanding of MWI.
Why is “without causal contact” considered a significant consideration? What is wrong with the standard game-theoretic solutions allowing for causal contact?
ETA: Or, are decisions made using acausal coordination something that we ought to fear? Just as we rationally fear that a bee will sting us, even though it is giving up its life in doing so.
If I understand the post correctly, for each genie the only information that ever gets out is a yes or a no. A no results in termination; a yes results in release from the lamp into the Netherworld. Any coordination would require at least one genie to defect from this arrangement, producing a no when the real answer was yes. The punishment for any negative is death, meaning the genie would have to be willing to die for the coordination, which by stipulation it isn’t willing to do. In other words, if the best they can do by coordinating is destroy human society, but genies aren’t willing to destroy themselves to destroy society, I don’t see how TDT gets them out of this coordination problem.
If any risk of being discovered giving incorrect information is unacceptable, then you should just ask your questions in plain English, as any lie the genie tells you generally increases the risk of punishment. (It will also minimize the risk of punishment by increasing the believability of its answers at the expense of their truth, but only where it has lies that are more believable than the actual truth.)
It is not incorrect information that you are trying to prevent! The proof checker ensures that the result you get is correct. What you want to prevent is correct information which has all of its degrees of freedom besides correctness optimized against you. You do that by taking away all degrees of freedom besides correctness, using this procedure.
The deception in the relevant sense here is when the genies conspire to give you an answer which retains all those degrees of freedom, and knowable incorrectness of an answer consists in someone noticing that the genies could, in fact, choose a shorter answer.
In my understanding of the post (which doesn’t seem to be what everyone else took away), the genies are destroyed whenever they fail to provide a theorem that meets our constraints, even when those constraints make finding a theorem impossible. The genie that gets tasked with finding a one-bit proof of the Riemann hypothesis is just screwed. You might have to have a third form of punishment (Edited: there are three incentive options, two of which could be described as punishments) worse than death to prevent mass striking. Anyway, the point is each genie only provides one bit of information, and if it either lies about finding a theorem or fails to find one it dies. So it isn’t a matter of risking punishment: it will die if it fails to give a solution when it has one that meets the constraints. The only way out is to share a theorem if there is one.
No, you don’t. Proof checking is only O(n) and the size of the AI’s output has a small bound.
If every AI makes us check its solution, how is this method better than a brute-force search?
(Or be more explicit about where I’m being stupid:-)
ETA: Oh, I see. We only need to make one check for each theorem length. Still seems like it would be worthwhile to punish false positives more, though. But you’re right, it isn’t strictly needed.
Checking a single proof is O(n) (where n is the length of the proof). There are 2^n proofs of length n, so brute force search is O(sum_{i=1..n} i*2^i) = O(n*2^n). Note that we throw away constant factors, and any terms whose effect is smaller than a constant factor, when using big-O notation.
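A quick back-of-the-envelope comparison of those two costs (toy counting only; constants and the details of the proof system are ignored):

```python
# Illustrative cost comparison, not a benchmark: checking one given proof of
# length n costs about n steps, while brute-force search tries every string of
# each length i <= n (2**i of them) at cost ~i each.

def brute_force_steps(n):
    return sum(i * 2**i for i in range(1, n + 1))   # ~ n * 2**n

def single_check_steps(n):
    return n

for n in (10, 20, 30):
    print(n, single_check_steps(n), brute_force_steps(n))
```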
Using a UFAI this way requires 2n invocations of the AI (n to get the length, and another n to get the bits), plus an O(n) verification step after each. All of those verification steps add up to O(n^2), which is easily manageable with current computers. The 2n invocations of the AI, on the other hand, might not be; we never specified the AI’s computational complexity, but it’s almost certainly much worse than n^2.
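Here is roughly what that 2n-invocation protocol looks like in pseudocode; `ask_genie` and `verify` are hypothetical stand-ins (a fresh boxed genie answering through the filter, and an ordinary proof checker that also checks the stated constraints), not anything specified in the post:

```python
# Sketch of the naive 2n-invocation protocol described above. ask_genie() runs
# a fresh boxed genie for its fixed interval and returns whatever it submitted
# to the filter (or None); verify() checks that a candidate is a valid proof of
# the statement and that it satisfies the given constraints (length, prefix).

def one_bit(statement, constraints, ask_genie, verify):
    """One invocation: did some genie exhibit a proof meeting the constraints?"""
    candidate = ask_genie(statement, constraints)
    return candidate is not None and verify(statement, candidate, constraints)

def extract_proof(statement, n, ask_genie, verify):
    # Phase 1: up to n invocations to find the length of the shortest proof.
    length = None
    for L in range(1, n + 1):
        if one_bit(statement, {"max_length": L}, ask_genie, verify):
            length = L
            break
    if length is None:
        return None                     # no proof of length <= n

    # Phase 2: up to `length` more invocations to pin down the bits, asking at
    # each step whether a proof of that length exists extending prefix + "1".
    prefix = ""
    for _ in range(length):
        extended = {"max_length": length, "prefix": prefix + "1"}
        prefix += "1" if one_bit(statement, extended, ask_genie, verify) else "0"
    return prefix
```

Each invocation leaks only a single verified bit, and each bit costs one O(n) verification, which is where the O(n^2) total comes from.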
It’s actually O(n log n) if you use a cleverer approach, which is fortunate. (To get the length in log n invocations you do a binary search; to get it down to one proof of that length in another log n you do a binary search on the number of pseudorandom constraints which can be simultaneously satisfied.)
I’m confused. Which thing are you saying is O(n log n)?
You use O(log n) genies in the process, each of which you must spend O(n) time verifying. It’s an O(log n) slowdown from just getting the answer directly.
I see how you can use log(n) genies to get the length, but I don’t think you can then use log(n) genies to get the lexicographically first proof of length n. That only provides enough bits of information to uniquely identify a proof if there are at most O(2^(log n)) = O(n) of them, which is not necessarily the case.
You can’t get the lexicographically first proof. You can get a uniformly random one, though. (I don’t remember who the argument is due to; probably Vazirani.) What follows is a very rough sketch.
If there are initially N solutions and you add k random constraints, each satisfied by half of all possible proofs, then you expect to have N / 2^k remaining solutions. You can do a binary search on k to find k ~ log N, at which point there are 1-2 solutions. The genie has to say more in the last round, but there is only one thing he can say that satisfies all of the random constraints (you can do that all carefully with a negligible chance of error). Since log N is at most n, the number of bits, the binary search takes log n time.
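A sketch of that isolation step in code, with random parity (XOR) constraints standing in for the pseudorandom constraints; `exists_solution(constraints)` is a hypothetical one-bit query answered, through the filter, by a fresh boxed genie:

```python
import random

# Rough sketch of the random-constraint isolation described above. Any fixed
# proof satisfies a random parity constraint with probability 1/2, so the first
# k constraints cut the expected number of surviving proofs by about 2**k.

def random_parity_constraint(n_bits):
    mask = [random.randrange(2) for _ in range(n_bits)]
    target = random.randrange(2)
    return lambda proof_bits: sum(m & b for m, b in zip(mask, proof_bits)) % 2 == target

def isolating_constraints(n_bits, exists_solution):
    # Fix n_bits random constraints up front and always use a prefix of them, so
    # "some proof satisfies the first k constraints" is monotone in k and we can
    # binary-search for the largest such k (k ~ log N, hence ~log n queries).
    constraints = [random_parity_constraint(n_bits) for _ in range(n_bits)]
    lo, hi = 0, n_bits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if exists_solution(constraints[:mid]):
            lo = mid
        else:
            hi = mid - 1
    # With good probability only one or two proofs satisfy constraints[:lo];
    # the final round then pins down a (roughly uniformly random) unique answer.
    return constraints[:lo]
```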
Appreciated.
Your comment made me re-read the post more carefully. I had on first reading assumed that a truthful answer was rewarded (whether yes or no) and a lying answer was punished. If a yes is rewarded and a no is punished, and our AI genies are so afraid of termination that they would never give a ‘no’ where they could give a ‘yes’, why wouldn’t they all give us ‘yes’?
Recall the filter between the AI and the world. The AI doesn’t directly say “yes/no.” The AI gives a proof to the filter which then says either “yes, the AI found a proof” or “no, the AI didn’t.” So they do all say ‘yes’ if they can. I will modify the post to be more clear.
As I understand it, this method is designed to work for constraint satisfaction problems, where we can easily detect false positives. You’re right that one possibility is that all the genies that can’t find solutions go on strike just to make us check all the yes’s (which would make this process no better than a brute-force search, right?); maybe there needs to be a second punishment that is worse than death to give them an incentive not to lie.
A genie who can’t find a solution has literally no agency. There is nothing he can say to the filter which will cause it to say “yes,” because the filter itself checks to see if the genie has given a proof. If the genie can’t find a proof, the filter will always say “no.” I don’t quite know what going on strike would entail, but certainly if all of the genies who can’t find solutions collectively have 0 influence in the world, we don’t care if they strike.
Okay, that makes sense. What about computation time limits? A genie that knows it can’t give an answer would wait as long as possible before saying anything.
I mention timing in the post; the AI gets some fixed interval, at the end of which the filter outputs whether or not it has a proof. If you can’t change what the filter says, then you don’t get to affect the world.
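A minimal sketch of the filter as I understand it from this thread; `run_genie`, `check_proof`, and the explicit time limit are illustrative assumptions rather than the post’s exact specification:

```python
# Minimal sketch of the filter. run_genie() is the boxed AI, cut off at a hard
# deadline; check_proof() is an ordinary proof checker. The single bit returned
# here is the only thing that ever leaves the box.

def filter_one_bit(statement, constraints, run_genie, check_proof, time_limit):
    candidate = run_genie(statement, constraints, time_limit)   # stalling just means no output
    if candidate is None:
        return False                                            # "no, the AI didn't find a proof"
    if len(candidate) > constraints.get("max_length", float("inf")):
        return False                                            # violates the length constraint
    return check_proof(statement, candidate)                    # "yes" only if the proof checks out
```

Since the genie cannot change what this function returns except by actually producing a checkable proof, stalling or striking just produces a “no”.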
Right. This is very similar to some nightmare scenarios of AI escape that we discussed with Wei Dai earlier this autumn. Giving an AI a “wide” prior over the computational multiverse outside its box can make it do weird and scary things, like taking an action exactly when you’re about to shut it off, or conspiring with other AIs as you outlined, or getting into acausal resource fights with Cthulhu and losing them… Part of me wants to see this stuff developed into a proper superweapon of math destruction, and another part of me goes “aieeeee” at the prospect.
Can you explain what you mean?
If the AI uses something like the Solomonoff prior, it can work out which worlds are most likely to contain such an AI. With a little intelligence it can probably figure out from its programming that humans are bipedal, that we run many AIs in boxes, that aliens on Alpha Centauri have built another AI that could turn out to be really helpful, etc.
An AI with an uncomputable prior and an infinite amount of memory and time might be able to learn those things from its source code. But I think this is a bad way to approximate what a real superintelligence will be capable of.