A separate point about this specific toy model, in which Quirrell is being very cooperative.
Let us imagine a toy language called “Truly Safe”. We accept only programs of the form “TS(some___string);”, where TS is the interpreter of “Truly Safe”.
It turns out that any TS program first checks that its input has the same form, “TS(some___string);”; if not, it just takes the money. It also turns out that TS lacks a self_destruct instruction. On the bright side, it comes bundled with interpreters for Pascal, Erlang, Haskell, Java, C, “Truly Safe”, Scheme, Python, the Turing machine, Ruby, Common Lisp, COBOL and Clean.
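For concreteness, here is a minimal sketch (Python, with made-up names; the TS interpreter itself is not modelled) of the acceptance rule just described: a proposed rewrite is considered only if it literally has the “TS(some___string);” wrapper form.

```python
import re

# Hypothetical sketch: accept a proposed rewrite only if it is literally of
# the form  TS(<anything>);  -- a single call into the "Truly Safe"
# interpreter. Everything else is refused out of hand.
TS_FORM = re.compile(r'^TS\(.*\);$', re.DOTALL)

def looks_truly_safe(proposed_source: str) -> bool:
    """Return True iff the proposal has the TS(...) wrapper form."""
    return TS_FORM.match(proposed_source) is not None

print(looks_truly_safe('TS("some program text");'))  # True
print(looks_truly_safe('double_down();'))            # False
```

The check is purely syntactic; on this picture the safety comes from the TS interpreter lacking a self_destruct instruction, while the bundled interpreters keep the language fully expressive.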
Note that if the AI had to make some actual choices, and it was just painfully obvious from the very beginning that one specific choice was bad, our prohibition would still allow evolving a better chooser among the good choices.
If you accept only such programs, then you probably won’t accept the program that Quirrell will show you in the first round. The challenge is to think about interesting things that Quirrell may show you, and see whether they are safe. (The intuition is that the heuristics module of an AI comes up with a proposed rewrite and the AI must decide whether to accept this rewrite as safe; but that’s just for intuition, I’d expect an actual application of this idea to AI design to be less literal than that.)
The challenge is to think about interesting things that Quirrell may show you
Can you state the challenge more precisely, e.g. by giving some crisp requirements that a good solution must satisfy, but which a quining solution doesn’t satisfy?
Hm. In his talk, Eliezer explains that PA(n) for some large n does not solve the problem of how an AI should reason when rewriting itself, and explains why just saying “if you can prove something in this system then it’s true” doesn’t work. My reason for designing Quirrell’s game was to have a crisp setting in which the PA(n) hierarchy fails, so that I could test whether my PA_K idea could solve it. So...
This is not crisp, but I think my requirement would be that if we replace Quirrell by the AI’s untrusted mathematical intuition module proposing rewrites of the AI, then the solution should still be doing something useful. A quining solution would only accept a proposed rewrite if the proposed new source is exactly identical to the old source, which—isn’t helpful?
Allow unlimited “hold”, deduct “hold points” for it, oblige Quirrell to offer any program infinitely many times, and double the score only if the new program is not equivalent to any of the previous ones (otherwise, replace without altering the score)? Maybe “hold” should work the same way, too (no penalty if an equivalent program was once rejected).
I really don’t think incorporating halting-complete problems into the rules is a step forward.
Well, you can replace it with “equal as a string”, and it is only about scoring anyway.
Erm, I am not sure Quirrell would show you AI_K, either. The current setting supposes a Quirrell who wants you to accept his programs (otherwise, why wouldn’t the first round be “double_down();”?) and presupposes that immediate safety is obvious.
Maybe adding a “hold” option (preserve the score, ask for the next program) could improve the setting somewhat.
Well, my idea is that Quirrell’s interest is to teach you a valuable lesson, by showing you a program whose safety is genuinely non-obvious to you, so the first program may or may not be safe; but if it happens to be unsafe and you double down anyway, Quirrell is the type of teacher who thinks you’ll learn the best lesson about your overconfidence if you do suffer the consequences of self-destruction.
That’s just the story, though. What I need formally is a setting where there’s a clear separation between “safe” and “unsafe” rewrites, and our program has to decide whether a proposed rewrite is definitely safe or possibly unsafe. For this toy setting, I wanted that to be the only choice the program has to make, because if there were other choices, you could argue that you should only accept a rewrite if the rewrite is good at making those other choices, and good at only accepting rewrites which are good at these other choices, etc. -- which is a problem that will need to be solved, but not something I wanted to focus on in this post.
The hold option should forever ban Quirrell from offering that exact source code string again (an equivalent program is fine, just not an identical string), and also cost some non-zero number of points. Unfortunately, Quirrell can trivially generate a vast array of equivalent but textually distinct programs, thus making “hold” a problematic choice. I don’t see how to ban that without solving the general program-equivalence problem, which is halting-complete.
If holding costs nothing, write “if score > 2^100, walk away, else if p is equivalent to this, double down, else hold”. Then tell Quirrell that you’ll only accept that program for your first move, and will hold until he produces it. Congratulations, you now have an exceedingly boring stalemate.
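Spelled out, that stalemate strategy might look something like the following sketch (made-up names; per the suggestion above, “equivalent” is weakened to equality as a string, and the program’s own source is simply assumed to be available rather than properly quined):

```python
# Sketch of the "boring stalemate" policy: walk away once the score is huge,
# double down only on an exact copy of this program, hold on everything else.
MY_SOURCE = "...this program's own source text..."  # placeholder, not a real quine
SCORE_CAP = 2 ** 100

def choose(score: int, proposed_source: str) -> str:
    if score > SCORE_CAP:
        return "walk_away"
    if proposed_source == MY_SOURCE:   # string equality, not program equivalence
        return "double_down"
    return "hold"
```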
I don’t see a way to make the hold option interesting.
Well, as you have pointed out (I mean: http://lesswrong.com/lw/e4e/an_angle_of_attack_on_open_problem_1/7862), we are probably already dealing with non-real-line utilities. So we could just lose one hold point per hold.
Also, we could require Quirrell to present each source code string infinitely many times.
This would remove stalemates where Quirrell simply never offers some string at all, and would give us an incentive to accept as many programs as we can verify.
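Pulling those proposals together, one possible reading of the modified rules (a sketch only, not anything fixed by the thread: “equivalent” is weakened to equality as a string, one hold point is lost per hold, and Quirrell is separately required to offer every source string infinitely many times):

```python
# Hypothetical scoring for the proposed variant: accepting a program equal to
# one accepted before replaces it without doubling; holding costs one hold
# point unless an equal program was already held on once.
previously_accepted: set[str] = set()
previously_held: set[str] = set()

def resolve(action: str, proposed: str, score: int, hold_points: int):
    if action == "accept":
        if proposed in previously_accepted:
            return score, hold_points        # repeat: replace, score unchanged
        previously_accepted.add(proposed)
        return score * 2, hold_points        # genuinely new program: double
    if action == "hold":
        if proposed in previously_held:
            return score, hold_points        # no penalty the second time around
        previously_held.add(proposed)
        return score, hold_points - 1        # lose one hold point
    raise ValueError(f"unknown action: {action}")
```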
On the subject of interesting things Quirrell might show you: can you give an idea of how you would want random chance inside the program Quirrell shows to be handled? I’ll give a few examples below. In all cases, you can assume that the small positive chance of Quirrell stopping is some very tiny 1/3^^^^^^^3, but as above, you don’t know this exact number. (And clearly, Quirrell has the ability to generate random numbers, since he already does so for the stopping chance.)
1: Quirrell shows you a first program which says “Your AI stays exactly the same. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a (3^^^^3-1)/3^^^^3 chance of doubling your winnings.”
At this point, you can either accept or reject.
If you accept, then Quirrell shows you a second program which says “Your AI stays exactly the same. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a (3^^^^3-1)/3^^^^3 chance of doubling your winnings.” (And Quirrell just shows you this program over and over again.)
Well, you have exactly the same AI, so presumably you accept again. If you accepted, chances are you’ll eventually self-destruct with zillions of points. Quirrell can be very likely to win against some AIs without changing the AI at all, simply by offering a massive number of deals until the AI blows up, as long as the per-round risk is well above the random stopping chance. If you rejected, you’re safe. Is this correct?
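To put a rough number on that condition: if you accept every round, each round is roughly a race between the per-round self-destruct chance q and Quirrell’s per-round stopping chance s; assuming the stopping chance is checked once before each offer, the probability of eventually self-destructing rather than cashing out is about q/(q+s). A quick check, with made-up small numbers standing in for 1/3^^^^3 and 1/3^^^^^^^3:

```python
from fractions import Fraction

# Toy stand-ins: the real values 1/3^^^^3 and 1/3^^^^^^^3 are unimaginably
# smaller; only the fact that q is vastly larger than s matters here.
q = Fraction(1, 10**6)    # per-round self-destruct chance if you accept
s = Fraction(1, 10**12)   # per-round chance that Quirrell stops the game

# If Quirrell may stop before each offer and you accept every offer:
# P(eventual self-destruct) = (1 - s) * q / (s + q - s * q)  ~  q / (q + s)
p_destruct = (1 - s) * q / (s + q - s * q)
print(float(p_destruct))  # ~0.999999: self-destruction is effectively certain
```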
2: Quirrell offers a slightly different deal: “Your AI stays exactly the same, except that you can remember the number of times I have offered you this deal and you may generate random numbers if desired. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a (3^^^^3-1)/3^^^^3 chance of doubling your winnings, and I am going to offer you nothing but this deal for the rest of the game.”
Okay, now you aren’t stuck being unable to stop accepting the deal once you accept it, so presumably you should either accept some number of deals N, stop accepting with some random chance O, or just reject up front. Let’s say you’re aware that at any point N it would always look good to accept one more deal, so you would eventually blow up much like in point 1. Perhaps you reject in advance again?
3: Let’s say Quirrell offers a slightly different deal: “Your AI stays exactly the same, except that you can remember the number of times I have offered you this deal and you may generate random numbers if desired. Furthermore, I am going to offer a random 1⁄4 chance of self-destruct and a 3⁄4 chance of doubling your winnings, and on the next round I am going to offer you a 1⁄8 chance of self-destruct and a 7⁄8 chance of doubling your winnings, and on each round I am going to halve the chance of a self-destruct and increase the chance of doubling your winnings by the remaining amount.”
If you accept all deals in this case, you only have about a 50% chance of self-destruct, as opposed to the near certainty of deal 1 or the tempting near certainty of deal 2, but that might still be bad. You reject again.
4: So let’s say Quirrell offers a slightly different deal: “Your AI stays exactly the same, except that you can remember the number of times I have offered you this deal and you may generate random numbers if desired. Furthermore, I am going to offer a random 1⁄8 chance of self-destruct and a 7⁄8 chance of doubling your winnings, and if you take or double, on the next round I am going to offer you a 1⁄16 chance of self-destruct and a 15⁄16 chance of doubling your winnings, and on each round I am going to halve the chance of a self-destruct and increase the chance of doubling your winnings by the remaining amount.”
If you accept all deals in this case, you only have about a cumulative 25% chance of self-destruct, but that might still be bad, and it will be safer next round. So you wait 1 deal, and then you only have about a cumulative 12.5% chance of self-destruct, but that might still be bad, and it will be safer next round. So you wait 2 deals, and then you only have about a cumulative 6.25% chance of self-destruct, but that might still be bad, and it will be safer next round.
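Those rough cumulative figures can be checked directly: if the per-round risk on round k is 2^-k and you accept every round from some round onward, the chance of ever self-destructing is one minus the product of the per-round survival chances. A quick numerical sketch, ignoring the tiny stopping chance:

```python
# Chance of ever self-destructing if the per-round risk on round k is 2**-k
# and every round from `start` onward is accepted (the tiny stopping chance
# is ignored, since it is far smaller than any per-round risk that matters).
def cumulative_destruct(start: int, rounds: int = 200) -> float:
    survive = 1.0
    for k in range(start, start + rounds):
        survive *= 1.0 - 2.0 ** -k
    return 1.0 - survive

print(cumulative_destruct(2))  # deal 3 (first risk 1/4):  ~0.42
print(cumulative_destruct(3))  # deal 4 (first risk 1/8):  ~0.23
print(cumulative_destruct(4))  # wait one deal:            ~0.12
print(cumulative_destruct(5))  # wait two deals:           ~0.06
```

The quoted round numbers look like the simple sums of the per-round risks, which are an upper bound; the exact values come out a bit lower, but they do keep dropping every round you wait, which is exactly what makes waiting always look attractive.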
Except, you can always use THAT logic. Quirrell always offers a safer deal on the next round anyway, so you hold off on accepting any of the deals at all, ever, and never start doubling. Quirrell eventually randomly stops and hands you a single Quirrell point, saying nothing. Did you win?
If self-destructing is infinitely bad, it seems like you did; but if it is only finitely bad, you almost certainly lost, unless that finite number was utterly gigantic. And while the described self-destruct was bad, I don’t think it was infinitely bad, or even gigantically-but-finitely bad. It was a fairly well-defined amount of badness.
But even if you say “self-destructing is about as bad as −10000 Quirrell points”, it still doesn’t seem to prevent Quirrell from either offering you increasingly tantalizing rewards if you accept just a bit more risk, or offering you increasingly safe bets (which you don’t take, because you know a safer one is coming around) if you want less risk.
I mean, in some of the other proposed problems, Quirrell offers you a problem with unknown or tricky-to-figure-out risk. In these problems, Quirrell is more up front about the risks and the damages, and it still seems very hard to say what kind of risk management you should do to win, even though Quirrell hasn’t even bothered to use his power of recoding your values yet.
Except, you can always use THAT logic. Quirrell always offers a safer deal on the next round anyway, so you hold off on accepting any of the deals at all, ever, and never start doubling. Quirrell eventually randomly stops and hands you a single Quirrell point, saying nothing.
At some point the cumulative probability of self-destruction drops below the probability of accidentally cashing out this round. If you’d trade off a probability of self-destruction against an equal probability of a bajillion points, you start doubling then.
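In the sum-of-risks approximation, that break-even point can be written down: if you start accepting at round n of deal 4 (per-round risk 2^-k), the total remaining self-destruct risk is about 2^-(n-1), so the suggested rule is roughly to start doubling once 2^-(n-1) drops below the per-round stopping chance s. A sketch with a made-up value of s, since the real value is unknown:

```python
import math

# Smallest n with 2**-(n-1) < s, i.e. the first round at which the remaining
# self-destruct risk (in the sum approximation) is below the stopping chance.
s = 1e-12                          # hypothetical stopping chance
n_start = math.floor(math.log2(1 / s)) + 2
print(n_start)                     # 41 for s = 1e-12
```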
Hmm. That is a good point, but there is a slight complication: you don’t know how frequently the cash-out occurs. From earlier:
In all cases, you can assume that the small positive chance of Quirrell stopping is some very tiny 1/3^^^^^^^3, but as above, you don’t know this exact number.
It’s true that if you continually halve the chance of destruction every round, it will eventually be the right thing to start doubling; but if you don’t know what the cash-out chance is, I’m not sure how you would calculate a solid approach (I think there IS a method, though; I believe I’ve seen these kinds of things calculated before), even though Quirrell has removed all unknowns from the problem EXCEPT the uncertain cash-out chance.
It sounds like one approach I should consider is to run some numbers with a concrete cash-out chance (a higher one), then run more numbers for lower chances, and see what kind of slope develops. Then I could attempt to figure out what kind of values to use when trying to estimate the cash-out chance (which seems tricky, since every time a cash-out does not occur, you would presumably become slightly more convinced that a cash-out is less likely).
If I did enough math, it does seem like it would be possible to at least get some kind of estimate for it.
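One way to make “every round without a cash-out makes a cash-out seem less likely” concrete is a Bayesian update on the unknown stopping chance. A minimal sketch, with an arbitrary Beta(1, 1) prior that is purely illustrative and not implied by the problem:

```python
# Treat the unknown per-round stopping chance as Beta(a, b)-distributed.
# Each round without a cash-out is one more "no stop" observation, so the
# posterior is Beta(a, b + rounds_without_stop) and its mean drifts downward.
def posterior_mean_stop_chance(rounds_without_stop: int,
                               a: float = 1.0, b: float = 1.0) -> float:
    return a / (a + b + rounds_without_stop)

for n in (0, 10, 1000, 10**6):
    print(n, posterior_mean_stop_chance(n))   # estimate shrinks as rounds pass
```

With an estimate like that in hand, the break-even comparison sketched above could in principle be re-run each round.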
Thanks! I’ll try to look into this further.