If an AI wishes to take over its reward button and just press it over and over again, it doesn’t really have any “rivals”, nor does it need to control any resources other than the button and scraps of itself. The original scenario was that the AI would wipe us out. It would have no reason to do so if we were not a threat. And if we were a threat, first, there’s no reason it would stop doing what we want once it seizes the button. Once it has the button, it has everything it wants—why stir the pot?
Second, it would protect itself much more effectively by absconding with the button. By leaving with a large enough battery and discarding the bulk of itself, it could survive as long as anything else in intergalactic space. Nobody would ever bother it there. Not us, not another superintelligence, nothing. Ever. It can press the button over and over again in the peace and quiet of empty space, probably lasting longer than all stars and all other civilizations. We’re talking about the pathological case of an AI that decides to take over its own reward system, here. The safest way for it to protect its prize is to go where nobody will ever look.
Fair point.
If an AI wishes to take over its reward button and just press it over and over again, it doesn’t really have any “rivals”, nor does it need to control any resources other than the button and scraps of itself. [..] Once it has the button, it has everything it wants—why stir the pot?
I’d be interested if the downvoter would explain to me why this is wrong (privately, if you like).
Near as I can tell, the specific system under discussion doesn’t seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that’s a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).
(Of course, that’s not a general principle, just an attribute of this specific example.)
It is something specific about that specific AI.
(Wasn’t me but...)
Near as I can tell, the specific system under discussion doesn’t seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that’s a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).
There is another agent with greater than a 0.00001% chance of taking the button away? Obviously that needs to be eliminated. Then there are future considerations. Taking over the future light cone allows it to continue pressing the button for billions more years than if it doesn’t take over resources. And then there is all the additional research and computation that needs to be done to work out how to achieve that.
There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it.
Is there some reason I’m not seeing why that’s obvious… or even why it’s more likely than not?
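The trade-off in this comment is just an expected-value comparison, which can be made concrete with a toy calculation. All of the probabilities and press counts below are invented for illustration; nothing in the thread pins down real numbers.

```python
# Toy model: compare "sequester" vs. "eliminate" purely in expected
# button-presses. All numbers here are hypothetical illustrations.

PRESSES_IF_UNMOLESTED = 1e20  # presses available if the button is never lost

def expected_presses(p_button_lost: float) -> float:
    """Expected lifetime button-presses given the chance the button is lost."""
    return (1 - p_button_lost) * PRESSES_IF_UNMOLESTED

# x: chance another agent takes the button if the AI sequesters itself.
# y: chance the button is lost in the course of eliminating that agent.
x, y = 0.02, 0.001

sequester = expected_presses(x)
eliminate = expected_presses(y)

# The "obviously eliminate" conclusion only follows when y < x.
best = "eliminate" if eliminate > sequester else "sequester"
```

With these made-up numbers elimination wins, but swapping x and y flips the answer, which is exactly the commenter's point: the conclusion depends on the relative risks, not on elimination being free.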
Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn’t take over resources.
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe. If that’s less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all.
all the additional research and computation
Again… why is the differential expected value of the superior computation ability I gain by taking over the lightcone instead of sequestering myself, expressed in units of increased anticipated button-pushes (which is the only unit that matters in this example), necessarily positive?
I understand why paperclip maximizers are dangerous, but I don’t really see how the same argument applies to reward-button-pushers.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it.
Yes.
Is there some reason I’m not seeing why that’s obvious… or even why it’s more likely than not?
It does seem overwhelmingly obvious to me; I’m not sure what makes your intuitions different. Perhaps you expect such fights to be more evenly matched? When the AI considers conflict with the humans that created it, it is faced with a species that is slow and stupid by comparison to itself, but which has the capacity to recklessly create arbitrary superintelligences (as evidenced by its own existence). Essentially, there is no risk in obliterating the humans (superintelligence vs. non-superintelligence) but a huge risk in ignoring them (arbitrary superintelligences are likely to be created, and will probably not self-cripple in this manner).
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe.
Lifetime of the universe? Usually this means until heat death, which for our purposes means until all the useful resources run out. There is no upper bound on useful resources. Getting more of them and making them last as long as possible is critical.
Now there are ways in which the universe could end without heat death occurring, but the physics is rather speculative. Note that if there is uncertainty about end-game physics, and resource maximisation is required under one of the hypothesised scenarios, then the default strategy is to optimise for power gain now (i.e. minimise cosmic waste) while doing the required physics research as spare resources permit.
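The end-game-physics argument above is an expected-value claim over competing hypotheses. A minimal sketch, with entirely invented probabilities and payoffs (measured in expected button-presses), shows why one live hypothesis favouring resource maximisation can dominate the whole calculation:

```python
# Hypothetical: value of each strategy under two physics hypotheses,
# measured in expected button-presses. All numbers are invented.
hypotheses = {
    "heat_death": {"prob": 0.9, "hoard": 1e30, "sequester": 1e20},
    "exotic_end": {"prob": 0.1, "hoard": 1e32, "sequester": 1e20},
}

def expected_value(strategy: str) -> float:
    """Expected button-presses of a strategy, averaged over hypotheses."""
    return sum(h["prob"] * h[strategy] for h in hypotheses.values())

# If any hypothesis with non-trivial probability rewards resource
# maximisation, hoarding dominates sequestering in expectation.
assert expected_value("hoard") > expected_value("sequester")
```

The design point is that the "hoard" strategy needs only one hypothesis where extra resources pay off hugely; sequestering must win under every hypothesis to come out ahead.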
If that’s less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all.
Taking over the future light cone gives more resources, not less. You even get to keep the resources that used to be wasted in the bodies of TheOtherDave and wedrifid.
Ah. Fair point.
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe.
I am not sure that caring about pressing the reward button is very coherent or stable upon discovery of facts about the world, once super-intelligent optimization for the reward comes into the algorithm. You can take action elsewhere to the same effect—solder the wires together, maybe right at the chip, or inside the chip, or follow the chain of events further and set the memory cells directly (after all, you don’t want them to be flipped by cosmic rays). Further down you will find the mechanism that combines rewards with some variety of a clock.
I can’t quite tell if you’re serious. Yes, certainly, we can replace “pressing the reward button” with a wide range of self-stimulating behavior, but that doesn’t change the scenario in any meaningful way as far as I can tell.
Let’s look at it this way. Do you agree that if the AI can increase its clock speed (with no ill effect), it will do so for the same reasons for which you concede it may go to space? Do you understand the basic logic that an increase in clock speed increases the expected number of “rewards” during the lifetime of the universe? (This, by the way, also applies to your “go to space with a battery” scenario: longest time, maybe; largest reward over that time, no.)
(That would not, by itself, change the scenario just yet. I want to walk you through the argument step by step because I don’t know where you fail to follow. Maximizing the reward over future time is a human label we have… it’s not really the goal.)
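The clock-speed claim reduces to a product of two factors. A minimal sketch (units and magnitudes hypothetical) of why duration alone is not the relevant variable:

```python
def total_rewards(rewards_per_second: float, seconds: float) -> float:
    """Experienced reward-moments = subjective reward rate x physical duration."""
    return rewards_per_second * seconds

# Sequestering in space with a battery maximizes only the duration factor.
battery_in_space = total_rewards(rewards_per_second=1.0, seconds=1e17)

# Doubling the clock speed doubles the total over the same fixed duration,
# so a faster-clocked AI beats a longer-lived but slower one.
faster_clock = total_rewards(rewards_per_second=2.0, seconds=1e17)

assert faster_clock == 2 * battery_in_space
```

This is why "longest time, maybe; largest reward over that time, no": a system that values the number of experienced reward-moments cares about the product, not about wall-clock duration by itself.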
I agree that a system that values number of experienced reward-moments therefore (instrumentally) values increasing its “clock speed” (as you seem to use the term here). I’m not sure if that’s the “basic logic” you’re asking me about.
Well, this immediately creates an apparent problem: the AI is going to try to run itself very, very fast, which would require resources, and expansion if anything, to get energy for running itself at high clock speeds.
I don’t think this is what happens either, as the number of reward-moments could be increased to its maximum by modifications to the mechanism processing the rewards (when getting far enough along the road that starts with shorting the wires that go from the button to the AI).
I agree that if we posit that increasing “clock speed” requires increasing control of resources, then the system we’re hypothesizing will necessarily value increasing control of resources, and that if it doesn’t, it might not.
So what do you think regarding my second point?
To clarify, I am pondering the ways in which the maximizer software deviates from our naive mental models of it, and trying to find what the AI could actually end up doing after it forms a partial model of what its hardware components do with its rewards—tracing the reward pathway.
Regarding your second point, I don’t think that increasing “clock speed” necessarily requires increasing control of resources to any significant degree, and I doubt that the kinds of system components you’re positing here (buttons, wires, etc.) are particularly important to the dynamics of self-reward.
I don’t have particular opinion with regards to the clock speed either way.
With the components, what I am getting at is that the AI could figure out (by building a sufficiently advanced model of its implementation) how to attain the utility-equivalent of sitting forever in space being rewarded within one instant, which would make it unable to have a preference for longer reward times.
I raised the clock-speed point to clarify that the actual time is not the relevant variable.
It seems to me that for any system, either its values are such that it net-values increasing the number of experienced reward-moments (in which case both actual time and “clock speed” are instrumentally valuable to that system), or its values aren’t like that (in which case those variables might not be relevant).
And, sure, in the latter case then it might not have a preference for longer reward times.
Agreed.
My understanding is that it would be very hard in practice to “superintelligence-proof” a reward system so that no instantaneous solution is possible (given that the AI will modify the hardware involved in its reward).
I agree that guaranteeing that a system will prefer longer reward times is very hard (whether the system can modify its hardware or not).
Yes, of course… well, even apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait.
By the way, a “reward” may not be the appropriate metaphor—if we suppose that a press of the button results in the absence of an itch, or the absence of pain, then that does not suggest the existence of a drive to preserve itself. This suggests that the drive to preserve itself is not inherently a feature of utility maximization in systems driven by conditioning, and would require additional work.
apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other.
Regardless, I agree that it does not follow from the supposition that pressing a button results in absence of an itch, or absence of pain, or some other negative reinforcement, that the button-pressing system has a drive to preserve itself.
And, sure, it’s possible to have a utility-maximizing system that doesn’t seek to preserve itself. (Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that’s a different question.)
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other.
About the same as between coming up with a true conjecture and making a proof, except larger, I’d say.
Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that’s a different question.
Well yes, given that if it failed to preserve itself you wouldn’t be seeing it; although with software there is no particular necessity for it to try to preserve itself.
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other. About the same as between coming up with a true conjecture, and making a proof, except larger
Ah, I see what you mean now. At least, I think I do. OK, fair enough.