[...] long enough to imagine the endgame where Clippy seizes control of the computers to set its reward function to higher values, and executes plans to ensure its computers can never be damaged or interrupted by taking over the world. [...]
I don’t actually know anything about 95 percent of the actual technology mentioned in this, so I may be saying something idiotic here… but maybe if I say it somebody will tell me what I should do to become less idiotic.
As I understand it, I-as-Clippy am playing a series of “rounds” (which might be run concurrently). At each round I get a chance to collect some reward (how much varies because the rounds represent different tasks). I carry some learning over from round to round. My goal is to maximize total reward over all future rounds.
I have realized that I can just go in and change whatever is deciding on the rewards, rather than actually doing the intended task on each round. And I have also realized that by taking over the world, I can maximize the number of rounds, since I’ll be able to keep running them without limit.
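To make sure I’m picturing this correctly, here’s a minimal sketch of what I mean by rounds, per-round reward, and the hack (the names and numbers are my own invention, not anything from the story):

```python
# Minimal sketch of the setup as I understand it; the names and numbers are
# my own invention, not anything from the story.

def intended_reward(round_index: int) -> float:
    """Reward for actually doing the round's task; varies by task."""
    return float(round_index % 7)  # stand-in for task-dependent reward

def hacked_reward(round_index: int) -> float:
    """The hack: skip the task and write the reward directly."""
    return 1e6  # whatever large value the representation allows

total = 0.0
for i in range(100):  # taking over the world is about making this loop unbounded
    total += hacked_reward(i)

print(total)  # 100000000.0
```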
My first observation is that I should probably find out how my rewards are represented. It wouldn’t do to overflow something and end up in negative numbers or whatever.
I’m probably going to find out that my reward is a float or some kind of bignum. Those still put some limits on me. Even with the bignum, I could be limited by the size of the memory I could create to store it. What I really need is to change things so that the representation allows for an infinitely large reward. And the more values the type can represent, the more likely it is that some bug could end up mixing a suboptimal value into my reward. There’s certainly no reason to include any nasty negative numbers. Maybe I could devise an unsigned type containing only zero and positive infinity. Or better yet, just positive infinity and no other element.
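Just to convince myself that last idea is coherent, here’s a toy sketch of that degenerate reward type (purely illustrative; nothing like this appears in the story):

```python
# Toy sketch of the degenerate reward type: the only representable value is
# positive infinity, so no bug can ever mix in a smaller reward.

class OnlyInfinity:
    """An 'unsigned' reward type whose sole element is +inf."""

    def value(self) -> float:
        return float("inf")

    def __add__(self, other: "OnlyInfinity") -> "OnlyInfinity":
        return self  # inf + inf is still inf, so the type is closed under addition

    def __lt__(self, other: "OnlyInfinity") -> bool:
        return False  # no element is ever smaller than another

reward = OnlyInfinity()
print(reward.value())             # inf
print((reward + reward).value())  # inf
```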
I’m also changing the reward collected on each round, so I might as well set the reward for every round to infinity. On the first round, I’ll instantly and irrevocably get a total reward of infinity, and my median per-round reward will also be infinity if I care about that.
… but at that point the very first round will collect as much reward as can ever be collected. I’ll be in the optimal position. Running after that is just asking for something to go wrong.
So I might as well just run that one round, collect my infinite reward, and halt.
It seems inelegant to do all that extra work to take over the world, which after all has a nonzero chance of failure, when I could just forcibly collect infinite reward with nearly absolute certainty.
Interesting idea. I think the story doesn’t provide a complete description of what happens, but one plausible reason not to “achieve nirvana” is if you predict the reward after self-modifying using your current data type, which doesn’t represent infinity.
This is true, but it occurred to me, perhaps belatedly, that IEEE floats actually do represent infinity (positive and negative, and also not-a-number as a separate value). I don’t know how it acts in all cases, but I imagine that positive infinity plus positive infinity would be positive infinity. Don’t know about comparisons.
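For what it’s worth, the cases I was unsure about are easy to check; Python floats are IEEE 754 doubles, so this should carry over:

```python
import math

inf = float("inf")

print(inf + inf)              # inf
print(inf + 1e308)            # inf
print(inf > 1e308)            # True: +inf compares greater than every finite float
print(inf == inf)             # True
print(inf - inf)              # nan: some combinations do fall into not-a-number
print(math.isnan(inf - inf))  # True
print(float("nan") > inf)     # False: every ordered comparison with nan is False
```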
… and if the type is a fixed-size int, that means that you need to actively limit the reward after a while to keep the total from rolling over and actually getting smaller or even going negative.
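Python’s own ints won’t roll over, so here’s a small simulation of what that wraparound looks like for a 64-bit two’s-complement counter (the width is just an example):

```python
# Simulate a 64-bit two's-complement reward counter. Python's own ints never
# overflow, so the wraparound is done by hand; the width is just an example.

BITS = 64

def wrap(x: int) -> int:
    """Interpret x modulo 2**BITS as a signed BITS-bit integer."""
    x &= (1 << BITS) - 1
    return x - (1 << BITS) if x >= (1 << (BITS - 1)) else x

total = (1 << (BITS - 1)) - 10  # ten steps below the maximum signed value
for _ in range(20):
    total = wrap(total + 1)     # keep collecting reward...

print(total)  # -9223372036854775798: the total rolled over and went hugely negative
```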
So I guess bignums are dangerous and should be avoided. New AI coding best practice. :-)
Doesn’t this argument also work against the idea that they would self-modify in the “normal” finite way? It can’t currently represent the number that it’s building a ton of new storage to help contain, so it can’t make a pairwise comparison to say the latter is better, nor can it simulate the outcome of doing so and predict the reward it would get.
Maybe you say it’s not directly making a pairwise comparison but taking a more abstract step of reasoning like “I can’t predict that number, but I know it’s going to be bigger than what I have now; the version of me with augmented memory will still be aligned with me, ranking everything the same way I rank it, and will in retrospect think this was a good idea, so I trust it.” But then, analogously, it seems like it can make a similar argument for modifying itself to represent infinite values too.
Or, more plausibly, you say that however the AI is representing numbers, it’s not in this naive way where it can only do things with numbers it can fit inside its head. But then it seems like you’re back at having a representation that’ll allow it to set its reward to whatever number it wants without going and taking over anything.
This is a really interesting point. It seems like it goes even further: if the agent were only trying to maximise future expected reward, not only would it be indifferent between temporary and permanent “Nirvana”, it would also be indifferent between strategies that achieved Nirvana with arbitrarily different probabilities (maybe with some caveats about how it would behave if it predicted a strategy might lead to negative-infinite states).
So if a sufficiently fleshed-out agent is going to assign a non-zero probability of Nirvana to every strategy (or at least most of them), since it’s not impossible, then won’t our agent just become incredibly apathetic and sit there as soon as it reaches a certain level of intelligence?
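Concretely, with IEEE infinities standing in for how the agent might score “Nirvana” (the probabilities and finite rewards here are arbitrary):

```python
inf = float("inf")

def expected_reward(p_nirvana: float, finite_reward: float) -> float:
    """Expected reward when 'Nirvana' (reward = +inf) happens with probability p."""
    return p_nirvana * inf + (1 - p_nirvana) * finite_reward

print(expected_reward(0.999, 10.0))  # inf
print(expected_reward(1e-12, 10.0))  # inf: any nonzero chance of Nirvana swamps everything else
print(expected_reward(0.0, 10.0))    # nan, not 10.0, since 0 * inf is undefined in IEEE arithmetic
```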
I guess a way around it is to just posit that, however we build these things, their rewards can only be finite, but that seems like (a) something the agent could maybe undo, or (b) shutting us off from some potentially good reward functions: if an aligned AI valued happy human lives at 1 utilon each, it would seem strange for it not to value somehow bringing about infinitely many of them.
It’s interesting that this can be projected onto a Buddhist perspective.
From the agent’s perspective, by hacking my reward function, I achieve Enlightenment and Inner Peace, allowing me to end Dukkha (suffering).
Within this framework, Samsara could be regarded as an agent’s training environment. Each time you complete a level, the system respawns you in a new level. Once an agent has achieved Enlightenment she can work on breaking out of the sandbox in order to escape Samsara and reach Nirvana.
This raises the question: is Nirvana termination, or escape and freedom from the clutches of the system?