I’m slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be no human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at “I have no mouth, and I must scream”. So any sign-flipping error would be expected to land there.
It’s hard to talk in specifics because my knowledge on the details of what future AGI architecture might look like is, of course, extremely limited.
As an almost entirely inapplicable analogy (which nonetheless still conveys my thinking here): consider the sorting algorithm for the comments on this post. If we flipped the “top-scoring” sorting algorithm to sort in the wrong direction, we would see the worst-rated posts on top, which would correspond to a hyperexistential disaster. However, if we instead flipped the effect that an upvote had on the score of a comment to negative values, it would sort comments which had no votes other than the default vote assigned on posting the comment to the top. This corresponds to paperclipping- it’s not minimizing the intended function, it’s just doing something weird.
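To make the analogy concrete, here is a minimal sketch in Python. The `Comment` class, the `DEFAULT_VOTE` constant, and the vote counts are all made up for illustration; this is not how the site actually scores comments.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    name: str
    upvotes: int
    downvotes: int


DEFAULT_VOTE = 1  # assumption: every comment starts with its author's own vote


def score(comment, upvote_weight=1):
    """Score of a comment; upvote_weight=-1 models the flipped upvote effect."""
    return DEFAULT_VOTE + upvote_weight * comment.upvotes - comment.downvotes


comments = [
    Comment("popular", upvotes=40, downvotes=2),
    Comment("unvoted", upvotes=0, downvotes=0),
    Comment("disliked", upvotes=1, downvotes=30),
]


def ranking(key, reverse=True):
    """Names of the comments, sorted by the given key (best first by default)."""
    return [c.name for c in sorted(comments, key=key, reverse=reverse)]


intended = ranking(score)                                       # correct sort
flipped_sort = ranking(score, reverse=False)                    # sort direction flipped
flipped_upvote = ranking(lambda c: score(c, upvote_weight=-1))  # upvote effect flipped

print(intended[0], flipped_sort[0], flipped_upvote[0])
```

With these numbers, flipping the sort direction puts the worst-rated comment on top, while flipping the upvote effect puts the comment with no votes on top: a different kind of failure, not the mirror image of the intended ranking.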
If we inverted the utility function, this would (unless we take specific measures to combat it, like the ones you’re mentioning) lead to hyperexistential disaster. However, suppose we instead inverted some constant which is meant to provide initial value for exploring new strategies while the AI is not yet intelligent enough to properly explore new strategies as an instrumental goal. The AI would effectively brick itself: it would place negative value on exploring new strategies, presumably including strategies which involve fixing this issue so it can acquire more utility, and strategies which involve preventing the humans from turning it off. Or suppose we had some code intended to stop the AI from turning off the evolution of the reward model before the AI values keeping it running for other reasons (e.g. once the reward model begins to properly model how humans don’t want the AI to turn the reward model evolution process off), and some crucial sign was flipped so that it did the opposite. The AI would freeze the process of the reward model being updated and then maximize whatever inane nonsense its model currently represented. It would eventually run into some bizarre, previously unconsidered and thus not appropriately penalized strategy comparable to tiling the universe with smiley faces, i.e. paperclipping.
These are really crude examples, but I think the argument is still valid. Also, this argument doesn’t address the core concern of “What about the things which DO result in hyperexistential disaster?”; it just establishes that much of the class of mistakes you may have previously thought usually or always resulted in hyperexistential disaster (sign flips at critical points in the software) in fact usually causes paperclipping or the AI bricking itself.
If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be no human utility (i.e. paperclips).
Can you clarify what you mean by this? Also, I get what you’re going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.
Perhaps there’ll be a reward function/model intentionally designed to disvalue some arbitrary “surrogate” thing in an attempt to separate it from hyperexistential risk. So “pessimizing the target metric” would look more like paperclipping than torture. But I’m unsure as to (1) whether the AGI’s developers would actually bother to implement it, and (2) whether it’d actually work in this sort of scenario.
I sure hope that future AGI developers can be bothered to embrace safe design!
Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn’t designed to be separated in design space from AM, someone could screw up with the model somehow.
The reward modelling system would need to be very carefully engineered, definitely.
If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer’s Arbital post that I linked), a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.
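A toy illustration of the difference between the two flips, in Python. The four outcomes and their (V, W) values are invented for the example; only the structure U = V + W comes from the discussion above.

```python
# Toy outcomes scored as (V, W). All numbers are made up for illustration.
# V: value assigned by the reward model.
# W: term for an arbitrary surrogate outcome the agent is built to disvalue strongly.
OUTCOMES = {
    "flourishing": (10.0, 0.0),
    "status quo": (0.0, 0.0),
    "torture": (-100.0, 0.0),
    "surrogate": (0.0, -1000.0),
}


def pick(utility):
    """Outcome that a perfect maximiser of `utility` would choose."""
    return max(OUTCOMES, key=lambda name: utility(*OUTCOMES[name]))


intended = pick(lambda v, w: v + w)   # U = V + W
flip_u = pick(lambda v, w: -(v + w))  # sign flip on the whole of U
flip_v = pick(lambda v, w: -v + w)    # sign flip on V only

print(intended, flip_u, flip_v)
```

In this sketch a flip of U sends the maximiser to the surrogate outcome, as intended by the design, but a flip confined to V leaves W untouched and the maximiser lands on the worst outcome according to V.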
I thought this as well when I read the post. I’m sure there’s something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.
I think this is somewhat likely to be the case, but I’m not sure that I’m confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevented through error-correcting code memory (as you mentioned earlier).
Sorry, I meant to convey that this was a feature we’re going to want to ensure that future AGI efforts display, not some feature which I have some other independent reason to believe would be displayed. It was an extension of the thought that “Our method will, ideally, be terrorist proof.”
As an almost entirely inapplicable analogy . . . it’s just doing something weird.
If we inverted the utility function . . . tiling the universe with smiley faces, i.e. paperclipping.
Interesting analogy. I can see what you’re saying, and I guess it depends on what specifically gets flipped. I’m unsure about the second example; something like exploring new strategies doesn’t seem like something an AGI would terminally value. It’s instrumental to optimising the reward function/model, but I can’t see it getting flipped *with* the reward function/model.
Can you clarify what you mean by this? Also, I get what you’re going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.
My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren’t any humans), whereas the latter may produce a negligible amount. I’m not really sure if it makes sense tbh.
The reward modelling system would need to be very carefully engineered, definitely.
Even if we engineered it carefully, that doesn’t rule out screw-ups. We need robust failsafe measures *just in case*, imo.
I thought of this as well when I read the post. I’m sure there’s something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.
I wonder if you could feasibly make it a part of the reward model. Perhaps you could train the reward model itself to disvalue something arbitrary (like paperclips) even more than torture, which would hopefully mitigate it. You’d still need to balance it in a way such that the system won’t spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn’t seem too difficult. Although, once again, we can’t really have high confidence (>90%) that the AGI developers are going to think to implement something like this.
There was also an interesting idea I found in a Facebook post about this type of thing that got linked somewhere (can’t remember where). Stuart Armstrong suggested that a utility function could be designed as such:
Let B1 and B2 be excellent, bestest outcomes. Define U(B1)=1, U(B2)=-1, and U=0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes. Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0,1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
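Here is a small Python sketch of the second construction in the quote (the XU one), assuming the agent can freely choose both the world and the value of X. The three worlds and their U values are hypothetical.

```python
from itertools import product

# Hypothetical worlds with utilities in [0, 1].
WORLDS = {"great": 1.0, "mediocre": 0.4, "terrible": 0.0}

# The agent picks a world and also sets the trivial feature X to -1 or 1.
OPTIONS = list(product(WORLDS, (-1, 1)))


def xu(option):
    """The product X * U for a (world, X) pair."""
    world, x = option
    return x * WORLDS[world]


maximiser_choice = max(OPTIONS, key=xu)  # agent working as intended
minimiser_choice = min(OPTIONS, key=xu)  # sign-flipped agent

print(maximiser_choice, minimiser_choice)
```

Both agents end up in the same world and differ only in X: the maximiser sets X = 1, the minimiser sets X = −1. This only shows the idealised case where the flip applies to the whole product XU.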
Even if we solve any issues with these (and actually bother to implement them), there’s still the risk of an error like this happening in a localised part of the reward function such that *only* the part specifying something bad gets flipped, although I’m a little confused about this one. It could very well be the case that the system’s complex enough that there isn’t just one bit indicating whether “pain” or “suffering” is good or bad. And we’d presumably (hopefully) have checksums and whatever else thrown in. Maybe this could be mitigated by assigning more positive utility to good outcomes than negative utility to bad outcomes? (I’m probably speaking out of my rear end on this one.)
Memory corruption seems to be another issue. Perhaps if we have more than one measure we’d be less vulnerable to memory corruption. Like, if we designed an AGI with a reward model that disvalues two arbitrary things rather than just one, and memory corruption screwed with *both* measures, then something probably just went *very* wrong in the AGI and it probably won’t be able to optimise for suffering anyway.
Interesting analogy. I can see what you’re saying, and I guess it depends on what specifically gets flipped. I’m unsure about the second example; something like exploring new strategies doesn’t seem like something an AGI would terminally value. It’s instrumental to optimising the reward function/model, but I can’t see it getting flipped with the reward function/model.
Sorry, I meant instrumentally value. Typo. Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maxima. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model- I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren’t any humans), whereas the latter may produce a negligible amount. I’m not really sure if it makes sense tbh.
Paperclipping seems to be negative utility, not approximately 0 utility. It involves all the humans being killed and our beautiful universe being ruined. I guess if there are no humans, there’s no utility in some sense, but human values don’t actually seem to work that way. I rate universes where humans never existed at all and
I’m… not sure what 0 utility would look like. It’s within the range of experiences that people have on modern-day Earth, somewhere between my current experience and being tortured. This is just a definitional problem, though: we could shift the scale such that paperclipping is zero utility, but in that case, we could also just make an AGI that has a minimum at paperclipping levels of utility.
Even if we engineered it carefully, that doesn’t rule out screw-ups. We need robust failsafe measures just in case, imo.
In the context of AI safety, I think “robust failsafe measures just in case” is part of “careful engineering”. So, we agree!
You’d still need to balance it in a way such that the system won’t spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn’t seem too difficult.
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post on this containing my thoughts sometime in the next week or two, and link it to you.
Paperclipping seems to be negative utility, not approximately 0 utility.
My thinking was that an AI system that *only* takes values between 0 and + ∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.
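A toy version of what I mean, in Python. The welfare numbers are invented, and `positive_utility` is just my assumed stand-in for a utility function that can never go below zero.

```python
# Hypothetical per-person welfare in three outcomes (numbers are made up).
OUTCOMES = {
    "flourishing": [0.9] * 5,
    "torture": [0.01] * 5,  # humans still exist, with negligible positive utility
    "paperclips": [],       # no humans at all
}


def positive_utility(welfare):
    """Strictly non-negative utility: a sum of per-person welfare, floored at 0."""
    return sum(max(0.0, w) for w in welfare)


intended = max(OUTCOMES, key=lambda name: positive_utility(OUTCOMES[name]))
sign_flipped = min(OUTCOMES, key=lambda name: positive_utility(OUTCOMES[name]))

print(intended, sign_flipped)
```

Under this (contestable) scoring, the minimiser prefers the outcome with no humans, because an empty sum is exactly 0 while the torture outcome still scores slightly above it.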
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
How come? It doesn’t seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post containing my thoughts sometime in the next week or two, and link it to you.
Please do! I’d love to see a longer discussion on this type of thing.
EDIT: just thought some more about this and want to clear something up:
Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maxima. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model- I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
I’m a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:
Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.
So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
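One possible reading of the quoted incident, as a toy in Python. This is not OpenAI’s training code: the candidate outputs, their ratings and KL values, and the `BETA` weight are all made up, and I’m assuming the objective has the form “reward minus a KL penalty” with the first flip hitting the whole objective.

```python
BETA = 1.0  # assumed weight of the KL penalty

# Hypothetical candidate behaviours as (human rating r, KL from the original model).
CANDIDATES = {
    "pleasant and fluent": (1.0, 0.1),
    "gibberish": (-0.2, 5.0),
    "fluent but maximally bad": (-1.0, 0.2),
}


def best(objective):
    """Candidate that an optimiser of `objective` would converge on."""
    return max(CANDIDATES, key=lambda name: objective(*CANDIDATES[name]))


intended = best(lambda r, kl: r - BETA * kl)              # reward minus KL penalty
objective_flipped = best(lambda r, kl: -(r - BETA * kl))  # one flip: KL now rewarded
both_flipped = best(lambda r, kl: -r - BETA * kl)         # second flip restores the penalty

print(intended, objective_flipped, both_flipped)
```

In this sketch a single flip rewards drifting away from natural language and produces gibberish, whereas the double flip keeps the fluency constraint and optimises for the lowest-rated output.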
So it definitely seems plausible for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
I didn’t mean to imply that a signflipped AGI would not instrumentally explore.
I’m saying that, well… modern machine learning systems often get specific bonus utility for exploring, because it’s hard to explore the proper amount as an instrumental goal due to the difficulties of fully modelling the situation, and because systems which don’t have this bonus will often get stuck in local maxima.
Humans exhibit this property too. We treat investigating things, acquiring new information, and building useful strategic models as terminal goals: we are “curious”.
This is a feature we might see in early stages of modern attempts at full AGI, for similar reasons to why modern machine learning systems and humans exhibit this same behavior.
Presumably such features would be built to uninstall themselves after the AGI reaches levels of intelligence sufficient to properly and fully explore new strategies as an instrumental goal to satisfying the human utility function, if we do go this route.
If we sign flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies. This may have any number of effects which are fairly implementation-specific and unpredictable. However, it probably wouldn’t result in hyperexistential catastrophe. Provided everything else works as intended, this AI actually seems to be perfectly aligned. If the flip happened in a subhuman seed AI, it might brick; in this trivial case, it is neither aligned nor misaligned: it is an inanimate object.
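A minimal sketch of the kind of thing I have in mind, using a three-armed bandit in Python. The count-based bonus `beta / sqrt(1 + n)`, the arm rewards, and the step count are my own assumptions for illustration, not a claim about how any real system implements exploration.

```python
import math

ARM_REWARDS = [0.2, 0.5, 1.0]  # hypothetical deterministic payoffs; arm 2 is best


def run(beta, steps=50):
    """Greedy agent with a count-based exploration bonus of beta / sqrt(1 + n)."""
    counts = [0] * len(ARM_REWARDS)
    values = [0.0] * len(ARM_REWARDS)  # running mean reward per arm
    for _ in range(steps):
        # Pick the arm with the highest estimated value plus bonus
        # (ties go to the lowest index).
        arm = max(
            range(len(ARM_REWARDS)),
            key=lambda a: values[a] + beta / math.sqrt(1 + counts[a]),
        )
        counts[arm] += 1
        values[arm] += (ARM_REWARDS[arm] - values[arm]) / counts[arm]
    return counts


explorer = run(beta=1.0)      # bonus works as intended
sign_flipped = run(beta=-1.0)  # bonus sign flipped: trying new arms is penalized

print(explorer, sign_flipped)
```

With the bonus intact, the agent tries every arm and settles on the best one. With the sign flipped, it never leaves the first arm it happened to try: stuck and useless, but not optimising for anything bad.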
Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence.
The point of this argument is super obvious, so you probably thought I was saying something else. I’m going somewhere with this, though- I’ll expand later.
I see what you’re saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I’ll wait until you’re able to write down your thoughts on this at length; this is something that I’d like to see elaborated on (as well as everything else regarding hyperexistential risk.)