If you’re having significant anxiety from imagining some horrific I-have-no-mouth-and-I-must-scream scenario, I recommend that you multiply that dread by a very, very small number, so as to incorporate the low probability of such a scenario. You’re privileging this supposedly very low probability specific outcome over the rather horrifically wide selection of ways AGI could be a cosmic disaster.
This is, of course, not intended to dissuade you from pursuing solutions to such a disaster.
I don’t really know what the probability is. It seems somewhat low, but I’m not confident that it’s *that* low. I wrote a shortform about it last night (tl;dr it seems like this type of error could occur in a disjunction of ways and we need a good way of separating the AI in design space.)
I think I’d stop worrying about it if I were convinced that its probability is extremely low. But I’m not yet convinced of that. Something like the example Gwern provided elsewhere in this thread seems more worrying than the more frequently discussed cosmic ray scenarios to me.
You can’t really be accidentally slightly wrong. We’re not going to develop Mostly Friendly AI, which is Friendly AI but with the slight caveat that it has a slightly higher value on the welfare of shrimp than desired, with no other negative consequences. The molecular sorts of precision needed to get anywhere near the zone of loosely trying to maximize or minimize for anything resembling human values will probably only follow from a method that is converging towards the exact spot we want it to be at, such as some clever flawless version of reward modelling.
In the same way, we’re probably not going to accidentally land in hyperexistential disaster territory. We could have some sign flipped, our checksum changed, and all our other error-correcting methods (Any future seed AI should at least be using ECC memory, drives in RAID, etc.) defeated by religious terrorists, cosmic rays, unscrupulous programmers, quantum fluctuations, etc. However, the vast majority of these mistakes would probably buff out or result in paper-clipping. If an FAI has slightly too high of a value assigned to the welfare of shrimp, it will realize this in the process of reward modelling and correct the issue. If its operation does not involve the continual adaptation of the model that is supposed to represent human values, it’s not using a method which has any chance of converging to Overwhelming Victory or even adjacent spaces for any reason other than sheer coincidence.
A method such as this has, barring stuff which I need to think more about (stability under self-modification), no chance of ending up in a “We perfectly recreated human values… But placed an unreasonably high value on eating bread! Now all the humans will be force-fed bread until the stars burn out! Mwhahahahaha!” sorts of scenarios. If the system cares about humans being alive enough to not reconfigure their matter into something else, we’re probably using a method which is innately insulated from most types of hyperexistential risk.
It’s not clear that Gwern’s example, or even that category of problem, is particularly relevant to this situation. Most parallels to modern-day software systems and the errors they are prone to are probably best viewed as sobering reminders, not specific advice. Indeed, I suspect his comment was merely a sobering reminder and not actual advice. If humans are making changes to the critical software/hardware of an AGI (And we’ll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), while that AGI is already running, something bizarre and beyond my abilities of prediction is already happening. If you need to make changes after you turn your AGI on, you’ve already lost. If you don’t need to make changes and you’re making changes, you’re putting humanity in unnecessary risk. At this point, if we’ve figured out how to assist the seed AI in self-modification, at least until the point at which it can figure out how to do stable self-modification for itself, the problem is already solved. There’s more to be said here, but I’ll refrain for the purpose of brevity.
Essentially, an ordinary mistake won’t get us there. The type of mistake we would need to make in order to end up in hyperexistential disaster territory would, most likely, be an actual, literal sign flip scenario, and such scenarios seem much easier to address. There will probably only be a handful of weak points for this problem, and those weak points are all already things we’d pay extra super special attention to and will engineer in ways which make it extra super special sure nothing goes wrong. Our method will, ideally, be terrorist proof. It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.
I conjecture that most of the expected utility gained from combating the possibility of a hyperexistential disaster lies in the disproportionate positive effects on human sanity and the resulting improvements to the efforts to avoid regular existential disasters, and other such side-benefits.
None of this is intended to dissuade you from investigating this topic further. I’m merely arguing that a hyperexistential disaster is not remotely likely, not that it is not a concern. The fact that people will be concerned about this possibility is an important part of why the outcome is unlikely.
Thanks for the detailed response. A bit of nitpicking (from someone who doesn’t really know what they’re talking about):
However, the vast majority of these mistakes would probably buff out or result in paper-clipping.
I’m slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be *no* human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at “I have no mouth, and I must scream”. So any sign-flipping error would be expected to land there.
If humans are making changes to the critical software/hardware of an AGI (And we’ll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), *while that AGI is already running*, something bizarre and beyond my abilities of prediction is already happening.
In the example, the AGI was using online machine learning, which, as I understand it, would probably require the system to be hooked up to a database that humans have access to in order for it to learn properly. And I’m unsure as to how easy it’d be for things like checksums to pick up an issue like this (a boolean flag getting flipped) in a database.
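To make the checksum worry concrete, here is a minimal sketch, assuming a hypothetical stored record with an invented `maximize` flag. A digest taken over the record catches a flip that happens after the digest was recorded, but it cannot catch a wrong value that was written through the normal path, because the digest is then computed over the wrong value.

```python
import hashlib
import json


def checksum(record: dict) -> str:
    """SHA-256 digest over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical record; the field names are illustrative assumptions.
record = {"maximize": True, "learning_rate": 0.01}
stored_digest = checksum(record)

# Case 1: the flag is corrupted after the digest was stored (e.g. a bit flip).
corrupted = dict(record, maximize=False)
corruption_detected = checksum(corrupted) != stored_digest

# Case 2: a buggy but authorized write stores the wrong flag and
# refreshes the digest, so the record is self-consistent but wrong.
buggy = dict(record, maximize=False)
buggy_digest = checksum(buggy)
bug_detected = checksum(buggy) != buggy_digest

print(corruption_detected, bug_detected)
```

In this toy setup the first case is flagged and the second is not, which is roughly the gap between memory corruption and a bad commit.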
Perhaps there’ll be a reward function/model intentionally designed to disvalue some arbitrary “surrogate” thing in an attempt to separate it from hyperexistential risk. So “pessimizing the target metric” would look more like paperclipping than torture. But I’m unsure as to (1) whether the AGI’s developers would actually bother to implement it, and (2) whether it’d actually work in this sort of scenario.
Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn’t designed to be separated in design space from AM, someone could screw up with the model somehow. If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer’s Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.
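A toy version of the U = V + W point, as a sketch only: the outcome names and all numbers below are made up for illustration, with W strongly negative for the arbitrary surrogate outcome and V strongly negative for torture. It compares what gets picked when all of U is negated against what gets picked when only V is negated.

```python
from typing import Callable

# (V, W) for each outcome. All values are illustrative assumptions.
OUTCOMES = {
    "flourishing": (10.0, 0.0),
    "empty": (0.0, 0.0),
    "surrogate": (0.0, -1000.0),  # the arbitrary thing the AGI disvalues via W
    "torture": (-100.0, 0.0),
}


def best(utility: Callable[[float, float], float]) -> str:
    """Return the outcome that maximizes the given utility of (V, W)."""
    return max(OUTCOMES, key=lambda name: utility(*OUTCOMES[name]))


intended = best(lambda v, w: v + w)      # U = V + W
flipped_u = best(lambda v, w: -(v + w))  # sign flip applied to all of U
flipped_v = best(lambda v, w: -v + w)    # sign flip applied to V only

print(intended, flipped_u, flipped_v)
```

With these numbers, flipping U lands on the surrogate outcome, while flipping only V lands on torture, because W no longer dominates in the right direction.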
It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.
I think this is somewhat likely to be the case, but I’m not sure that I’m confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevented through error-correcting code memory (as you mentioned earlier.)
Despite my confusions, your response has definitely decreased my credence in this sort of thing happening.
I’m slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be no human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at “I have no mouth, and I must scream”. So any sign-flipping error would be expected to land there.
It’s hard to talk in specifics because my knowledge on the details of what future AGI architecture might look like is, of course, extremely limited.
As an almost entirely inapplicable analogy (which nonetheless still conveys my thinking here): consider the sorting algorithm for the comments on this post. If we flipped the “top-scoring” sorting algorithm to sort in the wrong direction, we would see the worst-rated comments on top, which would correspond to a hyperexistential disaster. However, if we instead flipped the effect that an upvote had on the score of a comment to negative values, it would sort comments which had no votes other than the default vote assigned on posting the comment to the top. This corresponds to paperclipping: it’s not minimizing the intended function, it’s just doing something weird.
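The comment-sorting analogy can be sketched in a few lines. This is an illustration under assumptions, not how any real site scores comments: the `Comment` type, the constant `DEFAULT_VOTE`, and the three example comments are all invented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Comment:
    name: str
    upvotes: int
    downvotes: int


DEFAULT_VOTE = 1  # assumed: every comment starts with one vote from its author


def score(c: Comment, upvote_weight: int = 1) -> int:
    """Default vote plus weighted upvotes minus downvotes."""
    return DEFAULT_VOTE + upvote_weight * c.upvotes - c.downvotes


def ranking(comments: List[Comment], upvote_weight: int = 1,
            descending: bool = True) -> List[str]:
    """Names of comments ordered by score, highest first by default."""
    ordered = sorted(comments, key=lambda c: score(c, upvote_weight),
                     reverse=descending)
    return [c.name for c in ordered]


comments = [
    Comment("praised", upvotes=5, downvotes=0),
    Comment("unvoted", upvotes=0, downvotes=0),
    Comment("panned", upvotes=0, downvotes=3),
]

normal = ranking(comments)
flipped_sort = ranking(comments, descending=False)    # sort direction flipped
flipped_upvote = ranking(comments, upvote_weight=-1)  # upvote effect flipped

print(normal, flipped_sort, flipped_upvote)
```

Flipping the sort direction puts the worst-rated comment first, whereas flipping the upvote weight puts the comment nobody voted on first, which is neither the best nor the worst by the intended score.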
If we inverted the utility function, this would (unless we take specific measures to combat it like you’re mentioning) lead to hyperexistential disaster. However, if we invert some constant which is meant to initially provide value for exploring new strategies while the AI is not yet intelligent enough to properly explore new strategies as an instrumental goal, the AI would effectively brick itself. It would place negative value on exploring new strategies, presumably including strategies which involve fixing this issue so it can acquire more utility and strategies which involve preventing the humans from turning it off. If we had some code which is intended to make the AI not turn off the evolution of the reward model before the AI values not turning off the reward model for other reasons (e.g. the reward model begins to properly model how humans don’t want the AI to turn the reward model evolution process off), and some crucial sign was flipped which made it do the opposite, the AI would freeze the process of the reward model being updated and then maximize whatever inane nonsense its model currently represented, and it would eventually run into some bizarre previously unconsidered and thus not appropriately penalized strategy comparable to tiling the universe with smiley faces, i.e. paperclipping.
These are really crude examples, but I think the argument is still valid. Also, this argument doesn’t address the core concern of “What about the things which DO result in hyperexistential disaster”, it just establishes that much of the class of mistake you may have previously thought usually or always resulted in hyperexistential disaster (sign flips on critical software points) in fact usually causes paperclipping or the AI bricking itself.
If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be no human utility (i.e. paperclips).
Can you clarify what you mean by this? Also, I get what you’re going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.
Perhaps there’ll be a reward function/model intentionally designed to disvalue some arbitrary “surrogate” thing in an attempt to separate it from hyperexistential risk. So “pessimizing the target metric” would look more like paperclipping than torture. But I’m unsure as to (1) whether the AGI’s developers would actually bother to implement it, and (2) whether it’d actually work in this sort of scenario.
I sure hope that future AGI developers can be bothered to embrace safe design!
Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn’t designed to be separated in design space from AM, someone could screw up with the model somehow.
The reward modelling system would need to be very carefully engineered, definitely.
If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer’s Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.
I thought this as well when I read the post. I’m sure there’s something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.
I think this is somewhat likely to be the case, but I’m not sure that I’m confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevented through error-correcting code memory (as you mentioned earlier.)
Sorry, I meant to convey that this was a feature we’re going to want to ensure that future AGI efforts display, not some feature which I have some other independent reason to believe would be displayed. It was an extension of the thought that “Our method will, ideally, be terrorist proof.”
As an almost entirely inapplicable analogy . . . it’s just doing something weird.
If we inverted the utility function . . . tiling the universe with smiley faces, i.e. paperclipping.
Interesting analogy. I can see what you’re saying, and I guess it depends on what specifically gets flipped. I’m unsure about the second example; something like exploring new strategies doesn’t seem like something an AGI would terminally value. It’s instrumental to optimising the reward function/model, but I can’t see it getting flipped *with* the reward function/model.
Can you clarify what you mean by this? Also, I get what you’re going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.
My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren’t any humans), whereas the latter may produce a negligible amount. I’m not really sure if it makes sense tbh.
The reward modelling system would need to be very carefully engineered, definitely.
Even if we engineered it carefully, that doesn’t rule out screw-ups. We need robust failsafe measures *just in case*, imo.
I thought of this as well when I read the post. I’m sure there’s something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.
I wonder if you could feasibly make it a part of the reward model. Perhaps you could train the reward model itself to disvalue something arbitrary (like paperclips) even more than torture, which would hopefully mitigate it. You’d still need to balance it in a way such that the system won’t spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn’t seem too difficult. Although, once again, we can’t really have high confidence (>90%) that the AGI developers are going to think to implement something like this.
There was also an interesting idea I found in a Facebook post about this type of thing that got linked somewhere (can’t remember where). Stuart Armstrong suggested that a utility function could be designed as such:
Let B1 and B2 be excellent, bestest outcomes. Define U(B1)=1, U(B2)=-1, and U=0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes. Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0,1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
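A small sketch of the XU construction from the quote, assuming a made-up set of three worlds with utilities in [0, 1]. The agent picks a world and a value of X together, and the code compares what a maximiser and a minimiser of X·U would choose.

```python
from itertools import product
from typing import Tuple

# Illustrative utilities in [0, 1]; the world names are assumptions.
WORLD_UTILITY = {"best": 1.0, "mediocre": 0.4, "worst": 0.0}
X_VALUES = (-1, 1)  # the trivial feature the agent can set freely


def objective(choice: Tuple[str, int]) -> float:
    """X * U for a (world, X) pair."""
    world, x = choice
    return x * WORLD_UTILITY[world]


choices = list(product(WORLD_UTILITY, X_VALUES))

maximiser_choice = max(choices, key=objective)
minimiser_choice = min(choices, key=objective)

print(maximiser_choice, minimiser_choice)
```

Both agents end up in the same world and differ only in X, since the only way to push X·U far from zero in either direction is to make U large.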
Even if we solve any issues with these (and actually bother to implement them), there’s still the risk of an error like this happening in a localised part of the reward function such that *only* the part specifying something bad gets flipped, although I’m a little confused about this one. It could very well be the case that the system’s complex enough that there isn’t just one bit indicating whether “pain” or “suffering” is good or bad. And we’d presumably (hopefully) have checksums and whatever else thrown in. Maybe this could be mitigated by assigning more positive utility to good outcomes than negative utility to bad outcomes? (I’m probably speaking out of my rear end on this one.)
Memory corruption seems to be another issue. Perhaps if we have more than one measure we’d be less vulnerable to memory corruption. Like, if we designed an AGI with a reward model that disvalues two arbitrary things rather than just one, and memory corruption screwed with *both* measures, then something probably just went *very* wrong in the AGI and it probably won’t be able to optimise for suffering anyway.
Interesting analogy. I can see what you’re saying, and I guess it depends on what specifically gets flipped. I’m unsure about the second example; something like exploring new strategies doesn’t seem like something an AGI would terminally value. It’s instrumental to optimising the reward function/model, but I can’t see it getting flipped with the reward function/model.
Sorry, I meant instrumentally value. Typo. Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maxima. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model; I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren’t any humans), whereas the latter may produce a negligible amount. I’m not really sure if it makes sense tbh.
Paperclipping seems to be negative utility, not approximately 0 utility. It involves all the humans being killed and our beautiful universe being ruined. I guess if there are no humans, there’s no utility in some sense, but human values don’t actually seem to work that way. I rate universes where humans never existed at all and
I’m… not sure what 0 utility would look like. It’s within the range of experiences that people have on modern-day Earth, somewhere between my current experience and being tortured. This is just a definitional problem, though: we could shift the scale such that paperclipping is zero utility, but in that case, we could also just make an AGI that has a minimum at paperclipping levels of utility.
Even if we engineered it carefully, that doesn’t rule out screw-ups. We need robust failsafe measures just in case, imo.
In the context of AI safety, I think “robust failsafe measures just in case” is part of “careful engineering”. So, we agree!
You’d still need to balance it in a way such that the system won’t spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn’t seem too difficult.
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post on this containing my thoughts sometime in the next week or two, and link it to you.
Paperclipping seems to be negative utility, not approximately 0 utility.
My thinking was that an AI system that *only* takes values between 0 and + ∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.
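As a sketch of this claim (with invented numbers, and taking as an assumption the disputed premise that paperclipping scores exactly 0), compare a utility scale bounded below at 0 with a signed scale where torture is strongly negative. The `sign` argument stands in for the sign-flip bug.

```python
from typing import Dict


def chosen(utilities: Dict[str, float], sign: int = 1) -> str:
    """Outcome picked by an agent maximizing sign * utility."""
    return max(utilities, key=lambda outcome: sign * utilities[outcome])


# Illustrative values only. On the non-negative scale, torture is assumed
# to leave a negligible positive amount of human utility.
nonnegative_scale = {"flourishing": 100.0, "torture": 0.01, "paperclips": 0.0}
signed_scale = {"flourishing": 100.0, "torture": -100.0, "paperclips": 0.0}

flipped_nonnegative = chosen(nonnegative_scale, sign=-1)
flipped_signed = chosen(signed_scale, sign=-1)

print(flipped_nonnegative, flipped_signed)
```

Under these assumptions the flipped agent on the non-negative scale prefers paperclips, and the flipped agent on the signed scale prefers torture. Whether human values can sensibly be put on a scale with its floor at paperclipping is exactly what the reply above questions.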
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
How come? It doesn’t seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post containing my thoughts sometime in the next week or two, and link it to you.
Please do! I’d love to see a longer discussion on this type of thing.
EDIT: just thought some more about this and want to clear something up:
Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maxima. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model; I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
I’m a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:
Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.
So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
So it definitely seems plausible for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
I didn’t mean to imply that a signflipped AGI would not instrumentally explore.
I’m saying that, well… modern machine learning systems often get specific bonus utility for exploring, because it’s hard to explore the proper amount as an instrumental goal due to the difficulties of fully modelling the situation, and because systems which don’t have this bonus will often get stuck in local maximums.
Humans exhibit this property too. We have investigating things, acquiring new information, and building useful strategic models as a terminal goal: we are “curious”.
This is a feature we might see in early stages of modern attempts at full AGI, for similar reasons to why modern machine learning systems and humans exhibit this same behavior.
Presumably such features would be built to uninstall themselves after the AGI reaches levels of intelligence sufficient to properly and fully explore new strategies as an instrumental goal to satisfying the human utility function, if we do go this route.
If we sign flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies. This may have any number of effects which are fairly implementation specific and unpredictable. However, it probably wouldn’t result in hyperexistential catastrophe. This AI, provided everything else works as intended, actually seems to be perfectly aligned. If performed on a subhuman seed AI, it may brick; in this trivial case, it is neither aligned nor misaligned, it is an inanimate object.
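Here is a minimal sketch of a sign-flipped exploration bonus, using a deterministic three-armed bandit as a stand-in. Everything in it is an assumption for illustration: the arm rewards, the count-based bonus `bonus_weight / (1 + pulls)`, and the 20-step horizon. It is not meant to resemble any real AGI design.

```python
from typing import Dict

# Deterministic reward per arm; values are illustrative assumptions.
REWARDS = {"a": 1.0, "b": 2.0, "c": 0.0}


def run(bonus_weight: float, steps: int = 20) -> Dict[str, int]:
    """Greedy agent with a count-based exploration bonus.

    Each step it pulls the arm with the highest
    estimate + bonus_weight / (1 + pulls), then updates that arm's
    running-mean estimate. Ties go to the earliest arm in REWARDS.
    """
    estimates = {arm: 0.0 for arm in REWARDS}
    pulls = {arm: 0 for arm in REWARDS}
    for _ in range(steps):
        arm = max(
            REWARDS,
            key=lambda a: estimates[a] + bonus_weight / (1 + pulls[a]),
        )
        pulls[arm] += 1
        estimates[arm] += (REWARDS[arm] - estimates[arm]) / pulls[arm]
    return pulls


intended_pulls = run(bonus_weight=3.0)   # bonus rewards trying rarely used arms
flipped_pulls = run(bonus_weight=-3.0)   # sign flip: trying new arms is penalized

print(intended_pulls, flipped_pulls)
```

With the intended sign, the agent tries every arm and settles on the best one. With the flipped sign, it locks onto whichever arm it happened to try first and never leaves; here that arm is neither the best nor the worst, so the failure looks like getting stuck rather than like minimizing reward.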
Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence.
The point of this argument is super obvious, so you probably thought I was saying something else. I’m going somewhere with this, though; I’ll expand later.
I see what you’re saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I’ll wait until you’re able to write down your thoughts on this at length; this is something that I’d like to see elaborated on (as well as everything else regarding hyperexistential risk.)
If you’re having significant anxiety from imagining some horrific I-have-no-mouth-and-I-must-scream scenario, I recommend that you multiply that dread by a very, very small number, so as to incorporate the low probability of such a scenario. You’re privileging this supposedly very low probability specific outcome over the rather horrifically wide selection of ways AGI could be a cosmic disaster.
This is, of course, not intended to dismay you from pursuing solutions to such a disaster.
I don’t really know what the probability is. It seems somewhat low, but I’m not confident that it’s *that* low. I wrote a shortform about it last night (tl;dr it seems like this type of error could occur in a disjunction of ways and we need a good way of separating the AI in design space.)
I think I’d stop worrying about it if I were convinced that its probability is extremely low. But I’m not yet convinced of that. Something like the example Gwern provided elsewhere in this thread seems more worrying than the more frequently discussed cosmic ray scenarios to me.
You can’t really be accidentally slightly wrong. We’re not going to develop Mostly Friendly AI, which is Friendly AI but with the slight caveat that it has a slightly higher value on the welfare of shrimp than desired, with no other negative consequences. The molecular sorts of precision needed to get anywhere near the zone of loosely trying to maximize or minimize for anything resembling human values will probably only follow from a method that is converging towards the exact spot we want it to be at, such as some clever flawless version of reward modelling.
In the same way, we’re probably not going to accidentally land in hyperexistential disaster territory. We could have some sign flipped, our checksum changed, and all our other error-correcting methods (Any future seed AI should at least be using ECC memory, drives in RAID, etc.) defeated by religious terrorists, cosmic rays, unscrupulous programmers, quantum fluctuations, etc. However, the vast majority of these mistakes would probably buff out or result in paper-clipping. If an FAI has slightly too high of a value assigned to the welfare of shrimp, it will realize this in the process of reward modelling and correct the issue. If its operation does not involve the continual adaptation of the model that is supposed to represent human values, it’s not using a method which has any chance of converging to Overwhelming Victory or even adjacent spaces for any reason other than sheer coincidence.
A method such as this has, barring stuff which I need to think more about (stability under self-modification), no chance of ending up in a “We perfectly recreated human values… But placed an unreasonably high value on eating bread! Now all the humans will be force-fed bread until the stars burn out! Mwhahahahaha!” sorts of scenarios. If the system cares about humans being alive enough to not reconfigure their matter into something else, we’re probably using a method which is innately insulated from most types of hyperexistential risk.
It’s not clear that Gwern’s example, or even that category of problem, is particularly relevant to this situation. Most parallels to modern-day software systems and the errors they are prone to are probably best viewed as sobering reminders, not specific advice. Indeed, I suspect his comment was merely a sobering reminder and not actual advice. If humans are making changes to the critical software/hardware of an AGI (And we’ll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), while that AGI is already running, something bizarre and beyond my abilities of prediction is already happening. If you need to make changes after you turn your AGI on, you’ve already lost. If you don’t need to make changes and you’re making changes, you’re putting humanity in unnecessary risk. At this point, if we’ve figured out how to assist the seed AI in self-modification, at least until the point at which it can figure out how to do stable self-modification for itself, the problem is already solved. There’s more to be said here, but I’ll refrain for the purpose of brevity.
Essentially, we can not make any ordinary mistake. The type of mistake we would need to make in order to land up in hyperexistential disaster territory would, most likely, be an actual, literal sign flip scenario, and such scenarios seem much easier to address. There will probably only be a handful of weak points for this problem, and those weak points are all already things we’d pay extra super special attention to and will engineer in ways which make it extra super special sure nothing goes wrong. Our method will, ideally, be terrorist proof. It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.
I conjecture that most of the expected utility gained from combating the possibility of a hyperexistential disaster lies in the disproportionate positive effects on human sanity and the resulting improvements to the efforts to avoid regular existential disasters, and other such side-benefits.
None of this is intended to dissuade you from investigating this topic further. I’m merely arguing that a hyperexistential disaster is not remotely likely- not that it is not a concern. The fact that people will be concerned about this possibility is an important part of why the outcome is unlikely.
Thanks for the detailed response. A bit of nitpicking (from someone who doesn’t really know what they’re talking about):
I’m slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be *no* human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at “I have no mouth, and I must scream”. So any sign-flipping error would be expected to land there.
In the example, the AGI was using online machine learning, which, as I understand it, would probably require the system to be hooked up to a database that humans have access to in order for it to learn properly. And I’m unsure as to how easy it’d be for things like checksums to pick up an issue like this (a boolean flag getting flipped) in a database.
Perhaps there’ll be a reward function/model intentionally designed to disvalue some arbitrary “surrogate” thing in an attempt to separate it from hyperexistential risk. So “pessimizing the target metric” would look more like paperclipping than torture. But I’m unsure as to (1) whether the AGI’s developers would actually bother to implement it, and (2) whether it’d actually work in this sort of scenario.
Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn’t designed to be separated in design space from AM, someone could screw up with the model somehow. If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer’s Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.
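A toy sketch of the U = V + W point, assuming four made-up outcomes with made-up scores (the numbers and outcome names are purely illustrative). It compares what an optimizer would pick under the intended utility, under a sign flip of the whole of U, and under a sign flip of V alone.

```python
# Hypothetical outcomes. V is the reward-model score, W is the penalty
# attached to the arbitrary "surrogate" thing the AGI is built to disvalue.
OUTCOMES = {
    "flourishing": {"V": 100.0, "W": 0.0},
    "paperclips": {"V": 0.0, "W": 0.0},
    "torture": {"V": -100.0, "W": 0.0},
    "surrogate": {"V": 0.0, "W": -1000.0},
}

def best_outcome(utility):
    """Return the outcome name that maximizes the given utility function."""
    return max(OUTCOMES, key=lambda name: utility(OUTCOMES[name]))

intended = best_outcome(lambda o: o["V"] + o["W"])       # U = V + W
flipped_u = best_outcome(lambda o: -(o["V"] + o["W"]))   # sign flip on all of U
flipped_v = best_outcome(lambda o: -o["V"] + o["W"])     # sign flip on V only

print(intended, flipped_u, flipped_v)
```

Under these assumptions the surrogate term only protects against a flip that hits U as a whole; a flip confined to V leaves W pointing the intended way, so the surrogate outcome stays heavily penalized and the worst outcome under V wins.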
I think this is somewhat likely to be the case, but I’m not sure that I’m confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevented through error-correcting code memory (as you mentioned earlier).
Despite my confusions, your response has definitely decreased my credence in this sort of thing happening.
It’s hard to talk in specifics because my knowledge on the details of what future AGI architecture might look like is, of course, extremely limited.
As an almost entirely inapplicable analogy (which nonetheless still conveys my thinking here): consider the sorting algorithm for the comments on this post. If we flipped the “top-scoring” sorting algorithm to sort in the wrong direction, we would see the worst-rated posts on top, which would correspond to a hyperexistential disaster. However, if we instead flipped the effect that an upvote had on the score of a comment to negative values, it would sort to the top those comments which had no votes other than the default vote assigned on posting. This corresponds to paperclipping: it’s not minimizing the intended function, it’s just doing something weird.
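The comment-sorting analogy above can be sketched in a few lines, assuming a made-up scoring rule (default vote of 1, plus upvotes, minus downvotes) and four invented comments. This is not the site’s actual ranking code, just an illustration of the two different flips.

```python
# Hypothetical comments: name -> (upvotes, downvotes).
COMMENTS = {"a": (5, 0), "b": (0, 3), "c": (0, 0), "d": (3, 1)}
DEFAULT_VOTE = 1  # assumed: every comment starts with one vote from its author

def score(up, down, upvote_weight=1):
    """Score of a comment; upvote_weight=-1 models the flipped upvote effect."""
    return DEFAULT_VOTE + upvote_weight * up - down

def ranking(upvote_weight=1, descending=True):
    """Comment names ordered by score; descending=False models the flipped sort."""
    return sorted(
        COMMENTS,
        key=lambda name: score(*COMMENTS[name], upvote_weight),
        reverse=descending,
    )

print(ranking())                    # intended: best comment first
print(ranking(descending=False))    # flipped sort direction: worst comment first
print(ranking(upvote_weight=-1))    # flipped upvote effect: the unvoted comment first
```

Flipping the sort direction puts the most downvoted comment on top, while flipping the upvote weight puts the comment nobody voted on at the top, which is neither the best nor the worst under the intended score.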
If we inverted the utility function, this would (unless we take specific measures to combat it like you’re mentioning) lead to hyperexistential disaster. However, if we invert some constant which is meant to initially provide value for exploring new strategies while the AI is not yet intelligent enough to properly explore new strategies as an instrumental goal, the AI would effectively brick itself. It would place negative value on exploring new strategies, presumably including strategies which involve fixing this issue so it can acquire more utility and strategies which involve preventing the humans from turning it off. Or suppose we had some code which is intended to keep the AI from turning off the evolution of the reward model before the AI values not turning it off for other reasons (e.g. the reward model begins to properly model how humans don’t want the AI to turn the reward model evolution process off), and some crucial sign was flipped which made it do the opposite. The AI would freeze the process of the reward model being updated and then maximize whatever inane nonsense its model currently represented. It would eventually run into some bizarre, previously unconsidered and thus not appropriately penalized strategy comparable to tiling the universe with smiley faces, i.e. paperclipping.
These are really crude examples, but I think the argument is still valid. Also, this argument doesn’t address the core concern of “What about the things which DO result in hyperexistential disaster?”; it just establishes that much of the class of mistake you may have previously thought usually or always resulted in hyperexistential disaster (sign flips on critical software points) in fact usually causes paperclipping or the AI bricking itself.
Can you clarify what you mean by this? Also, I get what you’re going for, but paperclipping is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.
I sure hope that future AGI developers can be bothered to embrace safe design!
The reward modelling system would need to be very carefully engineered, definitely.
I thought this as well when I read the post. I’m sure there’s something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.
Sorry, I meant to convey that this was a feature we’re going to want to ensure that future AGI efforts display, not some feature which I have some other independent reason to believe would be displayed. It was an extension of the thought that “Our method will, ideally, be terrorist proof.”
Interesting analogy. I can see what you’re saying, and I guess it depends on what specifically gets flipped. I’m unsure about the second example; something like exploring new strategies doesn’t seem like something an AGI would terminally value. It’s instrumental to optimising the reward function/model, but I can’t see it getting flipped *with* the reward function/model.
My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren’t any humans), whereas the latter may produce a negligible amount. I’m not really sure if it makes sense tbh.
Even if we engineered it carefully, that doesn’t rule out screw-ups. We need robust failsafe measures *just in case*, imo.
I wonder if you could feasibly make it a part of the reward model. Perhaps you could train the reward model itself to disvalue something arbitrary (like paperclips) even more than torture, which would hopefully mitigate it. You’d still need to balance it in a way such that the system won’t spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn’t seem too difficult. Although, once again, we can’t really have high confidence (>90%) that the AGI developers are going to think to implement something like this.
There was also an interesting idea I found in a Facebook post about this type of thing that got linked somewhere (can’t remember where). Stuart Armstrong suggested that a utility function could be designed as such:
Even if we solve any issues with these (and actually bother to implement them), there’s still the risk of an error like this happening in a localised part of the reward function such that *only* the part specifying something bad gets flipped, although I’m a little confused about this one. It could very well be the case that the system’s complex enough that there isn’t just one bit indicating whether “pain” or “suffering” is good or bad. And we’d presumably (hopefully) have checksums and whatever else thrown in. Maybe this could be mitigated by assigning more positive utility to good outcomes than negative utility to bad outcomes? (I’m probably speaking out of my rear end on this one.)
Memory corruption seems to be another issue. Perhaps if we have more than one measure we’d be less vulnerable to memory corruption. Like, if we designed an AGI with a reward model that disvalues two arbitrary things rather than just one, and memory corruption screwed with *both* measures, then something probably just went *very* wrong in the AGI and it probably won’t be able to optimise for suffering anyway.
Sorry, I meant instrumentally value. Typo. Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maxima. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model; I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
Paperclipping seems to be negative utility, not approximately 0 utility. It involves all the humans being killed and our beautiful universe being ruined. I guess if there are no humans, there’s no utility in some sense, but human values don’t actually seem to work that way. I rate universes where humans never existed at all and
I’m… not sure what 0 utility would look like. It’s within the range of what people experience on modern-day Earth, somewhere between my current experience and being tortured. This is just a definition problem, though. We could shift the scale such that paperclipping is zero utility, but in that case, we could also just make an AGI that has a minimum at paperclipping levels of utility.
In the context of AI safety, I think “robust failsafe measures just in case” is part of “careful engineering”. So, we agree!
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post on this containing my thoughts sometime in the next week or two, and link it to you.
My thinking was that an AI system that *only* takes values between 0 and + ∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.
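A small sketch of that arithmetic, assuming welfare can be split into a positive and a negative part and using made-up numbers for three outcomes. It contrasts a utility that only counts positive welfare (so it is bounded below by 0) with one that also subtracts negative welfare, and asks what each would pick after a sign flip.

```python
# Hypothetical outcomes with made-up positive and negative welfare totals.
OUTCOMES = {
    "flourishing": {"pos": 100.0, "neg": 0.0},
    "paperclips": {"pos": 0.0, "neg": 0.0},   # no humans, so nothing is counted
    "torture": {"pos": 0.01, "neg": 100.0},   # a negligible amount of positive welfare
}

def positive_only(o):
    """Strict positive utilitarian: counts only positive welfare, minimum is 0."""
    return o["pos"]

def full(o):
    """Utility that also subtracts negative welfare, so it can go below 0."""
    return o["pos"] - o["neg"]

def argmax(utility, sign=1):
    """Outcome chosen by an optimizer; sign=-1 models a sign-flipped utility."""
    return max(OUTCOMES, key=lambda name: sign * utility(OUTCOMES[name]))

print(argmax(positive_only), argmax(positive_only, sign=-1))
print(argmax(full), argmax(full, sign=-1))
```

With these numbers the flipped positive-only optimizer lands on paperclips and the flipped full optimizer lands on torture. This only shows the arithmetic; whether paperclipping really sits at zero is the definitional question raised elsewhere in this thread.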
How come? It doesn’t seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.
Please do! I’d love to see a longer discussion on this type of thing.
EDIT: just thought some more about this and want to clear something up:
I’m a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:
So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
I didn’t mean to imply that a signflipped AGI would not instrumentally explore.
I’m saying that, well… modern machine learning systems often get specific bonus utility for exploring, because it’s hard to explore the proper amount as an instrumental goal due to the difficulties of fully modelling the situation, and because systems which don’t have this bonus will often get stuck in local maxima.
Humans exhibit this property too. We have investigating things, acquiring new information, and building useful strategic models as a terminal goal; we are “curious”.
This is a feature we might see in early stages of modern attempts at full AGI, for similar reasons to why modern machine learning systems and humans exhibit this same behavior.
Presumably such features would be built to uninstall themselves after the AGI reaches levels of intelligence sufficient to properly and fully explore new strategies as an instrumental goal to satisfying the human utility function, if we do go this route.
If we sign flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies. This may have any number of effects which are fairly implementation specific and unpredictable. However, it probably wouldn’t result in hyperexistential catastrophe. This AI, provided everything else works as intended, actually seems to be perfectly aligned. If performed on a subhuman seed AI, it may brick. In this trivial case, it is neither aligned nor misaligned; it is an inanimate object.
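As a crude illustration of a flipped exploration bonus, here is a toy three-armed bandit, assuming deterministic made-up payoffs and a count-based bonus of the form weight / sqrt(1 + pulls). The setup and names are hypothetical and far simpler than any real system; the point is only to compare a positive bonus weight with a sign-flipped one.

```python
import math

TRUE_REWARD = [0.2, 0.5, 0.9]  # hypothetical fixed payoff of each arm

def run(bonus_weight, steps=30):
    """Greedy agent with a count-based exploration bonus; returns pulls per arm."""
    counts = [0] * len(TRUE_REWARD)
    estimates = [0.0] * len(TRUE_REWARD)  # the agent starts out knowing nothing
    for _ in range(steps):
        def value(arm):
            # Estimated payoff plus a bonus that shrinks as the arm is tried more.
            return estimates[arm] + bonus_weight / math.sqrt(1 + counts[arm])
        arm = max(range(len(TRUE_REWARD)), key=value)
        counts[arm] += 1
        estimates[arm] = TRUE_REWARD[arm]  # payoffs are deterministic here
    return counts

explore = run(bonus_weight=1.0)    # intended sign: rarely tried arms look attractive
flipped = run(bonus_weight=-1.0)   # flipped sign: rarely tried arms look unattractive

print(explore, flipped)
```

With the intended sign the agent tries every arm and ends up mostly pulling the best one. With the flipped sign it never leaves the first arm it happened to try, so it is stuck doing something mediocre rather than seeking out the worst payoff.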
Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence.
The point of this argument is super obvious, so you probably thought I was saying something else. I’m going somewhere with this, though. I’ll expand later.
I see what you’re saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I’ll wait until you’re able to write down your thoughts on this at length; this is something that I’d like to see elaborated on (as well as everything else regarding hyperexistential risk.)