Interesting analogy. I can see what you’re saying, and I guess it depends on what specifically gets flipped. I’m unsure about the second example; something like exploring new strategies doesn’t seem like something an AGI would terminally value. It’s instrumental to optimising the reward function/model, but I can’t see it getting flipped with the reward function/model.
Sorry, I meant instrumentally value. Typo. Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maximums. We may see this behavior in future attempts at AGI, And no, it would not be flipped with the reward function/model- I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren’t any humans), whereas the latter may produce a negligible amount. I’m not really sure if it makes sense tbh.
Paperclipping seems to be negative utility, not approximately 0 utility. It involves all the humans being killed and our beautiful universe being ruined. I guess if there are no humans, there’s no utility in some sense, but human values don’t actually seem to work that way. I rate universes where humans never existed at all and
I’m… not sure what 0 utility would look like. It’s within the range of experiences that people experience on modern-day earth- somewhere between my current experience and being tortured. This is just definition problems, though- We could shift the scale such that paperclipping is zero utility, but in that case, we could also just make an AGI that has a minimum at paperclipping levels of utility.
Even if we engineered it carefully, that doesn’t rule out screw-ups. We need robust failsafe measures just in case, imo.
In the context of AI safety, I think “robust failsafe measures just in case” is part of “careful engineering”. So, we agree!
You’d still need to balance it in a way such that the system won’t spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn’t seem too difficult.
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post on this containing my thoughts sometime in the next week or two, and link it to you.
Paperclipping seems to be negative utility, not approximately 0 utility.
My thinking was that an AI system that *only* takes values between 0 and + ∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
How come? It doesn’t seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post containing my thoughts sometime in the next week or two, and link it to you.
Please do! I’d love to see a longer discussion on this type of thing.
EDIT: just thought some more about this and want to clear something up:
Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maximums. We may see this behavior in future attempts at AGI, And no, it would not be flipped with the reward function/model- I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
I’m a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:
Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.
So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
So it definitely seems plausible for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
I didn’t mean to imply that a signflipped AGI would not instrumentally explore.
I’m saying that, well… modern machine learning systems often get specific bonus utility for exploring, because it’s hard to explore the proper amount as an instrumental goal due to the difficulties of fully modelling the situation, and because systems which don’t have this bonus will often get stuck in local maximums.
Humans exhibit this property too. We have investigating things, acquiring new information, and building useful strategic models as a terminal goal- we are “curious”.
This is a feature we might see in early stages of modern attempts at full AGI, for similar reasons to why modern machine learning systems and humans exhibit this same behavior.
Presumably such features would be built to uninstall themselves after the AGI reaches levels of intelligence sufficient to properly and fully explore new strategies as an instrumental goal to satisfying the human utility function, if we do go this route.
If we sign flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies- this may have any number of effects which are fairly implementation specific and unpredictable. However, it probably wouldn’t result in hyperexistential catastrophe. This AI, providing everything else works as intended, actually seems to be perfectly aligned. If performed on a subhuman seed AI, it may brick- in this trivial case, it is neither aligned nor misaligned- it is an inanimate object.
Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence.
The point of this argument is super obvious, so you probably thought I was saying something else. I’m going somewhere with this, though- I’ll expand later.
I see what you’re saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I’ll wait until you’re able to write down your thoughts on this at length; this is something that I’d like to see elaborated on (as well as everything else regarding hyperexistential risk.)
Sorry, I meant instrumentally value. Typo. Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maximums. We may see this behavior in future attempts at AGI, And no, it would not be flipped with the reward function/model- I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
Paperclipping seems to be negative utility, not approximately 0 utility. It involves all the humans being killed and our beautiful universe being ruined. I guess if there are no humans, there’s no utility in some sense, but human values don’t actually seem to work that way. I rate universes where humans never existed at all and
I’m… not sure what 0 utility would look like. It’s within the range of experiences that people experience on modern-day earth- somewhere between my current experience and being tortured. This is just definition problems, though- We could shift the scale such that paperclipping is zero utility, but in that case, we could also just make an AGI that has a minimum at paperclipping levels of utility.
In the context of AI safety, I think “robust failsafe measures just in case” is part of “careful engineering”. So, we agree!
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post on this containing my thoughts sometime in the next week or two, and link it to you.
My thinking was that an AI system that *only* takes values between 0 and + ∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.
How come? It doesn’t seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.
Please do! I’d love to see a longer discussion on this type of thing.
EDIT: just thought some more about this and want to clear something up:
I’m a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:
So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
I didn’t mean to imply that a signflipped AGI would not instrumentally explore.
I’m saying that, well… modern machine learning systems often get specific bonus utility for exploring, because it’s hard to explore the proper amount as an instrumental goal due to the difficulties of fully modelling the situation, and because systems which don’t have this bonus will often get stuck in local maximums.
Humans exhibit this property too. We have investigating things, acquiring new information, and building useful strategic models as a terminal goal- we are “curious”.
This is a feature we might see in early stages of modern attempts at full AGI, for similar reasons to why modern machine learning systems and humans exhibit this same behavior.
Presumably such features would be built to uninstall themselves after the AGI reaches levels of intelligence sufficient to properly and fully explore new strategies as an instrumental goal to satisfying the human utility function, if we do go this route.
If we sign flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies- this may have any number of effects which are fairly implementation specific and unpredictable. However, it probably wouldn’t result in hyperexistential catastrophe. This AI, providing everything else works as intended, actually seems to be perfectly aligned. If performed on a subhuman seed AI, it may brick- in this trivial case, it is neither aligned nor misaligned- it is an inanimate object.
Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence.
The point of this argument is super obvious, so you probably thought I was saying something else. I’m going somewhere with this, though- I’ll expand later.
I see what you’re saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I’ll wait until you’re able to write down your thoughts on this at length; this is something that I’d like to see elaborated on (as well as everything else regarding hyperexistential risk.)