Paperclipping seems to have negative utility, not approximately zero utility.
My thinking was that an AI system whose utility function *only* takes values between 0 and +∞ (or some arbitrary positive number) would identify that killing humans results in 0 human value, which is its minimum utility.
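To make that concrete, here is a minimal sketch (hypothetical names and numbers, not any real system) of a utility function clamped to [0, +∞): under this construction there is no state the system can rank as worse than "everyone is dead", so there is nothing below zero for a sign-flip-style error to aim at.

```python
# Toy sketch (hypothetical, not any real system): a utility function whose
# range is clamped to [0, +inf). "All humans dead" scores 0, which is already
# the global minimum -- the system cannot rank any outcome as *worse* than that.

def bounded_utility(raw_score: float) -> float:
    """Map an unbounded raw score onto a non-negative utility."""
    return max(0.0, raw_score)

# Illustrative outcomes and raw scores (numbers are made up).
outcomes = {
    "humans flourishing": 100.0,
    "status quo": 10.0,
    "all humans dead": 0.0,
    "hypothetical 'worse than dead' outcome": -100.0,  # raw score, pre-clamp
}

for name, raw in outcomes.items():
    print(f"{name}: utility = {bounded_utility(raw)}")
# The last two outcomes both map to 0: the bounded system cannot prefer
# "worse than dead" over "dead", which is the property being argued for.
```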
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
How come? It doesn’t seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post containing my thoughts sometime in the next week or two, and link it to you.
Please do! I’d love to see a longer discussion on this type of thing.
EDIT: just thought some more about this and want to clear something up:
Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maxima. We may see this behavior in future attempts at AGI. And no, it would not be flipped along with the reward function/model; I’m highlighting that there is a really large variety of sign-flip mistakes and most of them probably result in paperclipping.
I’m a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:
Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.
So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
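To make the quoted passage easier to think about, here is a toy sketch of an RLHF-style objective. This is my own illustration, not OpenAI’s actual code, and the internal mechanism of their bug may differ; the assumption is only that, whatever the two sign flips did internally, the effective objective the buggy run ended up maximizing was roughly "negated learned reward plus an intact KL penalty", which is what produces fluent but maximally bad output rather than gibberish.

```python
# Toy sketch only -- NOT OpenAI's implementation; names and coefficients are
# hypothetical. It just illustrates the shape of an RLHF-style objective and
# the effective objective the buggy run appears to have optimized.

def intended_objective(r_learned: float, kl: float, beta: float = 0.1) -> float:
    """Intended: maximize the learned reward while staying close (low KL)
    to the pretrained language model, which keeps the text coherent."""
    return r_learned - beta * kl

def effective_buggy_objective(r_learned: float, kl: float, beta: float = 0.1) -> float:
    """Assumed net effect of the sign-flip bug: the learned-reward term is
    negated, but low KL is still rewarded, so the policy keeps producing
    natural language while driving the learned reward as low as possible."""
    return -r_learned - beta * kl

# A policy maximizing effective_buggy_objective is still doing competent
# optimization (including instrumental behavior like exploring strategies);
# it is simply optimizing for maximally *bad* ratings, matching the
# "coherent but maximally bad output" described in the quote.
```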
I didn’t mean to imply that a sign-flipped AGI would not instrumentally explore.
I’m saying that modern machine learning systems often get a specific bonus utility for exploring, because it’s hard to explore the proper amount as an instrumental goal (due to the difficulty of fully modelling the situation), and because systems without this bonus often get stuck in local maxima.
Humans exhibit this property too: we have investigating things, acquiring new information, and building useful strategic models as terminal goals; we are “curious”.
This is a feature we might see in early stages of modern attempts at full AGI, for reasons similar to why modern machine learning systems and humans exhibit the same behavior.
Presumably such features would be built to uninstall themselves once the AGI reaches a level of intelligence sufficient to properly and fully explore new strategies as an instrumental goal of satisfying the human utility function, if we do go this route.
If we sign-flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies. This could have any number of effects, which are fairly implementation-specific and unpredictable, but it probably wouldn’t result in hyperexistential catastrophe. This AI, provided everything else works as intended, actually seems to be perfectly aligned. If the flip were performed on a subhuman seed AI, it might simply brick; in that trivial case it is neither aligned nor misaligned, it is an inanimate object.
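A toy sketch of what a flipped exploration bonus could look like (hypothetical names; real implementations will vary): the agent’s effective reward is the task reward plus a novelty bonus, and negating the bonus coefficient makes the agent novelty-averse without touching the task reward itself, so the failure mode is stagnation rather than pursuit of the worst possible outcome.

```python
# Toy sketch (hypothetical, not any particular system): an agent whose
# effective reward is the task reward plus a count-based exploration bonus.
# Flipping the sign of the bonus coefficient penalizes novelty -- the agent
# becomes reluctant to try new strategies -- but the task reward is untouched.

BONUS_COEF = 0.5           # intended exploration bonus weight
FLIPPED_BONUS_COEF = -0.5  # the hypothesized sign-flip bug

visit_counts: dict[str, int] = {}

def novelty(state: str) -> float:
    """Simple count-based novelty: rarely visited states score higher."""
    visit_counts[state] = visit_counts.get(state, 0) + 1
    return 1.0 / visit_counts[state]

def effective_reward(task_reward: float, state: str, bonus_coef: float) -> float:
    """Task reward plus (possibly sign-flipped) exploration bonus."""
    return task_reward + bonus_coef * novelty(state)

# Tiny demo: novelty shrinks as a state is revisited, so the intended
# coefficient nudges the agent toward fresh states, while the flipped
# coefficient actively punishes visiting anything new.
print(effective_reward(1.0, "state_a", BONUS_COEF))          # bonus for a fresh state
print(effective_reward(1.0, "state_a", BONUS_COEF))          # bonus shrinks on revisit
print(effective_reward(1.0, "state_b", FLIPPED_BONUS_COEF))  # novelty now penalized
```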
Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence.
The point of this argument is super obvious, so you probably thought I was saying something else. I’m going somewhere with this, though; I’ll expand later.
I see what you’re saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I’ll wait until you’re able to write down your thoughts on this at length; this is something that I’d like to see elaborated on (as well as everything else regarding hyperexistential risk).