It freaks me out that we have Loss Functions and also Utility Functions and their type signature is exactly the same, but if you put one in a place where the other was expected, it causes literally the worst possible thing to happen that ever could happen. I am not comfortable with this at all.
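A minimal sketch of why that type-signature overlap is so unforgiving, assuming a toy optimizer whose contract is simply "minimize the callable you hand me" (all names and numbers below are invented for illustration, not anyone's actual training code):

```python
# Toy illustration (not any real system's code): a loss and a utility share the
# same type signature, so nothing stops you from passing one where the other
# was expected, and the optimizer will silently optimize in the wrong direction.
from typing import Callable

Score = Callable[[float], float]  # outcome -> scalar; could be a loss or a utility

def loss(outcome: float) -> float:
    return (outcome - 1.0) ** 2        # smaller is better; optimum at outcome = 1.0

def utility(outcome: float) -> float:
    return -((outcome - 1.0) ** 2)     # larger is better; optimum also at outcome = 1.0

def greedy_minimize(score: Score, candidates: list[float]) -> float:
    """Pick the candidate with the smallest score; correct only for losses."""
    return min(candidates, key=score)

candidates = [0.0, 0.5, 1.0, 1.5]
print(greedy_minimize(loss, candidates))     # 1.0: the intended optimum
print(greedy_minimize(utility, candidates))  # 0.0: the candidate the utility ranks worst
```

Both functions have the same optimum and the same signature; only the sign differs, and nothing in the types flags the mix-up.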
It is definitely awkward when that happens. Reward functions are hard.
Do you think that this type of thing could plausibly occur *after* training and deployment?
Yes. For example: lots of applications use online learning. A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.
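As a hedged sketch of that failure mode, here is a toy online learner whose notion of "better" depends on a config flag stored elsewhere; the flag name, schema, and update rule are all hypothetical:

```python
# Hypothetical sketch of the scenario above: a greedy online learner reads a
# flag whose meaning someone later flips upstream without updating this caller.
from collections import defaultdict

config = {"metric_is_cost": False}  # original convention: False means higher metric is better

totals = defaultdict(float)
counts = defaultdict(int)

def record(arm, metric):
    totals[arm] += metric
    counts[arm] += 1

def choose_arm(arms):
    # Greedily pick the arm with the best average metric observed so far.
    def avg(a):
        return totals[a] / counts[a] if counts[a] else 0.0
    best = min if config["metric_is_cost"] else max
    return best(arms, key=avg)

record("a", 1.0)
record("b", 0.1)
print(choose_arm(["a", "b"]))   # "a": the higher-metric arm, as intended

# The flag's meaning gets redefined upstream ("True now means higher is better")
# but this caller is never updated:
config["metric_is_cost"] = True
print(choose_arm(["a", "b"]))   # "b": the learner now actively pessimizes its metric
```

Nothing crashes and no metric goes missing; the system keeps learning, just in the opposite direction.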
Do you think that this specific risk could be mitigated by some variant of Eliezer’s separation from hyperexistential risk or Stuart Armstrong’s idea here:

> Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = −1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
>
> Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
Or at least prevent sign flip errors from causing something worse than paperclipping?
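For concreteness, here is a toy numerical check of the quoted construction. It assumes, purely for illustration, a small finite set of worlds and a made-up U; it only shows that maximising and minimising XU select the same world with opposite X:

```python
# Toy check of the XU construction quoted above. The worlds and U values are
# invented; the point is only that the argmax and the argmin agree on the world.
worlds = {"great": 1.0, "okay": 0.5, "bad": 0.0}   # U(world) in [0, 1]

def XU(world, x):
    return x * worlds[world]

def optimize(sign):
    # sign = +1 maximises XU, sign = -1 minimises XU
    return max(((w, x) for w in worlds for x in (-1, +1)),
               key=lambda wx: sign * XU(*wx))

print(optimize(+1))   # ('great', 1): the maximiser picks the best world with X = +1
print(optimize(-1))   # ('great', -1): the minimiser picks the same world with X = -1
```

Of course this says nothing about whether a real training process would preserve that structure; it only illustrates why a sign flip over XU changes X rather than the chosen world.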
Interesting. Terrifying, but interesting.
Forgive me for my stupidity (I’m not exactly an expert in machine learning), but it seems to me that building an AGI linked to some sort of database like that, in such a fashion that some random guy’s screw-up can effectively reverse the utility function completely, is a REALLY stupid idea. Would there not be a safer way of doing things?
If we actually built an AGI that optimised to maximise a loss function, wouldn’t we notice long before deploying the thing?
I’d imagine that this type of thing would be sanity-checked and tested intensively, so signflip-type errors would predominantly be scenarios where the error occurs *after* deployment, like the one Gwern mentioned (“A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.”)
Even if you disclaim configuration errors or updates (despite this accounting for most of a system’s operating lifespan, and human/configuration errors accounting for a large fraction of all major errors at cloud providers etc according to postmortems), an error may still happen too fast to notice. Recall that in the preference learning case, the bug manifested after Christiano et al went to sleep, and they woke up to the maximally-NSFW AI. AlphaZero trained in ~2 hours wallclock, IIRC. Someone working on an even larger cluster commits a change and takes a quick bathroom break...
Wouldn’t any configuration errors or updates be caught with sanity-checking tools though? Maybe the way I’m visualising this is just too simplistic, but any developers capable of creating an *aligned* AGI are going to be *extremely* careful not to fuck up. Sure, it’s possible, but the most plausible cause of a hyperexistential catastrophe to me seems to be where a SignFlip-type error occurs once the system has been deployed.
Hopefully a system as crucially important as an AGI isn’t going to have just one guy watching it who “takes a quick bathroom break”. When the difference is literally Heaven and Hell (minimising human values), I’d consider only having one guy in a basement monitoring it to be gross negligence.
Many entities have sanity-checking tools. They fail. Many have careful developers. They fail. Many have automated tests. They fail. And so on. Disasters happen because all of those will fail to work every time and therefore all will fail some time. If any of that sounds improbable, as if there would have to be a veritable malevolent demon arranging to make every single safeguard fail or backfire (literally, sometimes, like the recent warehouse explosion—triggered by welders trying to safeguard it!), you should probably read more about complex systems and their failures to understand how normal it all is.
Sure, but the *specific* type of error I’m imagining would surely be easier to pick up than most other errors. I have no idea what sort of sanity checking was done with GPT-2, but the fact that the developers were asleep when it trained is telling: they weren’t being as careful as they could’ve been.
For this type of bug (a sign error in the utility function) to occur *before* the system is deployed and somehow persist, it’d have to make it past all sanity-checking tools (which I imagine would be used extensively with an AGI) *and* somehow not be noticed at all while the model trains *and* whatever else. Yes, these sorts of conjunctions occur in the real world, but the error is generally more subtle than “system does the complete opposite of what it was meant to do”.
I made a question post about this specific type of bug occurring before deployment a while ago, and my views have shifted significantly since then: a bug as obvious as a flipped sign on the utility function would very likely be noticed before deployment. Now I’m more worried about something like this happening *after* the system has been deployed.
I think a more robust solution to all of these sorts of errors would be something like the separation from hyperexistential risk article that I linked in my previous response. I optimistically hope that we’re able to come up with a utility function that doesn’t do anything worse than death when minimised, just in case.
At least with current technologies, I expect serious risks to start occurring during training, not deployment. That’s ultimately when you will have the greatest learning happening, when you have the greatest access to compute, and when you will first cross the threshold of intelligence that will make the system actually dangerous. So I don’t think that just checking things after they are trained is safe.
I’m under the impression that an AGI would be monitored *during* training as well. So you’d effectively need the system to turn “evil” (utility function flipped) during the training process, and the system to be smart enough to conceal that the error occurred. So it’d need to happen a fair bit into the training process. I guess that’s possible, but IDK how likely it’d be.
Yeah, I do think it’s likely that AGI would be monitored during training, but the specific instance of OpenAI staff being asleep while the AI trained is a clear instance of us not monitoring the AI during the most crucial periods (which, to be clear, I think is fine, since I think the risks were indeed quite low, and I don’t see it as providing much evidence about OpenAI’s future practices).
Given that compute is very expensive, economic pressures will push training to be 24/7, so it’s unlikely that people generally pause the training when going to sleep.
Sure, but I’d expect that a system as important as this would have people monitoring it 24/7.
Maybe the project will come up with some mechanism that detects that. But if they fall back to the naive “just watch what it does in the test environment and assume it’ll do the same in production,” then there is a risk it’s going to figure out it’s in a test environment, and that its judges would not react well to finding out what is wrong with its utility function, and then it will act aligned in the testing environment.
If we ever see a news headline saying “Good News, AGI seems to ‘self-align’ regardless of the sign of the utility function!” that will be some very bad news.
I asked Rohin Shah about that possibility in a question thread about a month ago. I think he’s probably right that this type of thing would only plausibly make it through the training process if the system’s *already* smart enough to be able to think about this type of thing. And then on top of that there are still things like sanity checks which, while they’d miss plenty of subtler errors, would probably notice a sign error. See also this comment:

> Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea.
IMO it’s incredibly important that we find a way to prevent this type of thing from occurring *after* the system has been trained, whether that be hyperexistential separation or something else. I think that a team that’s safety-conscious enough to come up with a (reasonably) aligned AGI design is going to put a considerable amount of effort into fixing bugs, and one as obvious as a sign error would be unlikely to make it through. Even better, hopefully they would have come up with a utility function that can’t be easily reversed by a single bit flip, or that doesn’t cause outcomes worse than death when minimised. That’d (hopefully?) solve the SignFlip issue *regardless* of what causes it.