I think the idea is that the 4th scenario is the case, and you can’t discern whether you’re the real you or the simulated version, as the simulation is (near-) perfect. In that scenario, you should act in the same way that you’d want the simulated version to. Either (1) you’re a simulation and the real you just won $1,000,000; or (2) you’re the real you and the simulated version of you thought the same way that you did and one-boxed (meaning that you get $1,000,000 if you one-box.)
Anirandis
If Trump loses the election, he’s not the president anymore and the federal bureaucracy and military will stop listening to him.
He’d still be president until Biden’s inauguration though. I think most of the concern is that there’d be ~3 months of a president Trump with nothing to lose.
If anyone happens to be willing to privately discuss some potentially infohazardous stuff that’s been on my mind (and not in a good way) involving acausal trade, I’d appreciate it—PM me. It’d be nice if I can figure out whether I’m going batshit.
it’s much harder to know if you’ve got it pointed in the right direction or not
Perhaps, but the type of thing I’m describing in the post is more preventing worse-than-death outcomes even if the sign is flipped (by designing a reward function/model in such a way that it’s not going to torture everyone if that’s the case.)
This seems easier than recognising whether the sign is flipped or just designing a system that can’t experience these sign-flip type errors; I’m just unsure whether this is something that we have robust solutions for. If it turns out that someone’s figured out a reliable solution to this problem, then the only real concern is whether the AI’s developers would bother to implement it. I’d much rather risk the system going wrong and paperclipping than going wrong and turning “I have no mouth, and I must scream” into a reality.
My anxieties over this stuff tend not to be so bad late at night, TBH.
Seems a little bit beyond me at 4:45am; I’ll probably take a look tomorrow when I’m less sleep deprived (although still can’t guarantee I’ll be able to make it through then; there’s quite a bit of technical language in there that makes my head spin.) Are you able to provide a brief tl;dr, and have you thought much about “sign flip in reward function” or “direction of updates to reward model flipped”-type errors specifically? It seems like these particularly nasty bugs could plausibly be mitigated more easily than avoiding false positives (as you defined them in the arXiv paper’s abstract) in general.
Would you not agree that (assuming there’s an easy way of doing it), separating the system from hyperexistential risk is a good thing for psychological reasons? Even if you think it’s extremely unlikely, I’m not at all comfortable with the thought that our seed AI could screw up & design a successor that implements the opposite of our values; and I suspect there are at least some others who share that anxiety.
For the record, I think that this is also a risk worth worrying about for non-psychological reasons.
You seem to have a somewhat general argument against any solution that involves adding onto the utility function in “What if that added solution was bugged instead?”.
I might’ve failed to make my argument clear: if we designed the utility function as U = V + W (where W is the thing being added on and V refers to human values), this would only stop the sign flipping error if it was U that got flipped. If it were instead V that got flipped (so the AI optimises for U = -V + W), that’d be problematic.
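A toy sketch of the distinction, in case it helps. The outcome names and scores below are made up purely for illustration (they are not from Eliezer's post); the only assumption that matters is that W heavily penalises one arbitrary surrogate outcome and is zero elsewhere.

```python
# Toy outcomes with made-up scores: V = human values, W = surrogate term.
# All names and numbers here are hypothetical, chosen only for illustration.
OUTCOMES = {
    "flourishing": {"V": 10, "W": 0},
    "paperclips": {"V": 0, "W": 0},
    "torture": {"V": -10, "W": 0},
    "surrogate_garbage": {"V": 0, "W": -100},
}


def best_outcome(utility):
    """Return the outcome name that maximises the given utility function."""
    return max(OUTCOMES, key=lambda name: utility(OUTCOMES[name]))


intended = best_outcome(lambda o: o["V"] + o["W"])      # U = V + W
flipped_u = best_outcome(lambda o: -(o["V"] + o["W"]))  # whole U flipped: -U
flipped_v = best_outcome(lambda o: -o["V"] + o["W"])    # only V flipped: -V + W

print(intended, flipped_u, flipped_v)
```

With these numbers, flipping all of U sends the optimiser to the surrogate outcome, whereas flipping only V sends it to the outcome that is worst by V. That is the case the surrogate term does nothing to cover.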
I think it’s better to move on from trying to directly target the sign-flip problem and instead deal with bugs/accidents in general.
I disagree here. Obviously we’d want to mitigate both, but a robust way of preventing sign-flipping type errors specifically is absolutely crucial (if anything, so people stop worrying about it.) It’s much easier to prevent one specific bug from having an effect than trying to deal with all bugs in general.
I see. I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.
I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.
I just raised GPT-2 to indicate that flipping the goal sign suddenly can lead to optimising for bad behavior without the AI neglecting to consider new strategies. Presumably that’d suggest it’s also a possibility with cosmic ray/other errors.
I hadn’t really considered the possibility of a brain-inspired/neuromorphic AI, thanks for the points.
(2) seems interesting; as I understand it, you’re basically suggesting that the error would occur gradually & the system would work to prevent it. Although maybe the AI realises it’s getting positive feedback for bad things and keeps doing them, or something (I don’t really know, I’m also a little sleep deprived and things like this tend to do my head in.) Like, if I hated beer then suddenly started liking it, I’d probably continue drinking it. Maybe the reward signals are simply so strong that the AI can’t resist turning into a “monster”, or whatever. Perhaps the system would implement checksums of some sort to do this automatically?
A similar point to (3) was raised by Dach in another thread, although I’m uncertain about this since GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug. I don’t doubt that it would be different with a neuromorphic system, though.
Mainly for brevity, but also because it seems to involve quite a drastic change in how the reward function/model as a whole functions. So it doesn’t seem particularly likely that it’ll be implemented.
True, but note that he elaborates and comes up with a patch to the patch (namely, having W refer to a class of events that would be expected to happen within the Universe’s expected lifespan rather than one that wouldn’t.) So he still seems to support the basic idea, although he probably intended just to get the ball rolling with the concept rather than conclusively solve the problem.
How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?
Perhaps malware could be another risk factor in the type of bug I described here? Not sure.
I’m still a little dubious of Eliezer’s solution to the problem of separation from hyperexistential risk; if we had U = V + W where V is a reward function & W is some arbitrary thing it wants to minimise (e.g. paperclips), a sign flip in V (due to any of a broad disjunction of causes) would still cause hyperexistential catastrophe.
Or what about the case where, instead of maximising -U, the values that the reward function/model gives for each “thing” are multiplied by −1? E.g. the AI system gets 1 point for wireheading and −1 for torture, and some weird malware/human screw-up (in the reward model or some relevant database), etc. flips the signs for each individual action. The AI now maximises U = W − V.
This seems a lot more nuanced than *just* avoiding cosmic rays; and the potential consequences of a hellish “I have no mouth, and I must scream”-type outcome are far worse than human extinction. I’m not happy with *any* non-negligible probability of this happening.
I see what you’re saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I’ll wait until you’re able to write down your thoughts on this at length; this is something that I’d like to see elaborated on (as well as everything else regarding hyperexistential risk.)
Paperclipping seems to be negative utility, not approximately 0 utility.
My thinking was that an AI system that *only* takes values between 0 and + ∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.
I read Eliezer’s idea, and that strategy seems to be… dangerous. I think that “Giving an AGI a utility function which includes features which are not really relevant to human values” is something we want to avoid unless we absolutely need to.
How come? It doesn’t seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.
I have much more to say on this topic and about the rest of your comment, but it’s definitely too much for a comment chain. I’ll make an actual post containing my thoughts sometime in the next week or two, and link it to you.
Please do! I’d love to see a longer discussion on this type of thing.
EDIT: just thought some more about this and want to clear something up:
Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maxima. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model. I’m highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.
I’m a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:
Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.
So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.
As an almost entirely inapplicable analogy . . . it’s just doing something weird.
If we inverted the utility function . . . tiling the universe with smiley faces, i.e. paperclipping.
Interesting analogy. I can see what you’re saying, and I guess it depends on what specifically gets flipped. I’m unsure about the second example; something like exploring new strategies doesn’t seem like something an AGI would terminally value. It’s instrumental to optimising the reward function/model, but I can’t see it getting flipped *with* the reward function/model.
Can you clarify what you mean by this? Also, I get what you’re going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.
My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren’t any humans), whereas the latter may produce a negligible amount. I’m not really sure if it makes sense tbh.
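Here is the same thought as a tiny sketch. The scores are hypothetical; the assumption doing the work is that a positive utilitarian's V counts only positive human utility, so it is bounded below by 0 and a world with no humans sits exactly at that floor.

```python
# Hypothetical scores for a positive utilitarian: V counts only positive human
# utility, so it can never go below 0. Numbers are made up for illustration.
POSITIVE_V = {
    "flourishing": 10.0,
    "paperclips": 0.0,  # no humans, so no human utility at all
    "torture": 0.01,    # humans exist; assume a negligible amount of positive utility
}

normal_choice = max(POSITIVE_V, key=POSITIVE_V.get)
# A sign-flipped agent maximises -V, i.e. it seeks the minimum of V.
signflipped_choice = max(POSITIVE_V, key=lambda name: -POSITIVE_V[name])

print(normal_choice, signflipped_choice)
```

Under that assumption the sign-flipped agent ends up at paperclips rather than torture. If V were allowed to go negative, the minimum would move and this would no longer hold.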
The reward modelling system would need to be very carefully engineered, definitely.
Even if we engineered it carefully, that doesn’t rule out screw-ups. We need robust failsafe measures *just in case*, imo.
I thought of this as well when I read the post. I’m sure there’s something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.
I wonder if you could feasibly make it a part of the reward model. Perhaps you could train the reward model itself to disvalue something arbitrary (like paperclips) even more than torture, which would hopefully mitigate it. You’d still need to balance it in a way such that the system won’t spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn’t seem too difficult. Although, once again, we can’t really have high confidence (>90%) that the AGI developers are going to think to implement something like this.
There was also an interesting idea I found in a Facebook post about this type of thing that got linked somewhere (can’t remember where). Stuart Armstrong suggested that a utility function could be designed as such:
Let B1 and B2 be excellent, bestest outcomes. Define U(B1)=1, U(B2)=-1, and U=0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes. Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0,1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
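A minimal sketch of the second construction in that quote, assuming a small finite set of worlds with made-up U values in [0, 1]. The world names and the `choose` helper are hypothetical, and this only models a flip of the whole product XU.

```python
from itertools import product

# Hypothetical worlds with U already normalised into [0, 1]; numbers are made up.
WORLD_U = {"best_world": 1.0, "mediocre_world": 0.4, "worst_world": 0.0}
X_VALUES = (-1, 1)  # the trivial feature the agent can set freely


def choose(sign):
    """Pick (world, x) maximising sign * x * U; sign=-1 models a minimiser."""
    return max(
        product(WORLD_U, X_VALUES),
        key=lambda wx: sign * wx[1] * WORLD_U[wx[0]],
    )


maximiser = choose(+1)
minimiser = choose(-1)

print(maximiser, minimiser)
```

Both agents pick `best_world`; the maximiser sets X to 1 and the minimiser sets X to −1, which matches the "same best world, just with a different X value" claim for this toy case.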
Even if we solve any issues with these (and actually bother to implement them), there’s still the risk of an error like this happening in a localised part of the reward function such that *only* the part specifying something bad gets flipped, although I’m a little confused about this one. It could very well be the case that the system’s complex enough that there isn’t just one bit indicating whether “pain” or “suffering” is good or bad. And we’d presumably (hopefully) have checksums and whatever else thrown in. Maybe this could be mitigated by assigning more positive utility to good outcomes than negative utility to bad outcomes? (I’m probably speaking out of my rear end on this one.)
Memory corruption seems to be another issue. Perhaps if we have more than one measure we’d be less vulnerable to memory corruption. Like, if we designed an AGI with a reward model that disvalues two arbitrary things rather than just one, and memory corruption screwed with *both* measures, then something probably just went *very* wrong in the AGI and it probably won’t be able to optimise for suffering anyway.
Thanks for the detailed response. A bit of nitpicking (from someone who doesn’t really know what they’re talking about):
However, the vast majority of these mistakes would probably buff out or result in paper-clipping.
I’m slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be *no* human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at “I have no mouth, and I must scream”. So any sign-flipping error would be expected to land there.
If humans are making changes to the critical software/hardware of an AGI (And we’ll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), *while that AGI is already running*, something bizarre and beyond my abilities of prediction is already happening.
In the example, the AGI was using online machine learning, which, as I understand it, would probably require the system to be hooked up to a database that humans have access to in order for it to learn properly. And I’m unsure as to how easy it’d be for things like checksums to pick up an issue like this (a boolean flag getting flipped) in a database.
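To illustrate why I'm unsure about checksums here, a rough sketch using a CRC32 over a serialised record. The record layout and field names (`label`, `is_good`) are hypothetical and not taken from any real system. A checksum stored at write time catches corruption that happens afterwards, but it says nothing about whether the value written was the right one.

```python
import json
import zlib


def write_record(fields):
    """Store a record alongside a CRC32 of its serialised contents."""
    payload = json.dumps(fields, sort_keys=True).encode()
    return {"payload": payload, "crc": zlib.crc32(payload)}


def is_intact(record):
    """True if the payload still matches the checksum stored at write time."""
    return zlib.crc32(record["payload"]) == record["crc"]


# Case 1: storage corruption after the write (one bit of the payload changes).
stored = write_record({"label": "torture", "is_good": False})
flipped_first_byte = bytes([stored["payload"][0] ^ 0x01])
corrupted = dict(stored, payload=flipped_first_byte + stored["payload"][1:])

# Case 2: a human (or buggy script) writes the wrong flag through the normal
# path, so the checksum is computed over the already-wrong value.
mislabelled = write_record({"label": "torture", "is_good": True})

print(is_intact(stored), is_intact(corrupted), is_intact(mislabelled))
```

The corrupted record fails the check, but the mislabelled one passes, since from the checksum's point of view nothing was corrupted.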
Perhaps there’ll be a reward function/model intentionally designed to disvalue some arbitrary “surrogate” thing in an attempt to separate it from hyperexistential risk. So “pessimizing the target metric” would look more like paperclipping than torture. But I’m unsure as to (1) whether the AGI’s developers would actually bother to implement it, and (2) whether it’d actually work in this sort of scenario.
Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn’t designed to be separated in design space from AM, someone could screw up with the model somehow. If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer’s Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.
It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.
I think this is somewhat likely to be the case, but I’m not sure that I’m confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevented through error-correcting code memory (as you mentioned earlier.)
Despite my confusions, your response has definitely decreased my credence in this sort of thing happening.
I’ve seen that post & discussed it on my shortform. I’m not really sure how effective something like Eliezer’s idea of “surrogate” goals there would actually be—sure, it’d help with some sign flip errors but it seems like it’d fail on others (e.g. if U = V + W, a sign error could occur in V instead of U, in which case that idea might not work.) I’m also unsure as to whether the probability is truly “very tiny” as Eliezer describes it. Human errors seem much more worrying than cosmic rays.
I’m unsure that “extreme” would necessarily get a more robust response, considering that there comes a point where the pain becomes disabling.
It seems as though there might be some sort of biological “limit” insofar as there are limited peripheral nerves, the grey matter can only process so much information, etc., and there’d be a point where the brain is 100% focused on avoiding the pain (meaning there’d be no evolutionary advantage to having the capacity to process additional pain). I’m not really sure where this limit would be, though. And I don’t really know any biology so I’m plausibly completely wrong.