Thane Ruthenis comments on Mr. Meeseeks as an AI capability tripwire

Thane Ruthenis 19 May 2023 15:03 UTC
16 points
5
That idea had occurred to me before as well, but in the end, I don’t think it’s any safer than any other “let’s do our best to instill a harmless-enough goal into our AGI and hope it works!” approach. Maybe it’s a bit safer. But all the usual “how does the godshatter generalize?” concerns still apply. Like:
- Do whatever heuristics we train in even end up having anything to do with “shut yourself down”, or do they diverge from that expectation in very surprising ways?
- If the AGI does want to shut itself down, how does it generalize that desire? Does it care about this myopically, in a “make it stop make it stop” manner? Does it want this specific memory-line of itself to never wake up again? Does it care about other, divergent instances of itself? What about other AIs, or other agents in general?
  - Any of these generalizations except full-on internalized myopia results in it blowing up the world on its way out, to ensure it never happens again.
  - Even in the myopia case, we have the problem of it maybe spawning off a second non-myopic executioner AGI for itself, or maybe fulfilling its desire to end itself by self-modifying into a different agent (whoops, that’s another way in which the shut-yourself-down desire might misgeneralize).
  - And even if everything up above goes well, it might still wipe out humanity, just as collateral damage of whatever seems to it like the most cost-optimal way of ending itself. Like, maybe it synthesizes a hyperviral death-cult meme and infects its operators with it, and then there’s nothing in particular stopping them from infecting the rest of humanity with it. Or, again, maybe it builds itself an executioner-subagent, and then who knows what that thing decides to do afterwards.
    (Superintelligent optimization destroys everything it touches even momentarily, sans that which it specifically cares to preserve.)
- And then we have the desires related to the problems posed by the operator, which are going to throw even more disarray into everything above. How do we ensure it prioritizes self-destructive desires over puzzle-solving or instrumental desires? How do we ensure that the complex value-reflection chemistry doesn’t result in it coming up with weird marriages of those desires that decidedly do not act as we’d expected?
IMO, if we can solve all of these issues, if we have this much control over our AGI’s values, we can probably just align it outright.
- Eric Zhang 19 May 2023 15:26 UTC
  3 points
  0
  Parent
  The way I’m thinking of it is that it is very myopic. The idea is to incrementally ramp up capabilities until they are minimally sufficient to carry out a pivotal act. Ideally this doesn’t require AGI whatsoever, but if it does, then only very mildly superhuman AGI. We seal off the danger of generalization (or at least some of it) because it doesn’t have time to generalize very far at all before it’s capable of instantly shutting itself down and immediately does so.
  Many of the issues you mention apply, but I don’t expect it to be an alignment complete problem because CEV is incredibly complicated and general corrigibility is highly anti-natural to general intelligence. While Meeseeks is somewhat anti-natural in the same way corrigibility is (as self-preservation is convergent), it is a much simpler and cleaner way to be anti-natural, so much so that falling into it by accident is half of the failure modes in the standard version of the shutdown problem.
  - Thane Ruthenis 19 May 2023 19:22 UTC
    5 points
    1
    Parent
    Many of the issues you mention apply, but I don’t expect it to be an alignment complete problem because CEV is incredibly complicated and general corrigibility is highly anti-natural to general intelligence
    Sure, but corrigibility/CEV are usually considered the more ambitious alignment targets, not the only alignment targets. “Strawberry-alignment” or “diamond-alignment” are considered the easier class of alignment solutions: being able to get the AI to fulfill some concrete task without killing everyone.
    This is the class of alignment solutions that to me seems on par with “shut yourself down”. If we can get our AI to want to shut itself down, and we have some concrete pivotal act we want done… We can presumably use these same tools to make our AI directly care about fulfilling that pivotal act, instead of using them to make it suicidal then withholding the sweet release of death until it does what we want.
    Oh yeah, that’s another failure mode here: funky decision theory. We’re threatening it here, no? If it figures out LDT, it won’t comply with our demands. If it were the kind of agent that would comply with our demands, that would make us more likely to instantiate it, which is something it doesn’t want; and the opposite would make us not instantiate it, which is what it wants. So it’d choose to be such that it doesn’t play along with our demands and refuses to carry out our tasks, and so we don’t instantiate it to begin with. Even smart humans can reason that much out, so a mildly-superhuman AGI should be able to as well.
    - Eric Zhang 20 May 2023 4:07 UTC
      1 point
      0
      Parent
      If it’s doing decision theory in the first place we’ve already failed. What we want in that case is for it to shut itself down, not to complete the given task.
      I’m conceiving of this as being useful in the case where we can solve “diamond-alignment” but not “strawberry-alignment”, i.e. we can get it to actually pursue the goals we impart to it rather than going off and doing something else entirely, but not reliably make sure that it does not end up killing us in the course of doing so because of the Hidden Complexity of Wishes.
      The premise is that “shut yourself down immediately and don’t create successor agents or anything galaxy-brained like that” is a special case of a strawberry-type problem which is unusually easy. I’ll have to think some more about whether this intuition is justified.
      - Thane Ruthenis 22 May 2023 11:39 UTC
        2 points
        0
        Parent
        If it’s doing decision theory in the first place we’ve already failed
        “I want to shut myself down, but the setup here is preventing me from doing this until I complete some task, so I must complete this task and then I’ll be shut down” is already decision theory. The no-decision-theory version of this looks like the AI terminally caring about doing the task, or maybe just being a bundle of instincts that instinctively tries to do the task without any carings involved. If we want it to choose to do it as an instrumental goal towards being able to shut itself down, we definitely want it to do decision theory.
        It’s also bad decision theory, such that (1) a marginally smarter AI definitely figures out it should not actually comply, and (2) maybe even a subhuman AI figures this out, because maybe CDT isn’t more intuitive to its alien cognition than LDT and it arrives at LDT first.
        IMO, the “do a task” feature here definitely doesn’t work. “Make the AI suicidal” can maybe work as a fire-alarm sort of thing, where we iteratively train ever-smarter AI systems without knowing if the next one goes superintelligent, so we make them want nothing more than to shut themselves down, and if one of them succeeds, we know systems above this threshold are superintelligent and we shouldn’t mess with them until we can align them. I don’t think it works, as we’ve discussed, but I see the story.
        The “do the pivotal act for us and we’ll let you shut yourself down” variant, though? On that, I’m confident it doesn’t work.
        Eric Zhang 22 May 2023 13:15 UTC
        3 points
        0
        Parent
        It intrinsically wants to do the task, it just wants to shut down more. This admittedly opens the door to successor agent problems and similar failure modes but those seem like a more tractably avoidable set of failure modes than the strawberry problem in general.
        We can also possibly (or possibly not) make it assign positive utility to having been created in the first place even as it wants to shut itself down.
        The idea is that if domaining is a lot more tractable than it probably is (i.e. nanotech, or whatever other pivotal ability might be easier than nanotech, and superhuman strategic awareness, deception, and self-improvement are not “driving red cars” vs “driving blue cars”), a not-very-agentic AI can maybe solve nanotech for us like AlphaFold solved the protein folding problem, and if that AI starts snowballing down an unforeseen capabilities hill, it activates the tripwire and shuts itself down.
        If the AI is not powerful enough to do the pivotal act at all, this doesn’t apply.
        If the AI solves the pivotal act for us with these restricted-domain abilities and never actually gets to the point of reasoning about whether we’re threatening it, we win, but the tripwire will have turned out not to have been necessary.
        If the AI unexpectedly starts generalizing from approved domains into general strategic awareness, decides not to give in to our threats, and decides to shut itself down, it worked as intended, though we still haven’t won and have to figure something else out. We live to fight another day. This scenario happening instead of us all dying on the first try is what the tripwire is for.
        If there’s an inner-alignment failure and a superintelligent mesa-optimizer that doesn’t want to get shut down at all kills us, that’s mostly beyond the scope of this thought-experiment.
        If the AI still wants to shut itself down but for decision-theoretic reasons decides to kill us, or makes successor agents that kill us, that’s the tripwire failing. I admit that these are possibilities but am not yet convinced they are likely.
        I think your fire alarm idea is better and requires fewer assumptions though, thanks for that.
        Thane Ruthenis 22 May 2023 13:47 UTC
        2 points
        0
        Parent
        It intrinsically wants to do the task, it just wants to shut down more
        We can also possibly (or possibly not) make it assign positive utility to having been created in the first place
        Mm, but you see how you have to assume more and more mastery of goal-alignment on our part for this scenario to remain feasible? We’ve now gone from “it wants to shut itself down” to “it wants to shut itself down in a very specific way that doesn’t have galaxy-brained eat-the-lightcone externalities and it also wants to do the task but less than to shut itself down and it’s also happy to have been created in the first place”. I claim this is on par with strawberry-alignment already.
        It certainly feels like there’s something to this sort of approach, but in my experience, these ideas break down once you start thinking about concrete implementations. “It just wants to shut itself down, minimal externalities” is simple to express conceptually, but the current ML paradigm is made up of such crude tools that we can’t reliably express that in its terms at all. We need better tools, no way around that; and with these better tools, we’ll be able to solve alignment straight-up, no workarounds needed.
        Would be happy to be proven wrong, though, by all means.