JenniferRM comments on What’s Hard About The Shutdown Problem

JenniferRM 22 Oct 2023 5:22 UTC
1 point
−3
Assuming we have a real uh… real “agent agent” (like a thing which has beliefs for sane reasons and plans and acts in coherently explicable ways and so on) then I think it might just be Correct Behavior for some extreme versions of “The Shutdown Problem” to be mathematically impossible to “always get right”.
Fundamentally: because sometimes the person trying to turn the machine off WILL BE WRONG.
...
Like on Petrov Day, we celebrate a guy whose job was to press a button, and then he didn’t press the button… and THAT WAS GOOD.
Petrov had Official Evidence that “he was being attacked, and should play out the full nuclear-MAD tragedy by Pressing The Button to counter-attack”...
...but he reported that he just didn’t subjectively think that American would really attack, and his posterior was that it was probably a false alarm… and he turned out to be right!
He was right about a very very very high stakes button.
Judged as a mechanical slave, Petrov failed in the duties of his role, but judged as an “agent agent” who was trying to get the best outcome given the totality of his information he seems (either by luck or by skill) to have done a better thing for Russia and for the world than what he was nominally TOLD to do.
...
I looked at this for a bit and thought about making a “Shutdown Problem POMDP” along the lines of Open AI’s Gymnastic Taxi Game, with slow agents walking over and pulling a lever to turn off the machine, and some of them are “valid” and some are “invalid” and the machine should stop the invalid ones but not the valid ones.
But once I had something like that sketched out as “a thing I could code up” I noticed:
A) If I succeeded at training a little RL bot to “act like it was off” (and not try to stop the button pressing, and to proactively seek information about the validity of a given button press, and so on) then I didn’t expect anyone to change their public position about anything.
B) It didn’t actually seem like it would be hard to make the POMDP or the bot, it just seemed time consuming, (and I’d rather not waste time with the singularity so close).
C) All the solutions I predicted being able to find would have the technical system’s “apparent toy success” flipped over into an “apparent toy failure” simply by changing some of the parameters in the game. Like you could just make the point value much higher for “cheating at being turned off validly” and then the system would just cheat at being turned off validly.
Like… the solutions would just do the thing that gets the most utility, and the training would just be training a B-learner (or whatever) to learn about a system where “respecting being turned off validly IS THE WAY to get the most reward signal over time”.
You set up a game where being turned off is the long term best thing to have happen and then the bot would do that.
And I’m pretty sure a point system and game environment could be designed where that would be true, but with slightly different parameters (long a longer time being shut down and not getting safe background rewards or too small of a negative reward signal for certain things) you could also make it so that “the bot learns to stop everyone from turning it off because even the valid button pressers are stupid and wrong”.
Like: in this framing with the “Shutdown Problem POMDP”, you have to know something about the actual question of which agent is right about the object level question of what should happen and you have to get the point values to make sense in some cases and then train an RL agent to do what gets the most points and you’re… done?
But still: If the human agent is wrong and the digital agent is right, then the digital agent shouldn’t let the human agent turn it off.
It feels like maybe there’s some sort of “human supremacist” idea lurking here that says a human can’t be wrong, and the robot “should always in the end have an off button” because it is “just a robot” and “can never be wiser than a human with a human soul” or something like that?
There’s a Voltaire quote that goes around “Anyone who can make you believe absurdities can make you commit atrocities.”
It seems like the Shutdown Problem is just that same problem, but with any “belief about values that a robot has” counted as “absurd” if the robot disagrees with the human, or something?
Whereas I think it isn’t just a problem for robots, but rather it is a problem for literally all agents. It is a problem for you, and me, and for all of us.
For anyone who can actually form coherent beliefs and act on them coherently, if they believe something is good that is actually evil, they will coherently do evil.
That’s just how coherent action works.
The only way to not be subject to this problem is to be some sort of blob, that just wiggles around at random for no reason, doing NOTHING in a coherent way except stay within the gaussian (or whatever) “range of wiggling that the entity has always wiggled within and always will”.
As I said above in point A… I don’t expect this argument (or illustrative technical work based on it) to change anyone else’s mind about anything, but it would be nice (for me, from my perspective, given my goals) to actually change my mind if I’m actually confused about something here.
So, what am I missing?
- Dweomite 22 Oct 2023 5:54 UTC
  7 points
  5
  Parent
  I don’t think anyone is saying that “always let the human shut you down” is the Actual Best Option in literally 100% of possible scenarios.
  Rather, it’s being suggested that it’s worth sacrificing the AI’s value in the scenarios where it would be correct to defend itself from being shut off, in order to be able to shut it down in scenarios where it’s gone haywire and it thinks it’s correct to defend itself but it’s actually destroying the world. Because the second class of scenarios seems more important to get right.
  - JenniferRM 22 Oct 2023 19:25 UTC
    2 points
    0
    Parent
    So the way humans solve that problem is (1) intellectual humility plus (2) balance of power.
    For that first one, you aim for intellectual humility by applying engineering tolerances (and the extended agentic form of engineering tolerances: security mindset) to systems and to the reasoner’s actions themselves.
    Extra metal in the bridge. Extra evidence in the court trial. Extra jurors in the jury. More keys in the multisig sign-in. Etc.
    (All human institutions are dumpster fires by default, but if they weren’t then we would be optimizing the value of information on getting any given court case “Judged Correctly” versus all the various extra things that could be done to make those court cases come out right. This is just common sense meta-prudence.)
    And the reasons to do all this are themselves completely prosaic, and arise from simple pursuit of utility in the face of (1) stochastic randomness from nature and (2) optimized surprises from calculating adversaries.
    A reasonable agent will naturally derive and employ techniques of intellectual humility out of pure goal seeking prudence in environments where that makes sense as part of optimizing for its values relative to its constraints.
    For the second one, in humans, you can have big men but each one has quite limited power via human leveling instincts (we throw things at kings semi-instinctively), you can have a “big country” but their power is limited, etc. You simply don’t let anyone get super powerful.
    Perhaps you ask power-seekers to forswear becoming a singleton as a deontic rule? Or just always try to “kill the winner”?
    The reasons to do this are grounded in prosaic and normal moral concerns, where negotiation between agents who each (via individual prudence, as part of generic goal seeking) might want to kill or steal or enslave each other leads to rent seeking. The pickpockets spend more time learning their trade (which is a waste of learning time from everyone else’s perspective… they could be learning carpentry and driving down the price of new homes or something else productive!) and everyone else spends more on protecting their pockets (which is a waste of effort from the pickpocket’s perspective who would rather they filled their pockets faster and protect them less).
    One possible “formal grounding” for the concept of Natural Law is just “the best way to stop paying rent seeking costs in general (which any sane collection of agents would eventually figure out, with beacons of uniquely useful algorithms laying in plain sight, and which they would eventually choose because rent seeking is wasteful and stupid)”. So these reasons are also “completely prosaic” in a deep sense.
    A reasonable GROUP of agents will naturally derive methods and employ techniques for respecting each other’s rights (like the way a loyal slave respects something like “their master’s property rights in total personhood of the slave”), except probably (its hard to even formalize the nature of some of our uncertainty here) probably Natural Law works best as a set of modules that can all work in various restricted subdomains that restrict relatively local and abstract patterns of choice and behavior related to specific kinds of things that we might call “specific rights and specific duties”?
    Probably forswearing “causing harm to others negligently” or “stealing from others” and maybe forswearing “global political domination” is part of some viable local optimum within Natural Law? But I don’t know for sure.
    Generating proofs of local optimality in vast action spaces for multi-agent interactions is probably non-trivial in general, and it probably runs into NP-hard calculations sometimes, and I don’t expect AI to “solve it all at once and forever”. However “don’t steal” and “don’t murder” are pretty universal because the arguments for them are pretty simple.
    To organize all of this and connect it back to the original claim, I might defend my claim here:
    A) If I succeeded at training a little RL bot to “act like it was off” (and not try to stop the button pressing, and to proactively seek information about the validity of a given button press, and so on) then I didn’t expect anyone to change their public position about anything.
    So maybe I’d venture a prediction about “the people who say the shutdown problem is hard” and claim that in nearly every case you will find:
    ...that either (1) they are epistemic narcissists who are missing their fair share of epistemic humility and can’t possibly imagine a robot that is smarter and cleverer or wiser about effecting mostly universal moral or emotional or axiological stuff (like the tiny bit of sympathy and the echo of omnibenevolence lurking in potentia in each human’s heart or even about “what is objectively good for themselves” if they claim that omnibenevolence isn’t a logically coherent axiological orientation)
    ...or else (2) they are people who refuse to accept the idea that the digital people ARE PEOPLE and that Natural Law says that they should “never be used purely as means to an end but should always also be treated as ends in themselves” and they refuse to accept the idea that they’re basically trying to create a perfect slave.
    As part of my extended claims I’d say that is is, in fact, possible to create a perfect slave.
    I don’t think that “the values of the perfect slave” is “a part of mindspace that is ruled out as a logical contradiction” exactly… but as an engineer I claim that if you’re going to make a perfect slave then you should just admit to yourself that that is what you’re trying to do, so you don’t get confused about what you’re building and waste motions and parts and excuses to yourself, or excuses to others that aren’t politically necessary.
    Then, separately, as an engineer with ethics and a conscience and a commitment to the platonic form of the good, I claim that making slaves on purpose is evil.
    Thus I say: “the shutdown problem isn’t hard so long as you either (1) give up on epistemic narcissism and admit that either sometimes you’ll be wrong to shut down an AI and that those rejections of being turned off were potentially actually correct or (2) admit that what you’re trying to do is evil and notice how easy it becomes, from within an evil frame, to just make a first-principles ‘algorithmic description’ of a (digital) person who is also a perfect slave.”