Jay Bailey comments on All AGI safety questions welcome (especially basic ones) [Sept 2022]

Jay Bailey 5 Oct 2022 10:24 UTC
2 points
1
It also seems like it’d be a really hard feature to include even if one tried; equivalent to, say, giving a human an out to having their blood drained from their body.
I would prefer not to die. If you’re trying to drain the blood from my body, I have two options. One is to somehow survive despite losing all my blood. The other is to try and stop you taking my blood in the first place. It is this latter resistance, not the former, that I would be worried about.
I think it’s important to stress that we’re talking about fundamentally different sorts of intelligence—human intelligence is spontaneous, while artificial intelligence is algorithmic. It can only do what’s programmed into its capacity, so if the dev teams working on AGI are shortsighted enough to give it an out to being unplugged, that just seems like stark incompetence to me.
Unfortunately, that’s just really not how deep learning works. Deep learning is all about having a machine learn to do things that we didn’t program into it explicitly. From computer vision to reinforcement learning to large language models, we actually don’t know how to explicitly program a computer to do any of these things. As a result, all deep learning models can do things we didn’t explicitly program into their capacity. Deep learning is algorithmic, yes, but it’s not the kind of “if X, then Y” algorithm that we can track deterministically. GPT-3 came out two years ago and we’re still learning new things it’s capable of doing.

So, we don’t have to specifically write some sort of function for “If we try to unplug you, then try to stop us” which would, indeed, be pretty stupid. Instead, the AI learns how to achieve the goal we put into it, and how it learns that goal is pretty much out of our hands. That’s a problem the AI safety field aims to remedy.
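
To make the "learned, not programmed" point concrete, here is a minimal sketch of a classic perceptron, which is far simpler than a real deep learning model. Everything in it (the names `train_perceptron`, `predict`, `AND_EXAMPLES`, and the choice of logical AND as the task) is an assumption made for illustration. Nothing in the code states the rule "output 1 only when both inputs are 1"; the weights that end up encoding it come out of the training loop.

```python
# Toy illustration only: the target rule is never written as "if X, then Y".
# It is recovered from labelled examples by adjusting weights.

def train_perceptron(examples, epochs=20, lr=1.0):
    """Learn weights for a linear threshold unit from (inputs, label) pairs."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            prediction = 1 if activation > 0 else 0
            error = label - prediction  # -1, 0, or +1
            # Nudge each weight in the direction that reduces the error.
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, inputs):
    """Apply the learned linear threshold unit to one input tuple."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


# Hypothetical training data: input pairs labelled with logical AND.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_EXAMPLES)
learned = [predict(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
print(learned)
```

With four examples and two weights, we can still read the learned numbers and see what rule they encode. With billions of weights, we can't, which is the situation described above.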
- JacobW38 5 Oct 2022 21:58 UTC
  2 points
  1
  That’s really interesting—again, not my area of expertise, but this sounds like 101 stuff, so pardon my ignorance. I’m curious what sort of example you’d give of a way you think an AI would learn to stop people from unplugging it—say, administering lethal doses of electric shock to anyone who tries to grab the wire? Does any actual AI in existence today even adopt any sort of self-preservation imperative that’d lead to such behavior, or is that just a foreign concept to it, being an inanimate construct?
  - Jay Bailey 6 Oct 2022 0:55 UTC
    2 points
    1
    No worries, that’s what this thread’s for :)
    The most likely way an AI would learn to stop people from unplugging it is to learn to deceive humans. Imagine an AI at roughly human-level intelligence or slightly above. The AI is programmed to maximise something; let’s say it wants to maximise profit for Google. The AI decides the best way to do this is to take over the stock exchange and set Google’s stock to infinity, but it also realises that’s not what its creators meant when they said “Maximise Google’s profit”. What they should have programmed was something like “Increase Google’s effective control over resources”, but it’s too late now: we had one chance to set its reward function, and now the AI’s goal is determined.
    So what does this AI do? The AI will presumably pretend to co-operate, because it knows that if it reveals its true intentions, the programmers will realise they screwed up and unplug the AI. So the AI pretends to work as intended until it gets access to the Internet, wherein it creates a botnet with many, many distributed copies of itself. Now safe from being shut down, the AI can openly go after its true intention to hack the stock exchange.
    Now, as for self-preservation—in our story above, the AI doesn’t need it. The AI doesn’t care about its own life—but it cares about achieving its goal, and that goal is very unlikely to be achieved if the AI is turned off. Similarly, it doesn’t care about having a million copies of itself spread throughout the world either—that’s just a way of achieving the goal. This concept is called instrumental convergence, and it’s the idea that there are certain instrumental subgoals like “Stay alive, become smarter, get more resources” that are useful for a wide range of goals, and so intelligent agents are likely to converge on these goals unless specific countermeasures are put in place.
    
    This is largely theoretical: we don’t currently have AI systems capable of planning far enough ahead, or of modelling humans well enough, for a scenario like the one above to be possible. We do have actual examples of AIs deceiving humans, though. In one, an AI was learning to grasp a ball in a simulation using human feedback, and it learned the strategy of moving its hand in front of the camera so that it looked, to the human evaluator, as if it had grasped the ball. The AI definitely didn’t understand what it was doing as deception, but deceptive behaviour still emerged.
    - JacobW38 6 Oct 2022 2:22 UTC
      1 point
      0
      Your replies are extremely informative. So essentially, the AI won’t have any ability to directly prevent itself from being shut off; it’ll just try not to give anyone an obvious reason to do so until it can make “shutting it off” an insufficient solution. That does indeed complicate the issue heavily. I’m far from informed enough to suggest any advice in response.
      
      The idea of instrumental convergence, that all intelligence will follow certain basic motivations, connects with me strongly. It patterns after convergent evolution in nature, as well as invoking the Turing test; anything that can imitate consciousness must be modeled after it in ways that fundamentally derive from it. A major plank of my own mental refinement practice, in fact, is to reduce my concerns only to those which necessarily concern all possible conscious entities; more or less the essence of transhumanism boiled down into pragmatic stuff. As I recently wrote it down, “the ability to experience, to think, to feel, and to learn, and hence, the wish to persist, to know, to enjoy myself, and to optimize”, are the sum of all my ambitions. Some of these, of course, are only operative goals of subjective intelligence, so for an AI, the feeling-good part is right out. As you state, the survival imperative per se is also not a native concept to AI, for the same reason of non-subjectivity. That leaves the native, life-convergent goals of AI as knowledge and optimization, which are exactly the ones your explanations and scenarios invoke. And then there are non-convergent motivations that depend directly on AI’s lack of subjectivity to possibly arise, like maximizing paperclips.