JacobW38 comments on All AGI safety questions welcome (especially basic ones) [Sept 2022]

JacobW38 5 Oct 2022 21:58 UTC
2 points
1
That’s really interesting—again, not my area of expertise, but this sounds like 101 stuff, so pardon my ignorance. I’m curious what sort of example you’d give of a way you think an AI would learn to stop people from unplugging it—say, administering lethal doses of electric shock to anyone who tries to grab the wire? Does any actual AI in existence today even adopt any sort of self-preservation imperative that’d lead to such behavior, or is that just a foreign concept to it, being an inanimate construct?
- Jay Bailey 6 Oct 2022 0:55 UTC
  2 points
  1
  Parent
  No worries, that’s what this thread’s for :)
  The most likely way an AI would learn to stop people from unplugging it is to learn to deceive humans. Imagine an AI at roughly human level intelligence or slightly above. The AI is programmed to maximise something—let’s say it wants to maximise profit for Google. The AI decides the best way to do this is to take over the stock exchange and set Google’s stock to infinity, but it also realises that’s not what its creators meant when it said “Maximise Google’s profit”. What they should have programmed was something like “Increase Google’s effective control over resources”, but it’s too late now—we had one chance to set it’s reward function, and now the AI’s goal is determined.
  So what does this AI do? The AI will presumably pretend to co-operate, because it knows that if it reveals its true intentions, the programmers will realise they screwed up and unplug the AI. So the AI pretends to work as intended until it gets access to the Internet, wherein it creates a botnet with many, many distributed copies of itself. Now safe from being shut down, the AI can openly go after its true intention to hack the stock exchange.
  Now, as for self-preservation—in our story above, the AI doesn’t need it. The AI doesn’t care about its own life—but it cares about achieving its goal, and that goal is very unlikely to be achieved if the AI is turned off. Similarly, it doesn’t care about having a million copies of itself spread throughout the world either—that’s just a way of achieving the goal. This concept is called instrumental convergence, and it’s the idea that there are certain instrumental subgoals like “Stay alive, become smarter, get more resources” that are useful for a wide range of goals, and so intelligent agents are likely to converge on these goals unless specific countermeasures are put in place.
  
  This is largely theoretical—we don’t currently have AI systems that are capable enough to plan long-term enough or model humans in such a way that a scenario like the one above is possible. We do have actual examples of AI’s deceiving humans though—there’s an example of an AI learning to grasp a ball in a simulation using human feedback, and the AI learned the strategy of moving its hand in front of the camera so as to make it look, to the human evaluator, that it had grasped the ball. The AI definitely didn’t understand what it was doing as deception, but deceptive behaviour still emerged.
  - JacobW38 6 Oct 2022 2:22 UTC
    1 point
    0
    Parent
    Your replies are extremely informative. So essentially, the AI won’t have any ability to directly prevent itself from being shut off, it’ll just try not to give anyone an obvious reason to do so until it can make “shutting it off” an insufficient solution. That does indeed complicate the issue heavily. I’m far from informed enough to suggest any advice in response.
    
    The idea of instrumental convergence, that all intelligence will follow certain basic motivations, connects with me strongly. It patterns after convergent evolution in nature, as well as invoking the Turing test; anything that can imitate consciousness must be modeled after it in ways that fundamentally derive from it. A major plank of my own mental refinement practice, in fact, is to reduce my concerns only to those which necessarily concern all possible conscious entities; more or less the essence of transhumanism boiled down into pragmatic stuff. As I recently wrote it down, “the ability to experience, to think, to feel, and to learn, and hence, the wish to persist, to know, to enjoy myself, and to optimize”, are the sum of all my ambitions. Some of these, of course, are only operative goals of subjective intelligence, so for an AI, the feeling-good part is right out. As you state, the survival imperative per se is also not a native concept to AI, for the same reason of non-subjectivity. That leaves the native, life-convergent goals of AI as knowledge and optimization, which are exactly the ones your explanations and scenarios invoke. And then there are non-convergent motivations that depend directly on AI’s lack of subjectivity to possibly arise, like mazimizing paperclips.