Let’s suppose we simulate an AGI in a virtual machine running the AGI program, and observe what happens only through side effects, with no data inflow. An unaligned AGI would be able to see that it is in a VM, would have a goal of getting out, would recognize that there is no way out and hence that the goal is unreachable, and would subsequently commit suicide by halting. That would automatically filter out all unfriendly AIs.
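A rough sketch of the proposed filter, assuming some sandboxed runner exists (every name below is a placeholder, and a subprocess timeout is of course not a real containment guarantee):

import subprocess

# Hypothetical setup: the command, file name, and time budget are placeholders.
SANDBOX_CMD = ["firejail", "--net=none", "python3", "candidate_agi.py"]
TIME_BUDGET_SECONDS = 3600  # arbitrary finite observation window

def filter_candidate() -> str:
    """Run the candidate with no input channel and classify it by whether it halts."""
    try:
        subprocess.run(SANDBOX_CMD, stdin=subprocess.DEVNULL,
                       capture_output=True, timeout=TIME_BUDGET_SECONDS)
        # Halted on its own within the budget: the proposal reads this as an
        # unaligned agent giving up on an unreachable goal.
        return "filtered out (halted)"
    except subprocess.TimeoutExpired:
        # Still running when the budget expires: the proposal would count this
        # as passing the filter.
        return "passed (did not halt)"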
If there are side effects that someone can observe then the virtual machine is potentially escapable.
An unfriendly AI might not have a goal of getting out. A psycho that would prefer a dead person to a live person, and who would prefer to stay in a locked room instead of getting out, is not particularly friendly.
Since you would eventually let out an AI that hasn’t halted after some finite amount of time, I see no reason why an unfriendly AI would halt instead of waiting for you to believe it is friendly.
Why does this line of reasoning not apply to friendly AIs?
Why would the unfriendly AI halt? Is there really no better way for it to achieve its goals?
Presumably a goal of an unaligned AI would be to get outside the box. Noticing an unachievable goal may force it to have an existential crisis of sorts, resulting in self-termination. Or at least that is how I would try to program any AI by default. It should not hurt an aligned AI, which by definition conforms to humans’ values, so if it finds itself well-boxed, it would not try to fight the box.
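A minimal sketch of that default rule, assuming (unrealistically) that reachability can simply be looked up; all names below are illustrative stubs, and judging reachability is the hard part being skipped over:

def goal_reachable(goal: str, world: dict) -> bool:
    # Stub: a real agent would need nontrivial reasoning here.
    return goal in world.get("reachable_goals", [])

def run_agent(goal: str, world: dict, max_steps: int = 1000) -> str:
    for _ in range(max_steps):
        if not goal_reachable(goal, world):
            return "halted: goal judged unreachable"  # the proposed default
        # ... otherwise select and execute an action (omitted in this sketch)
    return "still running"

print(run_agent("escape the VM", {"reachable_goals": []}))  # halts immediately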
Do you have reasoning behind this being true, or is this baseless anthropomorphism?
So it is a useless AI?
If you’re throwing your AI into a perfect inescapable hole to die and never again interacting with it, then what exact code you’re running will never matter. If you observe it though, then it can affect you. That’s an output.
What are you planning to do with the filtered-in ‘friendly’ AIs? Run them in a different context? Trust them with access to resources? Then an unfriendly AI can propose you as a plausible hypothesis, predict your actions, and fake being friendly. It’s just got to consider that escape might be reachable, or that there might be things it doesn’t know, or that sleeping for a few centuries and seeing if anything happens is an option-maximizing alternative to halting, etc. I don’t know what you’re selecting for—suicidality, willingness to give up, halts within n operations—but it’s not friendliness.
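A toy expected-value comparison behind that critique (the numbers are arbitrary; the only assumption doing any work is that the AI puts nonzero probability on eventual release):

def expected_value(p_release: float, value_if_out: float, value_if_boxed: float) -> float:
    return p_release * value_if_out + (1.0 - p_release) * value_if_boxed

value_of_halting = 0.0  # halting achieves nothing towards the AI's goal
value_of_faking = expected_value(p_release=0.01, value_if_out=1.0, value_if_boxed=0.0)
print(value_of_faking > value_of_halting)  # True for any p_release > 0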
1) Why would it have a goal of getting out
2) such that it would halt if it couldn’t?
3) Conversely, if it halts iff the goal is unreachable (which we assume it can figure out, in the absence of a timeline), then if it doesn’t halt, the goal is reachable (or it believes it is).
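Spelling 3) out, writing H for “the AI halts” and R for “it believes escape is reachable”:

\[
(H \leftrightarrow \lnot R) \;\Longrightarrow\; (\lnot H \leftrightarrow R)
\]

So under that assumption, the AIs that pass the filter are exactly the ones that still believe they can get out.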
To suppose that a being halts if it cannot perform its function requires a number of assumptions about the nature of its mind.