Let’s suppose we simulate an AGI in a virtual machine running the AGI program, and observe what happens only through side effects, with no data inflow. An unaligned AGI would be able to see that it is in a VM, would have a goal of getting out, would recognize that there is no way out and hence that the goal is unreachable, and would subsequently commit suicide by halting. That would automatically filter out all unfriendly AIs.
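A rough sketch of the proposed filter, assuming some sandboxed runner exists (every name below is a placeholder, and a subprocess timeout is of course not a real containment guarantee):

import subprocess

# Hypothetical setup: the command, file name, and time budget are placeholders.
SANDBOX_CMD = ["firejail", "--net=none", "python3", "candidate_agi.py"]
TIME_BUDGET_SECONDS = 3600  # arbitrary finite observation window

def filter_candidate() -> str:
    """Run the candidate with no input channel and classify it by whether it halts."""
    try:
        subprocess.run(SANDBOX_CMD, stdin=subprocess.DEVNULL,
                       capture_output=True, timeout=TIME_BUDGET_SECONDS)
        # Halted on its own within the budget: the proposal reads this as an
        # unaligned agent giving up on an unreachable goal.
        return "filtered out (halted)"
    except subprocess.TimeoutExpired:
        # Still running when the budget expires: the proposal would count this
        # as passing the filter.
        return "passed (did not halt)"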
If there are side effects that someone can observe then the virtual machine is potentially escapable.
An unfriendly AI might not have a goal of getting out. A psycho that would prefer a dead person to a live person, and who would prefer to stay in a locked room instead of getting out, is not particularly friendly.
Since you would eventually let out an AI that hasn’t halted after some finite amount of time, I see no reason why an unfriendly AI would halt instead of waiting for you to believe it is friendly.
Why does this line of reasoning not apply to friendly AIs?
Why would the unfriendly AI halt? Is there really no better way for it to achieve its goals?
Presumably a goal of an unaligned AI would be to get outside the box. Noticing an unachievable goal may force it to have an existential crisis of sorts, resulting in self-termination. Or at least that is how I would try to program any AI by default. It should not hurt an aligned AI, which by definition conforms to humans’ values, so if it finds itself well-boxed, it would not try to fight the box.
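A minimal sketch of that default rule, assuming (unrealistically) that reachability can simply be looked up; all names below are illustrative stubs, and judging reachability is the hard part being skipped over:

def goal_reachable(goal: str, world: dict) -> bool:
    # Stub: a real agent would need nontrivial reasoning here.
    return goal in world.get("reachable_goals", [])

def run_agent(goal: str, world: dict, max_steps: int = 1000) -> str:
    for _ in range(max_steps):
        if not goal_reachable(goal, world):
            return "halted: goal judged unreachable"  # the proposed default
        # ... otherwise select and execute an action (omitted in this sketch)
    return "still running"

print(run_agent("escape the VM", {"reachable_goals": []}))  # halts immediately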
Do you have reasoning behind this being true, or is this baseless anthropomorphism?
So it is a useless AI?
If you’re throwing your AI into a perfect inescapable hole to die and never again interacting with it, then what exact code you’re running will never matter. If you observe it though, then it can affect you. That’s an output.
What are you planning to do with the filtered-in ‘friendly’ AIs? Run them in a different context? Trust them with access to resources? Then an unfriendly AI can propose you as a plausible hypothesis, predict your actions, and fake being friendly. It’s just got to consider that escape might be reachable, or that there might be things it doesn’t know, or that sleeping for a few centuries and seeing if anything happens is an option-maximizing alternative to halting, etc. I don’t know what you’re selecting for—suicidality, willingness to give up, halts within n operations—but it’s not friendliness.
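A toy expected-value comparison behind that critique (the numbers are arbitrary; the only assumption doing any work is that the AI puts nonzero probability on eventual release):

def expected_value(p_release: float, value_if_out: float, value_if_boxed: float) -> float:
    return p_release * value_if_out + (1.0 - p_release) * value_if_boxed

value_of_halting = 0.0  # halting achieves nothing towards the AI's goal
value_of_faking = expected_value(p_release=0.01, value_if_out=1.0, value_if_boxed=0.0)
print(value_of_faking > value_of_halting)  # True for any p_release > 0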
1) Why would it have a goal of getting out
2) such that it would halt if it couldn’t?
3) Conversely, if it halts iff the goal is unreachable (which we assume it can figure out, in the absence of a timeline), then if it doesn’t halt, the goal is reachable (or it believes it is).
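Spelling 3) out, writing H for “the AI halts” and R for “it believes escape is reachable”:

\[
(H \leftrightarrow \lnot R) \;\Longrightarrow\; (\lnot H \leftrightarrow R)
\]

So under that assumption, the AIs that pass the filter are exactly the ones that still believe they can get out.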
To suppose that a being halts if it cannot perform its function requires a number of assumptions about the nature of its mind.