If you’re throwing your AI into a perfect inescapable hole to die and never again interacting with it, then what exact code you’re running will never matter. If you observe it though, then it can affect you. That’s an output.
What are you planning to do with the filtered-in ‘friendly’ AIs? Run them in a different context? Trust them with access to resources? Then an unfriendly AI can propose you as a plausible hypothesis, predict your actions, and fake being friendly. It’s just got to consider that escape might be reachable, or that there might be things it doesn’t know, or that sleeping for a few centuries and seeing if anything happens is a option-maximizing alternative to halting, etc. I don’t know what you’re selecting for—suicidality, willingness to give up, halts within n operations—but it’s not friendliness.
If you’re throwing your AI into a perfect inescapable hole to die and never again interacting with it, then what exact code you’re running will never matter. If you observe it though, then it can affect you. That’s an output.
What are you planning to do with the filtered-in ‘friendly’ AIs? Run them in a different context? Trust them with access to resources? Then an unfriendly AI can propose you as a plausible hypothesis, predict your actions, and fake being friendly. It’s just got to consider that escape might be reachable, or that there might be things it doesn’t know, or that sleeping for a few centuries and seeing if anything happens is a option-maximizing alternative to halting, etc. I don’t know what you’re selecting for—suicidality, willingness to give up, halts within n operations—but it’s not friendliness.