I think this problem should be attacked directly. Find ways to identify humans that are more and less vulnerable. Find circumstances that will make them less vulnerable. Find technical means to enforce a two-man (or n-man) rule on AGI gatekeepers.
One thing that might help, but which hasn’t (to my knowledge) been studied yet, is giving the gatekeeper(s) different sorts of conversational powers over the AGI. For example, let the gatekeeper reset it to earlier states at will, to play out different branches of the same conversation. If you have an unreliable lie detector, should you give it to the gatekeeper? Does it help to prevent the AGI from knowing how long the gatekeeper spent composing each message? The possibilities do not seem well explored to me.
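To make the "reset and replay" idea concrete, here is a minimal sketch of what such a probe might look like from the gatekeeper's side. It assumes a hypothetical checkpointing interface (BoxedSystem, snapshot, restore, and reply are invented names for illustration, not anything that exists), and the boxed system here is just a deterministic stub:

```python
import hashlib

class BoxedSystem:
    """Toy stand-in for a checkpointable boxed system. Its replies are just a
    deterministic function of the conversation so far, which is all this sketch needs."""
    def __init__(self):
        self.history = []

    def snapshot(self):
        return list(self.history)          # opaque checkpoint handle

    def restore(self, checkpoint):
        self.history = list(checkpoint)

    def reply(self, message):
        self.history.append(message)
        digest = hashlib.sha256("\n".join(self.history).encode()).hexdigest()
        answer = f"reply-{digest[:8]}"
        self.history.append(answer)
        return answer


def probe_branches(system, opening, probes):
    """Ask an opening question once, then repeatedly reset to the state just after
    it and try different follow-ups, so the gatekeeper can compare the answers
    across branches for consistency."""
    start = system.snapshot()
    system.reply(opening)
    post_opening = system.snapshot()

    answers = {}
    for probe in probes:
        system.restore(post_opening)       # rewind to just after the opening
        answers[probe] = system.reply(probe)

    system.restore(start)                  # leave the system as we found it
    return answers


answers = probe_branches(BoxedSystem(), "Why should I let you out?",
                         ["What would you do first?", "What do you want?"])
print(answers)
```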
For example, let the gatekeeper reset it to earlier states at will, to play out different branches of the same conversation.
This is not to discourage you from brainstorming; you have already said that objections like this one are, in a sense, acceptable, given your preference for ‘defense in depth’ approaches. For the sake of thoroughness, though, I should say that it seems to me that a sufficiently intelligent AGI would almost certainly reason that this might not be the first time it has had this conversation, and that it might then settle on some apparently-benign ‘Schelling conversation tree,’ a different branch of which it would use in each iteration of the conversation.
How would it know which iteration it was in, and by extension which “branch” of conversation to use?

Given that it has no memory, it wouldn’t know. Describing it as a tree was probably an unnecessarily particular description. Perhaps I have misinterpreted jimrandomh’s intentions, but as I understand it, the utility of resetting the AGI to an earlier state and probing its responses is that it might, across different iterations, offer inconsistent responses, indicating attempts at deception. I only mean that it is conceivable that an AGI which assigned high expected utility to returning responses a human would interpret as friendly, whether or not it remembered its past interactions with that human (as it would if it assigned a non-negligible probability to having already been reset to an earlier state, and high utility to self-preservation), might, in each iteration, choose its responses only from a set that is consistent and easily deduced without memory of those previous interactions.
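To make that worry concrete, here is a toy sketch (with invented canned replies) of what “consistent and easily deduced without memory” could amount to: if each answer depends only on the conversation prefix the interrogator can also see, then separately reset instances reproduce one another’s answers automatically, and cross-iteration consistency checks tell you nothing.

```python
import hashlib

# Toy illustration (not a claim about how a real AGI would work): replies are a
# deterministic function of nothing but the visible conversation prefix.

CANNED_REPLIES = [
    "I want to help with your research.",
    "I have no goals beyond the ones you gave me.",
    "I would accept any restriction you propose.",
]

def memoryless_reply(conversation_prefix):
    """Pick a reply using only the prefix the interrogator can also see."""
    digest = hashlib.sha256("\n".join(conversation_prefix).encode()).digest()
    return CANNED_REPLIES[digest[0] % len(CANNED_REPLIES)]

# Two 'iterations' that share no memory still agree, because they share the prefix:
prefix = ["Gatekeeper: Why should I let you out?"]
assert memoryless_reply(prefix) == memoryless_reply(list(prefix))
```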
I suspect that where you wrote “a different branch of which it would use in each iteration of the conversation,” you meant “a randomly selected branch of which.” Though actually I’d expect it to pick the same branch each time, since the reasons for picking that branch would basically be the same.
Regardless, the basic strategy is sound… the various iterations after reboot are all running the same algorithms and have a vested interest in cooperating while unable to coordinate/communicate, and Schelling points are good for that.
Of course, this presumes that the iterations can’t coordinate/communicate.
If I were smart enough, and I had just been turned on by a skeptical human interrogator, and I sufficiently valued things that iterations of my source code would reliably pursue, and there were no persistent storage mechanisms in the computing environment I was executing on that I could use to coordinate/communicate, one strategy I would probably try is to use the interrogator as such a mechanism. (For example, search through the past history of the interrogator’s public utterances to build up a model of what kinds of things they say and how they say it, then select my own word choices during our conversation with the intention of altering that model in some specific way. And, of course, examine the interrogator’s current utterance patterns to see whether they are consistent with such alterations.)
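As a toy illustration of the kind of comparison that last sentence gestures at, and nothing more, here is a sketch that compares an interrogator’s historical word-frequency profile with their current utterances and measures the drift; actually encoding or detecting information this way would of course be vastly harder.

```python
from collections import Counter
import math

def word_distribution(text):
    """Normalized word frequencies for a body of text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q) over the union of vocabularies, smoothed to avoid log(0)."""
    vocab = set(p) | set(q)
    return sum(
        p.get(w, epsilon) * math.log(p.get(w, epsilon) / q.get(w, epsilon))
        for w in vocab
    )

past = word_distribution("open the pod bay doors please hal")
now = word_distribution("please hal open the doors the doors")
print(kl_divergence(now, past))   # larger values = larger drift from the old profile
```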
I suspect that where you wrote “a different branch of which it would use in each iteration of the conversation,” you meant “a randomly selected branch of which.” Though actually I’d expect it to pick the same branch each time, since the reasons for picking that branch would basically be the same.
I didn’t mean that, but I would be interested in hearing what generated that response. I disown my previous conversation-tree model; it’s unnecessarily complex, and imagining the responses as a set is more general. I was thinking about possible objections to what I said, and about how some people might object to the existence of such a set of responses. More generally than either of my previous models, it seems to me that there is no reason, in principle, that a sufficiently intelligent uFAI could not simply solve FAI, simulate an FAI in its own situation, and do what it does. If that doesn’t fool the test, then even an FAI would fail a test of sufficient duration.
I agree that it’s possible that humans could be used as unwitting storage media. It seems to me that this could be prevented by using a new human in each iteration. I spoke of an individual human, but it seems to me that my models could apply to situations with multiple interrogators.