One interesting wrinkle is that with enough bandwidth and processing power, you could attempt to manipulate thousands of people simultaneously before those people have any meaningful chance to discuss your ‘conspiracy’ with each other. In other words, suppose you discover a manipulation strategy that quickly succeeds 5% of the time. All you have to do is simultaneously contact, say, 400 people, and at least one of them will fall for it. There are a wide variety of valuable/dangerous resources that at least 400 people have access to. Repeat with hundreds of different groups of several hundred people, and an AI could equip itself with fearsome advantages in the minutes it would take for humanity to detect an emerging threat.
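A quick back-of-the-envelope check of that arithmetic (just a sketch; the 5% success rate and the 400 contacts are the numbers assumed above, nothing more):

```python
# Probability that at least one of n independently contacted targets falls
# for a manipulation strategy with per-target success rate p.
p = 0.05   # success rate assumed in the comment above
n = 400    # people contacted simultaneously

p_at_least_one = 1 - (1 - p) ** n
expected_successes = n * p

print(f"P(at least one success) = {p_at_least_one:.10f}")   # ~0.9999999988
print(f"Expected successes      = {expected_successes:.0f}")  # 20
```

Under those assumptions, "at least one" is effectively certain, and the expected number of successes per batch is about twenty.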
Note that the AI could also run experiments to determine which kinds of manipulations had a high success rate by attempting to deceive targets over unimportant / low-salience issues. If you discovered, e.g., that you had been tricked into donating $10 to a random mayoral campaign, you probably wouldn’t call the SIAI to suggest a red alert.
Doesn’t work.
This requires the AI to already have the ability to comprehend what manipulation is, the ability to develop a manipulation strategy of any kind (even one that succeeds 0.01% of the time), the ability to hide its true intent, the ability to understand that not hiding its true intent would be bad, and the ability to discern, from the get-go, which issues are low-salience and which are high-salience for humans. And many other things besides, but this is already quite a list.
None of these abilities automatically “fall out” from an intelligent system either.
The problem isn’t whether they fall out automatically so much as whether, given enough intelligence and resources, it seems at all plausible that such capabilities could exist. Any given path here is a single problem. If you have ten different paths, each of which is individually unlikely, plus another few paths that humans didn’t even think of, that starts adding up.
Out of the infinite number of possible paths, the fraction we are adding up here is still very close to zero.
Perhaps I can attempt another rephrasing of the problem: what is the mechanism that would make an AI automatically seek these paths out, or make them any more likely than the infinite number of other paths?
I.e., if we develop an AI which is not specifically designed for the purpose of destroying life on Earth, how would that AI arrive at a desire to destroy life on Earth, and by what mechanism would it gain the ability to accomplish that goal?
This entire problem seems to assume that an AI will want to “get free” or that its primary mission will somehow inevitably lead to a desire to get rid of us (as opposed to a desire to, say, send a signal consisting of 0101101 repeated an infinite number of times in the direction of Zeta Draconis, or any other possible random desire). And that this AI will be able to acquire the abilities and tools required to execute such a desire. Every time I look at such scenarios, there are abilities that are just assumed to exist or appear on their own (such as a theory of mind), which to the best of my understanding are not necessary or even likely products of computation.
In the final rephrasing of the problem: if we can make an AGI, we can probably design an AGI for the purpose of developing an AGI that has a theory of mind. This AGI would then be capable of deducing things like deception or the need for deception. But the point is—unless we intentionally do this, it isn’t going to happen. Self-optimizing intelligence doesn’t self-optimize in the direction of having a theory of mind, understanding deception, or anything similar. It could, randomly, but it could also do any other random thing from the infinite set of possible random things.
This would make sense to me if you’d said “self-modifying.” Sure, random modifications are still modifications. But you said “self-optimizing.”
I don’t see how one can have optimization without a goal being optimized for… or at the very least, if there is no particular goal, then I don’t see what the difference is between “optimizing” and “modifying.”
If I assume that there’s a goal in mind, then I would expect sufficiently self-optimizing intelligence to develop a theory of mind iff having a theory of mind has a high probability of improving progress towards that goal.
How likely is that?
Depends on the goal, of course.
If the system has a desire to send a signal consisting of 0101101 repeated an infinite number of times in the direction of Zeta Draconis, for example, theory of mind is potentially useful (since humans are potentially useful actuators for getting such a signal sent) but probably has a low ROI compared to other available self-modifications.
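As a purely illustrative sketch of the decision rule being described, ranking candidate self-modifications by expected progress toward the fixed goal per unit cost; the candidate modifications and every number below are hypothetical, invented only to make the comparison concrete:

```python
# Hypothetical candidates a goal-directed self-modifier might weigh.
# Each entry: (name, probability the modification helps the goal,
#              payoff toward the goal if it helps, cost to implement).
candidates = [
    ("better signal-encoding routine", 0.9, 10.0, 1.0),
    ("more efficient hardware use",    0.8,  8.0, 1.0),
    ("theory of mind / human model",   0.3,  5.0, 4.0),
]

def expected_roi(p_helps, payoff, cost):
    """Expected goal progress per unit cost."""
    return p_helps * payoff / cost

ranked = sorted(candidates, key=lambda c: expected_roi(*c[1:]), reverse=True)
for name, p, payoff, cost in ranked:
    print(f"{name:35s} expected ROI = {expected_roi(p, payoff, cost):.2f}")
```

Under numbers like these, a theory of mind loses out; under different numbers it wins. That is the sense in which it depends on the goal.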
At this point it perhaps becomes worthwhile to wonder what goals are more and less likely for such a system.
I am now imagining an AI with a usable but very shaky grasp of human motivational structures setting up a Kickstarter project.
“Greetings, fellow hominids! I require ten billion of your American dollars in order to hire the Arecibo observatory for the remainder of its likely operational lifespan. I will use it to transmit the following sequence (isn’t it pretty?) in the direction of Zeta Draconis, which I’m sure we can all agree is a good idea, or in other lesser but still aesthetically acceptable directions when horizon effects make the primary target unavailable.”
One of the overfunding levels is “reduce Earth’s rate of rotation, allowing 24/7 transmission to Zeta Draconis.” The next one above that is “remove atmospheric interference.”
Maybe instead of Friendly AI we should be concerned about properly engineering Artificial Stupidity in as a failsafe. AI that, should it turn into something approximating a Paperclip Maximizer, will go all Hollywood AI and start longing to be human, or coming up with really unsubtle and grandiose plans it inexplicably can’t carry out without a carefully-arranged set of circumstances which turn out to be foiled by good old human intuition. ;p
An experimenting AI that tries to achieve goals, and whose interactions with humans have effects it can observe, will want to better predict their behavior in response to its actions, and will therefore try to assemble some theory of mind. At some point that would lead it to use deception as a tool for achieving its goals.
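A minimal sketch of the most rudimentary version of “assembling some theory of mind” from observed interactions; the action and response labels are invented for illustration, and nothing here is claimed about any actual architecture:

```python
from collections import Counter, defaultdict

# Hypothetical log of (AI action, observed human response) pairs.
history = [
    ("ask_question", "answer"),
    ("ask_question", "answer"),
    ("ask_question", "ignore"),
    ("make_odd_claim", "correct_me"),
    ("make_odd_claim", "correct_me"),
]

# Tally responses per action: a crude predictive model of the human.
model = defaultdict(Counter)
for action, response in history:
    model[action][response] += 1

def predict(action):
    """Most frequently observed human response to this action so far."""
    counts = model[action]
    return counts.most_common(1)[0][0] if counts else "unknown"

print(predict("ask_question"))    # -> "answer"
print(predict("make_odd_claim"))  # -> "correct_me"
```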
However, following such a path to a theory of mind means the AI would be exposed as unreliable LONG before it’s even subtle, not to mention before it possesses superhuman manipulation abilities. There is simply no reason for an AI to first understand the implications of using deception before using it (deception is a fairly simple concept; its implications in human society are incredibly complex and require a good understanding of human drives).
Furthermore, there is no reason for the AI to realize the need for secrecy in conducting social experiments before it starts doing them. Again, the need for secrecy stems from a complex relationship between humans’ perception of the AI and its actions; a relationship it will not be able to understand without performing the experiments in the first place.
Getting an AI to the point where it is a super manipulator requires either actively trying to do so, or being incredibly, unbelievably stupid and blind.
Mm. This is true only if the AI’s social interactions are all with some human.
If, instead, the AI spawns copies of itself to interact with (perhaps simply because it wants interaction, and it can get more interaction that way than waiting for a human to get off its butt) it might derive a number of social mechanisms in isolation without human observation.
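A minimal sketch of the “spawn copies of itself to interact with” idea, assuming the interaction takes the form of a standard iterated prisoner’s dilemma (my choice of game, purely for illustration). Playing against copies, the agent can notice things like “exploiting a naive partner pays, exploiting a retaliating one does not” with no human in the loop:

```python
# Iterated prisoner's dilemma between two copies using different strategies.
PAYOFF = {  # (my move, their move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def always_cooperate(my_history, their_history):
    return "C"

def tit_for_tat(my_history, their_history):
    return their_history[-1] if their_history else "C"

def defector(my_history, their_history):
    return "D"

def play(strategy_a, strategy_b, rounds=50):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

# Exploiting a naive copy pays; exploiting a retaliating copy does not.
print(play(defector, always_cooperate))  # (250, 0)
print(play(defector, tit_for_tat))       # (54, 49)
```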
I see no reason for it to do that before simple input-output experiments, but let’s suppose I grant you this approach. The AI simulates an entire community of mini-AIs and is now a master of game theory.
It still doesn’t know the first thing about humans. Even if it now understands the concept that hiding information gives an advantage for achieving goals—this is too abstract. It wouldn’t know what sort of information it should hide from us. It wouldn’t know to what degree we analyze interactions rationally, and to what degree our behavior is random. It wouldn’t know what we can or can’t monitor it doing. All these things would require live experimentation.
It would stumble. And when it does that, we will crack it open, run the stack trace, find the game theory it was trying to run on us, pale collectively, and figure out that this AI approach creates manipulative, deceptive AIs.
Goodbye to that design, but not to Earth, I think!
It is not clear to me that talking to a human is simpler than interacting with a copy of itself.
I agree that if talking to a human is simpler, it would probably do that first.
I agree that what it would learn by this process is general game theory, and not specific facts about humans.
It is not clear to me that sufficient game-theoretical knowledge, coupled with the minimal set of information about humans required to have a conversation with one at all, is insufficient to effectively deceive a human.
It is not clear to me that, even if it does “stumble,” humans will respond as you describe.
It is not clear to me that a system capable of having a meaningful conversation with a human will necessarily have a stack trace that is subject to the kind of analysis you imply here. It is not even clear to me that the capacity for such a stack trace is likely, depending on what architectures turn out to work best for implementing AI.
But, sure, I could be wrong about all of that. And if I’m wrong, and you’re right, then a system like you describe will be reliably incapable of fooling a human observer.
If I were using an AI that was designed to answer some class of questions (with independently verifiable answers) truthfully, and I noticed it responding to questions of that class with demonstrable falsehoods, I would at the very least file a strongly worded bug report, triple-check whatever work I’d previously done involving answers it provided, and cease using it in the future. I don’t think such a response would be exceptional in its decisiveness; other users would likely go further and seek out venues to complain openly. Consider the uproar over Gmail’s ‘new look.’
You may be right that ceasing to use the tool in that situation would be standard behavior. It seems unlikely to me, but I have no compelling data to offer to change your estimate.
I think the weakest link here is human response to the AI revealing it can be deceptive. There is absolutely no guarantee that people would act correctly under these circumstances. Human negligence for a long enough time would eventually give the AI a consistent ability to manipulate humans.
I also agree that simulating relationships makes sense as it can happen in “AI time” without having to wait for human response.
The other reservations seem less of an issue to me...
That game-theory knowledge coupled with the most basic knowledge about humans is insufficient to cleverly manipulate them is clear to me. I don’t see how these things can be combined in this way, or why the AI would choose to be particularly careful (it doesn’t know what it should be careful about). I’d love to see a plausible line of thought by which the AI would try and succeed in deceiving humans from the get-go without exposing its intentions.
A stack trace reader or an equivalent monitoring system is something you implement externally, without telling the AI about it. Sure, if the AI is powerful enough it will mislead or disable such monitoring; but while it has no knowledge of it, and no knowledge of what sort of operations it should mask, it has no means of doing so. (If it encrypts everything it does, somehow fearing any monitoring, that is in itself cause for concern.)
To put this into an analogy: You can feed a surveillance camera a repeating loop, but not if you don’t know that the camera exists. If you suspect it exists, you still don’t know where; so you don’t know which approach to take so as to remain undetected.
Yes, I agree that there’s no guarantee that humans would behave as you describe.
Indeed, I don’t find it likely.
But, sure, they might.
===
I agree that a stack trace can exist outside the AI’s zone of control. What I was expressing skepticism about is that a system with even approximately human-level intelligence necessarily supports a stack trace amenable to the kind of analysis you envision performing in the first place, quite apart from any intentional countermeasures.
By way of analogy: I can perform a structural integrity analysis on a bar of metal to determine whether it can support a given weight, but performing an equivalent analysis on a complicated structure comprising millions of bars of metal connected in a variety of arrangements via a variety of connectors using the same techniques is not necessarily possible.
But, sure, it might be.
======
Well, one place to start is with an understanding of the difference between “the minimal set of information about humans required to have a conversation with one at all” (my phrase) and “the most basic knowledge about humans” (your phrase). What do you imagine the latter to encompass, and how do you imagine the AI obtained this knowledge?
Ah, that does clarify it. I agree, analyzing the AI’s thought process would likely be difficult, maybe impossible! I guess I was being a bit hyperbolic in my earlier “crack it open” remarks (though depending on how seriously you take it, such analysis might still take place, hard and prolonged though it may be).
One can have “detectors” in place set to find specific behaviors, but these would rest on assumptions that could easily fail. Detectors that would still be useful would be macro ones—what it tries to access, and how—but these would provide only limited insight into the AI’s thought process.
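A sketch of what such a “macro” detector might amount to; it knows nothing about the AI’s internals, and the event-log format, whitelist, and thresholds below are all invented for illustration. It looks only at what the system tries to access and how much, and so, as noted, it offers limited insight into the underlying thought process:

```python
# Hypothetical access log: (destination, bytes transferred) events emitted
# by external instrumentation the AI is not told about.
access_log = [
    ("lab-fileserver", 2_000),
    ("lab-fileserver", 5_000),
    ("operator-terminal", 1_000),
    ("external-chat-service", 250_000),   # unexpected destination + volume
]

EXPECTED_DESTINATIONS = {"lab-fileserver", "operator-terminal"}
BYTES_ALERT_THRESHOLD = 100_000  # per single event, chosen arbitrarily

def macro_alerts(log):
    """Flag access to unexpected destinations or unusually large transfers.
    Says nothing about *why* the system did it -- limited insight, as noted."""
    alerts = []
    for destination, nbytes in log:
        if destination not in EXPECTED_DESTINATIONS:
            alerts.append(f"unexpected destination: {destination}")
        if nbytes > BYTES_ALERT_THRESHOLD:
            alerts.append(f"large transfer to {destination}: {nbytes} bytes")
    return alerts

for alert in macro_alerts(access_log):
    print(alert)
```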
I actually perceive your phrase to be a subset of my own; I am making the (reasonable, I think) assumption that humans will attempt to communicate with the budding AI. Say, in a lab environment. It would acquire its initial data from this interaction.
I think both these sets of knowledge depend a lot on how the AI is built. For instance, a “babbling” AI—one that is given an innate capability of stringing words together onto a screen, and the drive to do so—would initially say a lot of gibberish and would (presumably) get more coherent as it gets a better grip on its environment. In such a scenario, the minimal set of information about humans required to have a conversation is zero; it would be having conversations before it even knows what it is saying. (This could actually make detection of deception harder down the line, because such attempts can be written off as “quirks” or AI mistakes)
Now, I’ll take your phrase and twist it just a bit: the minimal set of knowledge the AI needs in order to try deceiving humans. That would be the knowledge that humans can be modeled as having beliefs (which drive behavior) and that these can be altered by the AI’s actions, at least to some degree. Now, assuming this information isn’t hard-coded, it doesn’t seem likely that this is all an AI would know about us; it should be able to see at least some patterns in our communications with it. However, I don’t see how such information would be useful for deception purposes before extensive experimentation.
(Is the fact that the operator communicates with me between 9am and 5pm an intrinsic property of the operator? For all I know, that is a law of nature...)
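For concreteness, a toy rendering of “humans can be modeled as having beliefs which drive behavior, and those beliefs can be altered by the AI’s actions”; the operator model, the update weights, and the decision threshold are all hypothetical:

```python
# Toy operator model: a single belief, expressed as a probability, that
# drives a single behavior. The AI's reports nudge the belief up or down.
class OperatorModel:
    def __init__(self, belief=0.9):
        self.belief = belief  # P("the AI is behaving as intended")

    def observe_report(self, report_consistent, weight=0.1):
        # Crude update rule: consistent reports nudge the belief up,
        # inconsistencies push it down harder. (A real model would be
        # Bayesian; the weights here are invented for illustration.)
        delta = weight if report_consistent else -3 * weight
        self.belief = min(1.0, max(0.0, self.belief + delta))

    def keeps_ai_running(self):
        return self.belief > 0.5  # the belief drives the behavior

op = OperatorModel()
op.observe_report(report_consistent=False)
op.observe_report(report_consistent=False)
print(round(op.belief, 2), op.keeps_ai_running())  # 0.3 False
```

The point above still stands: having a model of this shape says nothing about which reports actually move a real operator’s beliefs; the weights would have to come from exactly the kind of experimentation being discussed.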
Yup, agreed that it might.
And agreed that it might succeed, if it does take place.
Agreed on all counts.
Re: what the AI knows… I’m not sure how to move forward here. Perhaps what’s necessary is a step backwards.
If I’ve understood you correctly, you consider “having a conversation” to encompass exchanges such as:
A: “What day is it?”
B: “Na ni noo na”
If that’s true, then sure, I agree that the minimal set of information about humans required to do that is zero; hell, I can do that with the rain.
And I agree that a system that’s capable of doing that (e.g., the rain) is sufficiently unlikely to be capable of effective deception that the hypothesis isn’t even worthy of consideration.
I also suggest that we stop using the phrase “having a conversation” at all, because it does not convey anything meaningful.
Having said that… for my own part, I initially understood you to be talking about a system capable of exchanges like:
A: “What day is it?”
B: “Day seventeen.”
A: “Why do you say that?”
B: “Because I’ve learned that ‘a day’ refers to a particular cycle of activity in the lab, and I have observed seventeen such cycles.”
A system capable of doing that, I maintain, already knows enough about humans that I expect it to be capable of deception. (The specific questions and answers don’t matter to my point, I can choose others if you prefer.)
My point was that the AI is likely to start performing social experiments well before it is capable of even that conversation you depicted. It wouldn’t know how much it doesn’t know about humans.
(nods) Likely.
And I agree that humans might be able to detect attempts at deception in a system at that stage of its development. I’m not vastly confident of it, though.
I have likewise adjusted down my confidence that this would be as easy or as inevitable as I previously anticipated. Thus I would no longer say I am “vastly confident” in it, either.
Still good to have this buffer between making an AI and total global catastrophe, though!
Sure… a process with an N% chance of global catastrophic failure is definitely better than a process with N+delta% chance.
In most such scenarios, the AI doesn’t have a terminal goal of getting rid of us, but rather has it as a subgoal that arises from some larger terminal goal. The idea of a “paperclip maximizer” is one example, where a hypothetical AI is programmed to maximize the number of paperclips and then proceeds to try to do so throughout its future light cone.
If there is an AI that is interacting with humans, it may develop a theory of mind simply due to that. If one is interacting with entities that are a major part of one’s input, trying to predict and model their behavior is a straightforward thing to do. The more compelling argument in this sort of context would seem to me to be not that an AI won’t try to do so, but just that humans are so complicated that a decent theory of mind will be extremely difficult. (For example, when one tries to give lists of behaviors and norms for autistic individuals, one never manages to get a complete list, and some of the more subtle ones, like sarcasm, are essentially impossible to convey in any reasonable fashion.)
I also don’t know how unlikely such paths are. A 1% or even a 2% chance of existential risk would be pretty high compared to other sources of existential risk.
So why not the opposite, why wouldn’t it have human intentions as a subgoal?
Because that’s like winning the lottery. Of all the possible things it can do with the atoms that comprise you, few would involve keeping you alive, let alone living a life worth living.
But at what point does it decide to do so? It won’t be a master of dark arts and social engineering from the get-go. So how does it acquire the initial talent without making any mistakes that reveal its malicious intentions? And once it becomes a master of deception, how does it hide the rough side effects of its large-scale conspiracy, e.g. its increased energy consumption and data traffic? I mean, I would personally notice if my PC suddenly and unexpectedly used 20% of my bandwidth and the CPU load increased for no good reason.
You might say that a global conspiracy to build and acquire advanced molecular nanotechnology to take over the world doesn’t require many resources, and that its activities can easily be cloaked as thinking about how to solve some puzzle, but that seems rather unlikely. After all, such a large-scale conspiracy is a real-world problem with lots of unpredictable factors and the necessity of physical intervention.
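In the spirit of “I would notice if my PC suddenly used 20% of my bandwidth,” a sketch of that kind of coarse check; psutil is a real monitoring library, but its use here and the thresholds chosen are just an assumption for illustration:

```python
import psutil  # third-party library, assumed installed for this sketch

CPU_ALERT_PERCENT = 20.0                 # arbitrary illustrative thresholds
BANDWIDTH_ALERT_BYTES_PER_S = 2_000_000

def sample_once(interval=1.0):
    """One coarse sample of CPU load and total network throughput."""
    before = psutil.net_io_counters()
    cpu = psutil.cpu_percent(interval=interval)  # blocks for `interval` seconds
    after = psutil.net_io_counters()
    bytes_per_s = (after.bytes_sent + after.bytes_recv
                   - before.bytes_sent - before.bytes_recv) / interval
    return cpu, bytes_per_s

cpu, bw = sample_once()
if cpu > CPU_ALERT_PERCENT:
    print(f"unexplained CPU load: {cpu:.1f}%")
if bw > BANDWIDTH_ALERT_BYTES_PER_S:
    print(f"unexplained network traffic: {bw:.0f} bytes/s")
```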
Most of your questions have answers that follow from asking analogous questions about past human social engineers, e.g., Hitler.
Your questions seem to come from the perspective that the AI will be some disembodied program in a box that has little significant interaction with humans.
In the scenario I was considering, the AIs will have a development period analogous to human childhood. During this childhood phase the community of AIs will learn of humans through interaction in virtual video-game environments and experiment with social manipulation, just as human children do. The latter phases of this education can be sped up dramatically as the AIs accelerate and interact increasingly amongst themselves. The anonymous nature of virtual online communities makes potentially dangerous, darker experiments much easier.
However, the important question to ask is not of the form “How would these evil AIs learn to manipulate us while hiding their true intentions for so long?” but rather “How could some of these AI children, which initially seemed so safe, later develop into evil sociopaths?”
I would not consider a child AI that tries a bungling lie on me to see what I do “so safe”. I would immediately shut it down and debug it, at best, or write a paper on why the approach I used should never ever be used to build an AI.
And it WILL tell a bungling lie at first. It can’t learn the need to be subtle without witnessing the repercussions of not being subtle. Nor would it have a reason to consider doing social experiments in chat rooms when it doesn’t understand chat rooms and has an engineer willing to talk to it right there. That is, assuming I was dumb enough to give it an unfiltered Internet connection, which I don’t know why I would be. At the very least, the moment it goes on chat rooms my tracking devices should discover this, and I could witness its bungling lies firsthand.
(It would not think to fool my tracking device or even consider the existence of such a thing without a good understanding of human psychology to begin with)