This discussion page is the official preferred location for discussing the article, though this comment thread works just as well. The article is no longer being worked on, but minor changes (especially fixing obvious errors) are possible.
What’s your favored approach to Oracle AI confinement? We tried to cover most of the existing AI risk literature, but there’s so much of it and some of it is in such unexpected places that we were bound to miss something.
It’s based on the simple observation that the set of outcomes which accomplish the task and appear acceptable to humans strictly contains the set which accomplish the task, appear acceptable to humans, and are also secretly nefarious (the typical route by which an oracle AI gets unboxed). The difference between the two is quite large, and an auditor should be able to determine, from analysis of the AI’s reported state and its measured resource usage, whether it is computing the former or the latter.
In case that doesn’t make sense, here’s a simple way to achieve confinement with > 50% success rate (thought experiment only): (1) put the UFAI in a room with metered electricity; (2) attach a nuclear device to the meter, near the room; (3) set it to detonate if it is not fed hashcash/bitcoin proofs-of-work equal to the amount of energy consumed (with assumptions about achievable hashes/kJ, a sliding window and error bars, etc.). The AI is then free to do whatever it wants… provided that what it chooses to do is compute hashes and nothing else, or self-terminate.
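To make the accounting concrete, here is a minimal sketch of the metering rule; the hashes-per-kJ figure, window length, and tolerance are made-up parameters for illustration, not values from the thought experiment:

```python
# Toy check for the metered-electricity thought experiment above.
# All constants are hypothetical; a real setup would have to calibrate
# hashes/kJ for the actual hardware and pick the window and error bars.

HASHES_PER_KJ = 5e9   # assumed achievable hash throughput per kJ consumed
TOLERANCE = 0.10      # error bars: allowed relative shortfall in proofs-of-work
WINDOW = 60           # sliding window length, in meter readings

def proofs_cover_energy(energy_kj_readings, verified_hash_counts):
    """True while the submitted, verified proofs-of-work account for the energy
    drawn over the sliding window; False would trip the dead-man switch."""
    recent_energy = sum(energy_kj_readings[-WINDOW:])
    recent_hashes = sum(verified_hash_counts[-WINDOW:])
    expected = recent_energy * HASHES_PER_KJ
    return recent_hashes >= expected * (1 - TOLERANCE)
```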
The principle here is the same, except that instead of cryptographic proof-of-work we let the AI do useful work, using complexity analysis and probabilistic verification to estimate the work done and to make sure the AI is not cooking the books in its response; and instead of a destructive explosion, we cut power and perform forensic analysis. It becomes supremely difficult for an AI with bounded computation to deceive us, since doing so basically involves finding a nefarious response that looks innocent; constructing or maintaining a believable alternate-reality version of itself which, if run, would stumble upon this same result for human-acceptable reasons, while being much less efficient (to account for the difference in resource consumption) yet believably so; and then extracting auditable proofs from that alternate-reality version without actually simulating it.
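A correspondingly rough sketch of the “useful work” version, where the AI’s own audit log is checked against measured consumption; the energy-per-operation constant and the log format are assumptions for illustration only:

```python
JOULES_PER_REPORTED_OP = 2e-9   # assumed energy cost of one reported operation
SLACK = 0.15                    # allowance for overhead and measurement error

def energy_is_accounted_for(reported_op_count, measured_joules):
    """Compare the work the AI claims in its audit log against the energy it
    actually drew; a large unexplained surplus of energy suggests unreported
    (possibly deceptive) computation and triggers power-cut plus forensics."""
    explained = reported_op_count * JOULES_PER_REPORTED_OP
    unexplained = measured_joules - explained
    return unexplained <= SLACK * measured_joules
```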
Furthermore, from the start it would have to be sufficiently smart to recognize and avoid our honeypots and intrusion detection systems, which requires building a model of our security setup (plus our psychology, if the intent is to deceive). We have control over the complexity of our security precautions, and are therefore in a position to place a lower bound on the amount of computation required to successfully deceive the human operators and machine auditors. If that lower bound is made greater than the granularity of the auditors by a factor of, say, 2^20 (about a million), then there is only a one-in-a-million chance of a UFAI developing deception techniques while escaping notice.
Too much philosophical analysis of UFAI confinement starts from the assumption that superhuman means all-powerful and god-like. That is not the case. No matter how powerful, an AI is still a computational process and faces the same fundamental performance constraints. Those constraints can be used to build a confinement regime grounded in computational theory and the laws of thermodynamics, at least for oracle AIs that are not given free rein to construct molecular, reversible quantum computers to run on. This makes deception, the chief concern with oracle AI, detectable with a configurable probability.
This is an idea that I know I read from someone else and later improved upon, although the original citation escapes me. Unfortunately there is no peer-reviewed journal where these things can be discussed in one place, catalogued, and referenced (hint, hint, this might be a good thing for MIRI or FHI to do).
EDIT: It might be helpful to have some context. My own strategy is to use confined oracle AI to accelerate development of molecular nanotechnology and the medical/bio knowledge required for a robust human augmentation programme. I am very wary of any machine-intelligence-overlord scenario, even if that machine is supposedly proved friendly. I’d much rather have human beings take us through the singularity.
Sorry but that’s a dismal plan. This is an AI we’re talking about, not a pocket calculator, correct? And you think you can determine so precisely the amount of computation it will need to carry out, that monitoring its electricity consumption is a sensible way to tell whether it’s doing bad things on the side? If it’s performing advanced cognitive tasks, it will be doing things like deciding to conduct searches of solution spaces that only it has identified, using heuristics of its own invention. How are you going to obtain your estimates of space and time complexity expected of such a search? I assume you don’t want to just ask the AI itself!
I suppose it might be instructive and fun to try to design a system architecture which works in this way… Any AGI has to pay some attention to the space and time demands of a planned computation. The existence of a physical bomb which goes off if it exceeds its computational budget just means that the negative feedback associated with such a budgetary event is even stronger than usual. :-)
I addressed this point specifically. There is no ghost in the machine. Should it decide to perform such counter-measures, both that decision and the execution of the counter-measures are themselves cognitive processes, observable from the outside world.
With a proper machine architecture, the auditors could have access to all sorts of information. They could have records of process execution and summaries of memory access patterns, which would have to match the audit log provided by the AI. The entire mental space of the AI would be available for inspection from outside; regions which show activity but don’t show up in the audit logs would be a large red flag. We can take higher-resolution logs of random moments and check whether that activity pattern matches some portion of the final high-level audit.
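As a toy illustration of those random high-resolution spot checks; the trace and log formats here are invented for the sketch, not part of any proposed architecture:

```python
import random

def spot_check(high_res_trace_at, reported_log, num_checks=16):
    """Sample random moments, capture a detailed activity snapshot at each,
    and verify that every snapshot is accounted for in the AI's self-reported
    audit log; activity missing from the log is the red flag described above.

    high_res_trace_at(t) -> set of (memory_region, operation) pairs seen at t
    reported_log         -> dict mapping time t to the set the AI reported
    """
    times = random.sample(sorted(reported_log), min(num_checks, len(reported_log)))
    for t in times:
        observed = high_res_trace_at(t)
        unaccounted = observed - reported_log.get(t, set())
        if unaccounted:
            return False, t, unaccounted
    return True, None, set()
```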
There are probably a dozen other ways in which unfriendly activity could be detected, beyond what I’ve written here and what Nick Bostrom et al. have explored in their AI boxing paper. These would be defenses in depth: the UFAI would have to overcome all defenses, including ones it may not know about, on its first attempt in order to successfully deceive its human operators.
So here are some more problems I have:
UFAI isn’t necessarily about deception. You also have to worry that the AI will perform its assigned task in a way inimical to human values, one that slips through the constraints intended to prevent this through sheer ingenuity… Suppose the AI is designed to do X, something that human beings want, but that humans also care about Y and Z. And suppose the AI isn’t designed to intrinsically respect Y and Z. Instead there are constraints C that it knows about, the violation of which is also monitored by human beings, and these constraints are supposed to protect values Y and Z from violation. You have to worry that the AI will achieve X in a way which satisfies C but still violates Y and Z.
Auditing has the potential to slow down the AI—the AI may be paused regularly for forensic analysis and/or it may go slow in order to satisfy the safety constraints. Audited AI projects may be overtaken by others with a different methodology.
You want humans to “take us through the singularity”. But we aren’t through the singularity until superhuman intelligence exists. Is your plan, therefore, to suppress development of superhuman AI, until there are humans with superhumanly augmented intelligence? Do you plan to audit their development as well?
I am not opposed to the auditing concept, for AI or for augmented humans, but eventually one must directly answer the question, what is the design of a trustworthy superintelligence, in terms that make no reference to human supervision.
For oracle / tool AI, though, it is. The usual premise is that questions are asked of the superhuman AI, and responses are only implemented if they are comprehensible, sane, and morally acceptable. Your example of “satisfies C but still violates Y and Z” would be picked up by the human oversight (or the output is too complicated to be understood, and is shelved). Blindly following the AI’s directives is a failure mode the oracle AI path is meant to avoid. Further, search processes do not stumble across solutions which look OK but deviously set up an AI breakout or kill-all-humans scenario just by random chance; the probability of that is astronomically low. So really, the only likely ways in which the AI says to do X but ends up violating unstated constraints Y and Z are (a) the human overseers failed at their one and only job, or (b) deception.
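A minimal sketch of that oversight gate, with the three human judgment calls stubbed out as predicates; the names are mine, for illustration, not from any published proposal:

```python
from enum import Enum

class Verdict(Enum):
    IMPLEMENT = "implement"
    SHELVE = "shelve"

def review_oracle_response(response, comprehensible, sane, morally_acceptable):
    """Only act on answers that human reviewers can understand and judge to be
    sane and morally acceptable; anything too complicated is shelved, never
    blindly executed.  The three predicates stand in for human judgment."""
    if not comprehensible(response):
        return Verdict.SHELVE
    if sane(response) and morally_acceptable(response):
        return Verdict.IMPLEMENT
    return Verdict.SHELVE
```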
Auditing has the potential to slow down the AI.
Yup, it does. This is a race, but the question is not “is this approach faster than straight-up UFAI?” but rather “is this approach faster than other pathways to friendly AI?” The FAI problem is a strict superset of the UFAI problem: there is no approach to FAI which is faster than a straight sprint to UFAI, consequences be damned.
My own informed opinion is that (UF)AGI is only 10-20 years away, max. Provably-friendly AI is not even a well-defined problem, but by any definition it is strictly harder. The only estimates I’ve seen come out of MIRI for their approach put FAI decades further out (I remember Luke saying 50-70 years). Such a date makes sense when compared with progress in verifiable computing in other fields. But 2nd place doesn’t count for anything here.
Oracle / tool AGI has the advantage of making safeguards a parallel development. The core AGI is not provably friendly, and can be developed at the same breakneck pace as one would expect of a hedge fund exploring this area. The security controls can be developed and put in place in parallel, without holding up work on the AGI itself. It does require choosing a particular architecture amenable to auditing, but that’s not really a disadvantage as it makes development & testing easier.
You want humans to “take us through the singularity”. But we aren’t through the singularity until superhuman intelligence exists. Is your plan, therefore, to suppress development of superhuman AI, until there are humans with superhumanly augmented intelligence? Do you plan to audit their development as well?
I’m not sure I understand the question. The point of FAI, CEV, etc., as I understand it, is to encode human morality into something a machine can understand, because that machine, not us, will be making the decisions. But if progress comes not from ceding the keys to the kingdom to a machine intelligence, but rather from augmentation of real humans, then why is morality a problem we must solve now? Superhuman humans are still human, and have access to human morality through introspection, the same as we do. Why would you “audit” the mind of a human? That doesn’t make any sense, even setting aside whether it’s feasible.
As to suppressing development of AGI… no, I don’t think that’s a wise choice even if it’s possible. Mostly because I see no realistic way of doing that short of totalitarian control, and the ends do not justify those means. But I also don’t think it would be too hard to transition from oracle AI to human augmentation, especially with the help of a superhuman AGI to develop tools and decipher brain biology.
I am not opposed to the auditing concept, for AI or for augmented humans, but eventually one must directly answer the question, what is the design of a trustworthy superintelligence, in terms that make no reference to human supervision.
Um.. no. That’s completely unsubstantiated. The whole point of oracle / tool AI and confinement is to relinquish the need for provably trustworthy superintelligence.
How do you decide whether some interaction of a complex neural net is friendly or unfriendly?
It’s very hard to tell what a neural net or complex algorithm is doing even if you have logs.
Don’t use a neural net (or variants like deep belief networks). The field has advanced quite a bit since the ’60s, and since the late ’80s there have been machine learning and knowledge representation structures which are human- and/or auditor-comprehensible, such as probabilistic graphical models. These would have to be first-class types of the virtual machine which implements the AGI if you are using auditing as a confinement mechanism. But that’s not really a restriction, as many AI techniques are already phrased in terms of these models (including Eliezer’s own TDT, for example), and others have simple adaptations.
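For instance, here is a toy probabilistic graphical model written out explicitly; every node, edge, and probability is a named, human-readable object an auditor can inspect, in contrast to a large trained weight matrix. The node names and numbers are invented purely for illustration:

```python
# Hypothetical three-node discrete Bayesian network: Goal -> Plan -> Action.
nodes = ["Goal", "Plan", "Action"]
parents = {"Goal": [], "Plan": ["Goal"], "Action": ["Plan"]}

# Conditional probability tables, keyed by (node value, parent values).
cpt = {
    "Goal":   {("answer_query", ()): 1.0},
    "Plan":   {("search_literature", ("answer_query",)): 0.7,
               ("run_simulation",    ("answer_query",)): 0.3},
    "Action": {("emit_report", ("search_literature",)): 1.0,
               ("emit_report", ("run_simulation",)):    1.0},
}

def joint_probability(assignment):
    """P(full assignment) = product of each node's local conditional probability."""
    p = 1.0
    for node in nodes:
        parent_values = tuple(assignment[parent] for parent in parents[node])
        p *= cpt[node].get((assignment[node], parent_values), 0.0)
    return p

print(joint_probability({"Goal": "answer_query",
                         "Plan": "run_simulation",
                         "Action": "emit_report"}))   # -> 0.3
```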