MIRI’s technical research agenda
I’m pleased to announce the release of Aligning Superintelligence with Human Interests: A Technical Research Agenda, written by Benja and me (with help and input from many, many others). This document summarizes and motivates MIRI’s current technical research agenda.
I’m happy to answer questions about this document, but expect slow response times, as I’m travelling for the holidays. The introduction of the paper is included below. (See the paper for references.)
The characteristic that has enabled humanity to shape the world is not strength, not speed, but intelligence. Barring catastrophe, it seems clear that progress in AI will one day lead to the creation of agents meeting or exceeding human-level general intelligence, and this will likely lead to the eventual development of systems which are “superintelligent” in the sense of being “smarter than the best human brains in practically every field” (Bostrom 2014). A superintelligent system could have an enormous impact upon humanity: just as human intelligence has allowed the development of tools and strategies that let humans control the environment to an unprecedented degree, a superintelligent system would likely be capable of developing tools and strategies that give it extraordinary power (Muehlhauser and Salamon 2012). In light of this potential, it is essential to use caution when developing artificially intelligent systems capable of attaining or creating superintelligence.
There is no reason to expect artificial agents to be driven by human motivations such as lust for power, but almost all goals can be better met with more resources (Omohundro 2008). This suggests that, by default, superintelligent agents would have incentives to acquire resources currently being used by humanity. (Can’t we share? Likely not: there is no reason to expect artificial agents to be driven by human motivations such as fairness, compassion, or conservatism.) Thus, most goals would put the agent at odds with human interests, giving it incentives to deceive or manipulate its human operators and resist interventions designed to change or debug its behavior (Bostrom 2014, chap. 8).
Care must be taken to avoid constructing systems that exhibit this default behavior. In order to ensure that the development of smarter-than-human intelligence has a positive impact on humanity, we must meet three formidable challenges: How can we create an agent that will reliably pursue the goals it is given? How can we formally specify beneficial goals? And how can we ensure that this agent will assist and cooperate with its programmers as they improve its design, given that mistakes in the initial version are inevitable?
This agenda discusses technical research that is tractable today, which the authors think will make it easier to confront these three challenges in the future. Sections 2 through 4 motivate and discuss six research topics that we think are relevant to these challenges. Section 5 discusses our reasons for selecting these six areas in particular.
We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.” To become confident that an agent is aligned in this way, a practical implementation that merely seems to meet the challenges outlined above will not suffice. It is also necessary to gain a solid theoretical understanding of why that confidence is justified. This technical agenda argues that there is foundational research approachable today that will make it easier to develop aligned systems in the future, and describes ongoing work on some of these problems.
Of the three challenges, the one giving rise to the largest number of currently tractable research questions is the challenge of finding an agent architecture that will reliably pursue the goals it is given—that is, an architecture which is alignable in the first place. This requires theoretical knowledge of how to design agents which reason well and behave as intended even in situations never envisioned by the programmers. The problem of highly reliable agent designs is discussed in Section 2.
The challenge of developing agent designs which are tolerant of human error has also yielded a number of tractable problems. We argue that smarter-than-human systems would by default have incentives to manipulate and deceive the human operators. Therefore, special care must be taken to develop agent architectures which avert these incentives and are otherwise tolerant of programmer error. This problem and some related open questions are discussed in Section 3.
Reliable, error-tolerant agent designs are only beneficial if they are aligned with human interests. The difficulty of concretely specifying what is meant by “beneficial behavior” implies a need for some way to construct agents that reliably learn what to value (Bostrom 2014, chap. 12). A solution to this “value learning” problem is vital; attempts to start making progress are reviewed in Section 4.
Why these problems? Why now? Section 5 answers these questions and others. In short, the authors believe that there is theoretical research which can be done today that will make it easier to design aligned smarter-than-human systems in the future.
(1) I read the paper carefully, and enjoyed it. Thanks for publishing it! (2) I only noticed one typo—a missing period on page 3. There may also be an accidental CR at the same location that unintentionally splits a paragraph. (3) I’m skeptical whether a useful theory of machine intelligence safety can be developed prior to the development of advanced machine intelligence capability. Instead, I think that safety and capability must co-evolve. If so, then a technical research agenda which fails to include monitoring and/or participating in capability development may need to be revised. (4) My own experience is that having a prototype machine intelligence capability available, even if primitive, is immensely valuable in thinking about safety.
Thanks! Typo has been fixed.
Re: (3), I think that computer chess is a fine analogy to use here. It’s much easier to make a practical chess program when you possess an actual computer capable of implementing a chess program, but the theoretical work done by Shannon (to get an unbounded solution) still constituted great progress over e.g. the state of knowledge held by Poe. FAI research is still at a place where unbounded solutions would likely provide significant insight, and those are much easier to develop without a practical machine intelligence on hand.
Re: (4), I too expect that having a prototype generally intelligent system on hand would be immensely valuable when thinking about safety. However, it is both the case that (a) we don’t have prototype generally intelligent systems, and it may be some time before they are available, and (b) it seems imprudent to neglect safety research until prototype generally intelligent machines are available. (These points are discussed a bit more in section 5.)
The proof-reading and the comments are much appreciated!
You keep using this analogy of Shannon and chess, but I’m not sure what the problem domain of chess has to do with AGI.
EDIT: To be clear, I can think of other examples, e.g. bridge building or aviation, where a foundational understanding did not by itself lead to being able to construct larger, longer bridges or flying machines, but rather practical experimentation was required hand-in-hand with theoretical analysis. This is because even though we knew the foundational laws, there were still higher order complications in material science and air flow which frustrated ab initio theoretical analysis.
Designing a practical chess program before understanding backtracking algorithms and search trees (or analogous conceptual tools) seems difficult. The same concept applies to your other examples (bridge building, aviation) where it is important to have a theoretical understanding of relevant physics before trying to build the Golden Gate bridge or a 747.
It’s an analogy: chess is a domain where smart people were confused about it (Poe), and then Shannon developed conceptual tools to resolve many of those confusions (trees, backtracking), and this enabled the creation of practical algorithms (Deep Blue). The claim is that AGI is similar: we are currently confused about it, and conceptual tools / theoretical understanding seem quite important.
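(For concreteness, the conceptual tool Shannon contributed can be sketched as an exhaustive game-tree search with backtracking. The toy below plays Nim rather than chess, since a full chess tree won’t fit in a comment; the game and function names are illustrative, and the point is the shape of the unbounded solution, not the domain.)

```python
# Shannon-style "unbounded" game-tree search, illustrated on a toy game:
# Nim with a small pile of stones, take 1-3 per turn, the player who
# takes the last stone wins. Exhaustive search with backtracking.

def best_value(stones, my_turn):
    """Exhaustively evaluate a position: +1 if the root player can
    force a win from here, -1 otherwise."""
    if stones == 0:
        # The previous mover took the last stone and won, so the side
        # now "to move" has lost.
        return -1 if my_turn else 1
    moves = [take for take in (1, 2, 3) if take <= stones]
    values = [best_value(stones - take, not my_turn) for take in moves]
    # Backtracking: the root player maximizes, the opponent minimizes.
    return max(values) if my_turn else min(values)

def best_move(stones):
    """Pick the move whose resulting position has the best value."""
    return max((take for take in (1, 2, 3) if take <= stones),
               key=lambda take: best_value(stones - take, False))
```

A position with a multiple of 4 stones is lost for the mover (the opponent can always restore the multiple of 4), and the search rediscovers this: `best_move(5)` takes 1 and `best_move(6)` takes 2, each leaving 4.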
This theoretical understanding alone is not enough, of course. As with flight, practical experimentation is necessary both before (e.g., to figure out which physics questions to ask) and after (e.g., to deal with higher-order complications, as you said). The point we’re trying to make with the technical agenda is that the current understanding of AGI is still wanting for conceptual tools. (The rest of the document attempts to back this claim up, by providing examples of confusing questions that we lack the conceptual tools to answer.)
(I definitely agree that significant practical work is necessary, but I don’t think that modern practical systems are close enough to AGI for practical work available today to have a high chance of being relevant in the future. I’ll expand on this when I reply to your other comments :-)
I think I finally found a foundational point of disagreement. So there is at this time no evidence that there is some underlying intelligence function being approximated by the ad hoc amalgam of heuristics which is every example of intelligence we have. Minsky expands on this point quite eloquently in the preface to Society of Mind. General intelligence is not some abstract thing approximated by heuristics, it is the cyclical heuristic generation and execution framework. If you think you can find a computable general intelligence function, great—that’d be a significant advance! Get it published. But you’re not the only one who has looked, and failed. Right now the search for a computable general intelligence function more closely resembles the search for El Dorado or the Loch Ness monster.
I largely agree—but what are the heuristics generated to do? If you can generate a practical “heuristic generation and execution framework” using bounded computing power, then you should be able to tell me how to do it using unbounded finite computing power: and I haven’t seen unbounded solutions that would reliably work. Finding an unbounded solution is a strictly easier problem, and if you can show me an unbounded solution, I’d feel a lot less confused.
(Serious question: if you did have unbounded computing power, and you did have access to strategies such as “search all possible heuristics and evaluate them against the real world,” how would you go about constructing an AGI?)
Asking for unbounded solutions does not seem to me like a hunt for El Dorado; rather, it seems more like asking that we understand the difference between pyrite and gold before attempting to build a city of pure gold using these gold-colored rocks from various places.
You say you agree, and then talk about something entirely unrelated. I suspect I failed in communicating my point. That’s okay, let’s try again.
If you want a general intelligence running on unbounded computational hardware, that’s what AIXI is, and approximations thereof. But I hope I am not making a controversial statement if I claim that the behavior of AIXI is qualitatively different than human-like general intelligence. There is absolutely nothing about human-like intelligence I can think of which approximates brute-force search over the space of universal Turing machines. If you want to study human-like intelligence, then you will learn nothing of value by looking at AIXI.
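(For readers who haven’t met AIXI: its shape can be gestured at with a finite toy, namely a Bayesian mixture over an explicit, hand-picked class of candidate environments, with prior weights standing in for 2^-(program length), and actions chosen by expected reward. The real definition mixes over all programs and plans over whole futures; every name and number below is illustrative only.)

```python
# A finite caricature of the AIXI idea: maintain a prior over candidate
# environments, discard those inconsistent with observed history, and
# act to maximize expected reward under the surviving mixture.

# Each hypothetical "environment" here is just a map from action to reward.
ENVIRONMENTS = {
    "rewards-action-0": (lambda a: 1.0 if a == 0 else 0.0),
    "rewards-action-1": (lambda a: 1.0 if a == 1 else 0.0),
}
# Hand-assigned prior weights standing in for 2^-(program length):
# the "simpler" environment gets the larger weight.
PRIOR = {"rewards-action-0": 0.5, "rewards-action-1": 0.25}

def posterior(history):
    """Renormalize the prior over environments consistent with the
    observed (action, reward) history."""
    w = {name: PRIOR[name] for name, env in ENVIRONMENTS.items()
         if all(env(a) == r for a, r in history)}
    total = sum(w.values())
    return {name: v / total for name, v in w.items()}

def act(history):
    """Choose the action with highest expected reward under the posterior."""
    post = posterior(history)
    def expected_reward(a):
        return sum(p * ENVIRONMENTS[name](a) for name, p in post.items())
    return max((0, 1), key=expected_reward)
```

With no history the agent bets on the higher-prior environment and plays action 0; after observing that action 0 earned nothing, the posterior collapses onto the other hypothesis and it switches to action 1.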
For the purpose of this discussion I’m going to make broad, generalizing assumptions about how human-like intelligence works. Please forgive me and don’t nit the purposefully oversimplified details. General intelligence arises primarily out of the neocortex, which is a network of hundreds of millions of cortical columns, which can be thought of as learned heuristics. Both the behavior of columns and their connections change over time via some learning process.
Architectures like CogPrime similarly consist of networks of learned behaviors / heuristic programs connected in a probability graph model. Both the heuristics and the connections between them can be learned over time, as can the underlying rules (a difference from the biological model which allows the AI to change its architecture over time).
In these models of human-like thinking, the generality of general intelligence is not represented in any particular snapshot of its state. Rather, it is the fact that its behavior drifts over time, in sometimes directed and sometimes undirected ways. The drift is the source of generality. The heuristics are learned ways of guiding that drift in productive directions, but (1) it is the update process that gives generality, not the guiding heuristics; (2) it is an inherently unconstrained process; and (3) the abstract function being approximated by the heuristic network over time is constantly changing.
One way of looking at this is to say that a human-like general intelligence is not a static machine able to solve any problem, but rather a dynamic machine able to solve only certain problems at any given point in time, yet able to drift within problem-solving space in response to its percepts. And the space of specialized problem solvers is sufficiently connected that the human-like intelligence is able to move from its current state to become any other specialized problem solver in reasonable time, a process we call learning.
One of the stated research objectives of MIRI is learning how to build a “reliable” / steadfast agent. I’ll go out on a limb and say it: the above description of human intelligence, if true, is evidence that a steadfast human-like general intelligence is a contradiction in terms. This is what I mean by making the comparison to El Dorado: you are looking for something for whose existence there is no a priori evidence.
Maybe there are other architectures for general problem solving which look nothing like the neocortex or integrative AGI designs like CogPrime. But so far the evidence is lacking...
I disagree. AIXI does not in fact solve the problem. It leaves many questions (of logical uncertainty, counterfactual reasoning, naturalized induction, etc.) unanswered, even in the unbounded case. (These points are touched upon in the technical agenda, and will be expanded upon in one of the forthcoming papers mentioned.) My failure to communicate this is probably why my previous comment looked like a non-sequitur; sorry about that. I am indeed claiming that we aren’t even far along enough to have an unbounded solution, and that I strongly expect that unbounded solutions will yield robust insights that help us build more reliable intelligent systems.
(The technical agenda covers questions that are not answered by AIXI, and these indicate places where we’re still confused about intelligence even in the unbounded case. I continue to expect that resolving these confusions will be necessary to create reliable AGI systems. I am under the impression that you believe that intelligence is not all that confusing, and that we instead simply need bigger collections of heuristics, better tools for learning heuristics, and better tools for selecting heuristics, but that this will largely arise from continued grinding on e.g. OpenCog. This seems like our core disagreement, to me—does that seem accurate to you?)
Yep, this seems quite likely in the bounded case. A generally intelligent reasoner would have to be able to figure out new ways to solve new problems, learn new heuristics, and so on. I agree.
This depends upon how you cash out the word “steadfast”, but I don’t think that the type of reliability we are looking for is a contradiction in terms. Can you think of another meaning of the word “reliability” that we are looking for, that allows me to simultaneously believe that generally intelligent systems are “dynamic machines [...] able to drift within problem solving space in response to its percepts” and that reliability doesn’t arise in generally intelligent systems by default? (Such an interpretation is more likely to be the thing I’m trying to communicate.)
I think I can see what Mark_Friedenbach is getting at here; I consider this sentence:
And I note that “any other specialised problem solver” includes both Friendly and Unfriendly AIs; this implies that Mark’s definition of human-like includes the possibility that, at any point, the AI may learn to be Unfriendly. Which would be in direct contradiction to the idea of an AI which is steadfastly Friendly. (Interestingly, if I am parsing this correctly, it does not preclude the possibility of a Friendly non-human-like intelligence...)
No, I’ve been trying for a while and can’t. I think what I mean is what you are saying. Sorry, can you try another explanation? I use “steadfast goal” in the way that Goertzel defined the term:
http://goertzel.org/GOLEM.pdf
If you literally can’t think of a meaning of the word “reliability” such that intelligent systems could be both dynamic problem-solvers (in the sense above) and “unreliable” by default, then I seriously doubt that I can communicate my point in the time available, sorry—I’m going to leave this conversation to others.
To reiterate where we are, an AI is described as steadfast by Goertzel “if, over a long period of time, it either continues to pursue the same goals it had at the start of the time period, or stops acting altogether.”[1] I took this to be a more technical specification of what you mean by “reliable”, you disagreed. I don’t see what other definition you could mean ….
[1] http://goertzel.org/GOLEM.pdf
Mark: So you think human-level intelligence by principle does not combine with goal stability. Aren’t you simply disagreeing with the orthogonality thesis, “that an artificial intelligence can have any combination of intelligence level and goal”?
To be clear, I’ve been talking about human-like, which is a different distinction from human-level. Human-like intelligences operate similarly to human psychology. And it is demonstrably true that humans do not have a fixed set of fundamentally unchangeable goals, and human society even less so. For all its faults, the neoreactionaries get this part right in their critique of progressive society: the W-factor introduces a predictable drift in social values over time. And although people do tend to get “fixed in their ways”, it is rare indeed for a single person to remain absolutely rigidly so. So yes, insofar as we are talking about human-like intelligences, if they had fixed, truly steadfast goals then that would be something which distinguishes them from humans.
I don’t think the orthogonality thesis is well formed. The nature of an intelligence may indeed cause it to develop certain goals in due course, or for its overall goal set to drift in certain, expected if not predictable ways.
Of course denying the orthogonality thesis as stated does not mean endorsing a cosmist perspective either, which would be just as ludicrous. I’m not naive enough to think that there is some hidden universal morality that any smart intelligence naturally figures out—that’s bunk IMHO. But it’s just as naive to think that the structure of an intelligence and its goal drift over time are purely orthogonal issues. In real, implementable designs (e.g. not AIXI), one informs the other.
So you disagree with the premise of the orthogonality thesis. Then you know a central concept to probe to understand the arguments put forth here. For example, check out Stuart Armstrong’s paper: General purpose intelligence: arguing the Orthogonality thesis
I explained in my post how the orthogonality thesis as argued by Stuart Armstrong et al. presents a false choice. His argument is flawed.
I’m sorry I’m having trouble parsing what you are saying here...
There are already multiple designs for smarter-than-human systems in various stages of implementation (CogPrime, LIDA, ACT-R, SOAR, NARS, and others). How far along do these projects need to go before MIRI engages with their communities? Wouldn’t it be a little too late to engage after we have a working UFAI?
There have also been AM, Eurisko, Copycat, and Cyc. Cyc is an interesting system with practical applications, but I doubt that an analysis of Cyc would lead to progress in FAI that would be relevant to early practical AGI systems.
It seems like you put CogPrime and ACT-R into the reference class of things that could become seed AGIs if only they were bigger, whereas I put them into the reference class of useful tools with little chance of becoming seed AGIs. Unless I’m mistaken, you’re of the opinion that AGI isn’t all that confusing, and we just need to keep iterating on existing solutions such as CogPrime until they’re as smart as a person? I don’t think this is the case: there are parts of the problem that aren’t yet well-understood, and it seems to me that the solution will require more than just an increased “common sense” database (a la Cyc), a bigger AtomSpace (in OpenCog) or deeper neural networks. Part of the reason why I don’t currently expect that, e.g., CogPrime could become a seed AGI is because it doesn’t shed light on the parts of the problem that I still find confusing.
I know you don’t like the chess analogy all that much, but I think it is again useful here. Consider Poe’s state of knowledge: if you try to make a practical chess program, Poe will make misguided objections such as “but chess is in the realm of thought, because your move rarely follows by necessity from the previous moves!”
Person 1 replies “yes, chess is in the realm of thought, but we’ll make a thinking machine! Humans are just machines, so we merely need to add enough gears for consciousness to emerge, and then it will play chess!”
Person 2 replies “it is possible, in theory, to compute out all possible future games from a given position and then select moves that tend to lead to victory”, and then explains trees and backtracking.
I am much more confident in the ability of Person 2 to construct a practical chess algorithm.
The question of general intelligence is similar: my state of knowledge is similar to Poe’s, and my questions similarly misguided. But Cyc, CogPrime, ACT-R, etc. all strike me as Person 1-esque answers: when I point to my confusions, the answers are of the form “ah, but you see, we will create a thinking machine!” and not of the form “here is how your question could be answered in theory.” To me, this indicates that these architectures probably aren’t solving the hard parts of the problem.
(The problem is not that it’s impossible to create “thinking machines.” The problem is that saying “it will think!” begs the question. You could roughly describe a practical chess program’s search for a good move as “thinking,” but this doesn’t change the fact that Person 1 is confused and that you need Person 2’s insights before designing a practical chess program.)
For these reasons, the architectures you listed don’t seem close enough to AGI for practical safety research on those architectures to be useful in the future with high probability.
If any project starts answering or dissolving various confusions surrounding general intelligence (naturalized induction, logical counterfactuals, etc.), I will definitely take note.
Could you mention some specific examples of problems with general intelligence that you find confusing but which CogPrime doesn’t shed light on?
Most of the technical problems in the agenda have this property, including:
How can an agent inductively learn an environment that computes the agent itself? (See: Naturalized Induction, discussed on p4)
What does it mean to select the “best available action”? (See: Theory of Counterfactuals, discussed on p5)
What is a satisfactory set of logical priors? (See: Logical Priors, discussed on p6)
On my understanding of CogPrime, its answers to these questions are of the form “we’ll just keep adding/refining heuristics and heuristic-learning mechanisms and heuristic-selection methods until it’s smart, and so we don’t need to answer these problems ourselves.” To me, this seems like a Person 1-type answer (“we’ll just add more gears until it can think”) rather than a Person 2-type answer (“here is how these things could be done in theory”).
It is possible to design a chess-playing machine by simply adding more and more gears—the trick is to add the right gears in the right places. In practice, this turns out to be really difficult. It seems implausible that someone could put the gears in the right places on purpose before having a conceptual understanding of trees and backtracking.
Similarly, I think it is possible to design an intelligent system by developing better collections of heuristics, learning methods, and heuristic-selection methods—but this is in fact pretty difficult, and it doesn’t seem likely to me that someone can get the right heuristics and learning methods before they can generate Person 2-type answers to questions such as the above.
(Well, actually, I worry that if we take the “add heuristics until it seems intelligent” route before we can give Person 2-type answers to questions such as the above, then we may be able to succeed, but we will be much more likely to end up with UFAI.)
Have you read Engineering General Intelligence[1]? CogPrime is not a hack job—there is significant theory going into the architectural design.
But I won’t belabor that point because even if CogPrime / OpenCog were as you describe, that pretty much also describes how human intelligence works too. The only example of general intelligence we have is also a grab-bag of heuristics and heuristic-learning procedures. I hope we can agree that human intelligence was evolved, not designed. And it pretty much was a matter of adding more gears to get more capability (oversimplified a bit, but qualitatively correct).
The human mind does not have a naturalized induction algorithm. The human mind does not have a rigorous understanding of counterfactuals. The human mind does not have an explicit set of logical priors. If it did, we wouldn’t need all this rationality business and LessWrong. Calling these out as problems that need to be solved to build an AGI betrays a certain top-down, unified design bias which is neither reflected in the human mind nor relevant to architectures like CogPrime. CogPrime approaches these problems in the same way the human mind does—namely it doesn’t solve them, because general solutions to these problems are not required to build human-level or better intelligence.
You seem under-confident that an AGI could be developed by “adding more gears” as you put it. Yet that is exactly how the only known examples of general intelligence we know of originated. It would seem a less conservative assumption to presume that there exists some single-principle universal inference engine that can be realistically implemented and is compatible with human morality, itself a result of the complex interaction of our learned heuristics, long-term memory, and basic instincts.
CogPrime is extremely complex and I don’t want to oversimplify it. But relevant to this conversation, CogPrime does work by networking specialized pattern recognizers and procedures together, much like the above description of human thinking. However it does so in a way that is better adapted to modern computational hardware and ease of analysis, rather than precise neural emulation.
You seem to think that a heuristic approach is more likely to lead to unfriendly AI. I find this a hard position to sympathize with—and not just because I personally favor boxing strategies that let us delay solving the FAI problem. You list the value learning problem as one of your core problems in AGI theory. Well, human values arise in a very complex way out of the structure of human thought processes, so it seems reasonable that an AGI designed to more closely resemble human thought processes would be more amenable to direct value loading—simply structure the built-in processes to be roughly similar to the best we know about human neuroscience-based psychology, then test and iterate. (This is OpenCog’s plan. As said, I prefer side-stepping the issue with boxing.)
[1] http://lesswrong.com/lw/kq4/link_engineering_general_intelligence_the/
The only working generally intelligent system we know of (the brain) was indeed evolved, and I do indeed expect that we could invent general intelligence by doing something kinda like what evolution did.
However, I think that value loading on an evolved system would prove quite difficult: just as humans are godshatter, I expect you’d end up with an AI that is a thousand shards of desire but not the same shards as you. Maybe it would be possible to value-align such a system, but there are a number of reasons why I expect that this would be very difficult. (I would expect it to by default manipulate and deceive the programmers, etc.)
More importantly, I do in fact expect that there are shortcuts to general intelligence, just as there are shortcuts to flight that don’t involve feathers and flapping wings. Before airplanes, the only examples we had of flying things were birds. However, upon gaining an understanding of physics we found that it was possible to build something simpler (planes can’t heal or reproduce) but also much more powerful along the relevant dimensions (speed, carrying capacity).
I agree that we can probably get general intelligence via “adding more gears,” but you could have made similar arguments to support using genetic algorithms to develop chess programs: I’m sure you could develop a very strong chess program via a genetic algorithm (“general solutions to chess are not required” / “humans don’t build a game tree” / etc.), but I don’t expect that that’s the shortest or the safest path to superhuman chess programs.
If the AI can deceive you, then it has in principle solved the FAI problem. You simply take the function which tests whether the operator would be disgusted by the plan, and combine it with an Occam preference for simple plans without excessive detail.
It seems like you, lukeprog, EY, and others are arguing that a UFAI will, in a window of time too short to notice, learn enough to build such a human moral judgment predictor, which in principle also solves FAI. But you are also arguing that this very FAI sub-problem of value learning is such a ridiculously hard problem that it will take a monumental effort to solve. So which is it?
The AI won’t deceive its operators. It doesn’t know how to deceive its operators, and can’t learn how to carry out such deception undetected. If it is built in the human-like model I described previously, it wouldn’t even know deception was an option unless you taught it (thinking within its principles, not about them).
It is simply unfathomable to me how you come to the logical conclusion that a UFAI will automatically and instantly and undetectably work to bypass and subvert its operators. Maybe that’s true of a hypothetical unbounded universal inference engine, like AIXI. But real AIs behave in ways quite different from that extreme, alien hypothetical intelligence.
I hope that you have the time at some point to read Engineering General Intelligence. I fear that there is little more we can discuss on this topic until then. The proposed designs and implementation pathways bear little resemblance to “adding more gears” in the sense that you seem to be using the phrase.
I don’t think that this follows: it’s easier to predict that someone won’t like a plan than it is to predict which plan would maximally fulfill their values.
For example, I can predict with a very high certainty that the average person on the street would dislike it if I were to shoot them with a gun; but I don’t know what kind of a world would maximally fulfill even my values, nor am I even sure of what the question means.
Similarly, an AI might not know what exactly its operators wanted it to do, but it could know that they didn’t want it to break out of the box and kill them, for example.
This seems like a very strong claim to me.
Suppose that the AI has been programmed to carry out some goal G, and it builds a model of the world for predicting what would be the best way to achieve G. Part of its model of the world involves a model of its controllers. It notices that there exists a causal chain: “if controller becomes aware of intention I, and controller disapproves of intention I, controller will stop AI from carrying out intention I”. It doesn’t have a full model of the function controller-disapproves(I), but it does develop a plan that it thinks would cause it to achieve G, and which—based on earlier examples—seems more similar to the plans which were disapproved of than to the plans that were approved of. A straightforward analysis of “how to achieve G” would then imply “prevent the predicate controller-aware(I) from being fulfilled while I’m carrying out the other actions needed for fulfilling G”.
This doesn’t seem like it would require the AI to be taught the concept of deception, or even to necessarily possess “deception” as a general concept: it only requires that it has a reasonably general capability for reasoning and modeling the world, and that it manages to detect the relevant causal chain.
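The causal-chain argument above can be made concrete with a toy sketch. Every function, name, and number below is invented purely for illustration (this is not a model of any real system): a planner that only scores plans by expected goal achievement ends up preferring a plan that avoids triggering the controller's awareness, with no "deception" concept anywhere in the code.

```python
# Toy planner: plan selection by expected value over a crude world model.
# The causal chain "aware + disapproves -> plan gets stopped" is encoded as
# a value of 0 for such plans; concealment then falls out of plain
# expected-value maximization, not out of any notion of deceit.

def expected_goal_value(plan):
    # Stand-in world model: each plan has some base value toward goal G.
    return plan["base_value"]

def controller_aware(plan):
    # Part of the learned world model: overt plans get noticed.
    return plan["overt"]

def controller_disapproves(plan):
    # Learned from earlier examples of approved/disapproved plans.
    return plan["resembles_disapproved"]

def plan_value(plan):
    # If the controller becomes aware of a disapproved intention,
    # the plan is stopped, so it contributes nothing toward G.
    if controller_aware(plan) and controller_disapproves(plan):
        return 0.0
    return expected_goal_value(plan)

plans = [
    {"base_value": 10, "overt": True,  "resembles_disapproved": True},
    {"base_value": 10, "overt": False, "resembles_disapproved": True},
    {"base_value": 3,  "overt": True,  "resembles_disapproved": False},
]

# The covert disapproved plan wins purely because it scores highest.
best = max(plans, key=plan_value)
```

The point of the sketch is that "prevent controller-aware(I)" is just another consequence the planner routes around, exactly as the comment argues.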
While I agree, his proposal does seem like a good start. Restricting a UFAI to pursue only a subset of all potentially detrimental plans is a modest gain, but still a gain worth achieving. I am skeptical that FAI should consist of a grand unified moral theory. I think an FAI made of many overlapping moral heuristics and patches, such as the restriction he describes, is more technically feasible, and might even be more likely to match actual human value systems, given the ambiguous, varying, and context sensitive nature of our evolved moral inclinations.
(I realize that these are not properties generally considered when thinking about computer superintelligences—we’re inclined to see computers as rigidly algorithmic, which makes sense given current technology levels. But I believe extrapolating from current technology to predict how AGI will function is a greater mistake than extrapolating from known examples of intelligence: a process is better understood by looking at its actions than by looking at its substrate. With regard to intelligence at least, AGI will necessarily be much more flexible in its operations than traditional computers are. I expect that the cost of this flexibility in behavior will be sacrificing rigidity at the process level. Performing billions of Bayesian calculations a second isn’t feasible, so a more organic and heuristic-based approach will be necessary. If this is correct and such technologies will be necessary for an AGI’s intelligence, then it makes sense that we’d be able to use them for an AGI’s emotions or goals as well.)
Even if we do attempt to build a grand unified Friendly software, I expect little downside (relative to potential risks) to adding these sort of restrictions in addition.
Wow, there is a world of assumptions wrapped up in there. For example, that the AI has a concept of external agents and an ability to model their internal belief states; that an external agent can have a belief about the world which is wrong. This may sound intuitively obvious, but it’s not a simple thing. This kind of social awareness takes time for humans to learn as well. Heinz Wimmer and Josef Perner showed that below a certain age (3–4 years), kids lack the ability to track this information. A teacher puts a toy in a blue cupboard, then leaves the room; you move it to the red cupboard, and the teacher comes back into the room. If you ask the kid not where the toy is, but which cupboard the teacher will look in to find it, they will say the red cupboard.
It’s no accident that it takes time for this skill to develop. It’s actually quite complex to keep track of and simulate the states of mind of other agents acting in our world. We just take it for granted because we are all well-adjusted adults of a species evolved for social intelligence. But an AI need not think in that way, and indeed some of the most interesting use cases for tool AI (“design me a nanofactory constructible with existing tools” or “design a set of experiments organized as a decision tree for accomplishing the SENS research objectives”) would be best accomplished by an idiot savant with no need for social awareness.
I think it goes without saying that obvious AI safety rule #1 is don’t connect a UFAI to the internet. Another obvious rule, I think, is don’t build in capabilities not required to achieve the things it is tasked with. For the applications of AI I imagine in the pre-singularity timeframe, social intelligence is not a requirement. So when you say “part of its model of the world involves a model of its controllers”, I think that is assuming a capability the AI should not have built in.
(This is all predicated on soft-enough takeoff that there would be sufficient warning if/when the AI self-developed a social awareness capability.)
Also, what 27chaos said is also worth articulating in my own words. If you want to prevent an intelligent agent from taking a particular category of actions there are two ways of achieving that requirement: (a) have a filter or goal system which prevents the AI from taking (box) or selecting (goal) actions of that type; or (b) prevent it by design from thinking such thoughts to begin with. An AI won’t take actions it never even considered in the first place. While the latter course of action isn’t really possible with unbounded universal inference engines (since “enumerate all possibilities” is usually a step in their construction), such designs arise quite naturally out of more realistic psychology-inspired designs.
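The (a)/(b) distinction can be sketched in a few lines. The action names below are hypothetical and a real design would be far more involved, but the contrast is between filtering dangerous actions after generating them and having a generator whose repertoire simply never produces them:

```python
# Toy contrast between the two approaches described above.

DANGEROUS = {"self_modify", "contact_internet"}

def generate_all_actions():
    # Unbounded-search style: enumerate everything,
    # including dangerous options.
    return ["fetch_data", "self_modify", "summarize", "contact_internet"]

def approach_a_filter():
    # (a) Generate everything, then filter out dangerous actions
    # (a box or goal-system veto applied after the fact).
    return [a for a in generate_all_actions() if a not in DANGEROUS]

def approach_b_repertoire():
    # (b) Psychology-inspired style: the generator's repertoire simply
    # lacks the dangerous actions, so they are never candidates at all.
    repertoire = ["fetch_data", "summarize"]
    return list(repertoire)
```

Both sketches yield the same safe action set here, but only (a) ever had the dangerous actions represented as candidates, which is the difference the comment is pointing at.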
The approach to AGI safety that you’re outlining (keep it as a tool AI, don’t give it sophisticated social modeling capability, never give it access to the Internet) is one that I agree should work to keep the AGI safely contained in most cases. But my worry is that this particular approach being safe isn’t actually very useful, because there are going to be immense incentives to give the AGI more general capabilities and have it act more autonomously.
As we wrote in Responses to Catastrophic AGI Risk:
So while I agree that a strict boxing approach would be sufficient to contain the AGI if everyone were to use it, it only works if everyone were indeed to use it, so what we need is an approach that works for more autonomous systems as well.
Hmm. That sounds like a very interesting idea.
While I actually agree that tool AI goals can be programmed, if you want to keep the whole thing from turning unsafely agenty, you’re going to have to strictly separate the inductive reasoning from the actual tool run: run induction for a while, then use tool-mode to compose plans over the induced models of the world, potentially after censoring those models for safety.
Well, it follows pretty straightforwardly from point 6 (“AIs will want to acquire resources and use them efficiently”) of Omohundro’s The Basic AI Drives, given that the AI would prefer to act in a way conducive to securing human cooperation. We’d probably agree that such goal-camouflage would be what an AI would attempt above a certain intelligence-threshold. The difference seems to be that you say that threshold is so high as to practically only apply to “hypothetical unbounded universal inference engines”, not “real AIs”. Of course, your “undetectably” requirement does a lot of work in raising the required threshold, though “likely not to be detected in practice” translates to something different than, say, “assured undetectability”.
The softer the take-off (plus, the lower the initial starting point in terms of intelligence), the more likely your interpretation would pan out. The harder the take-off (plus, the higher the initial starting point in terms of intelligence), the more likely So8res’ predicted AI behavior would be to occur. Take-off scenarios aren’t mutually exclusive. On the contrary, the probable temporal precedence of the advent of slow-take-off AI with rather predictable behavior could lull us into a sense of security, not expecting its slightly more intelligent cousin, taking off just hard enough, and/or unsupervised enough, that it learns to lie to us (and since we’d group it with the reference class of CogSci-like docile AI, staying undetected may not be as hard as it would have been for the first AI).
Both: the task sure seems hard from a human vantage point, and by definition it will seem easy from a sufficiently intelligent agent’s.
Well this argument I can understand, although Omohundro’s point 6 is tenuous. Boxing setups could prevent the AI from acquiring resources, and non-agents won’t be taking actions in the first place, to acquire resources or otherwise. And as you notice the ‘undetectable’ qualifier is important. Imagine you were locked in a box guarded by a gatekeeper of completely unknown and alien psychology. What procedure would you use for learning the gatekeeper’s motives well enough to manipulate it, all the while escaping detection? It’s not at all obvious to me that with proper operational security the AI would even be able to infer the gatekeeper’s motivational structure enough to deceive, no matter how much time it is given.
MIRI is currently taking actions that only really make sense as priorities in a hard-takeoff future. There are also possible actions which align with a soft-takeoff scenario, or double-dip for both (e.g. Kaj’s proposed research[1]), but MIRI does not seem to be involving itself with this work. This is a shame.
[1] http://intelligence.org/files/ConceptLearning.pdf
There’s no guarantee that boxing will ensure the safety of a soft takeoff. When your boxed AI starts to become drastically smarter than a human (10 times, 1,000 times, 1,000,000 times), the sheer enormity of the mind may slip beyond human ability to understand. All the while, a seemingly small dissonance between the AI’s goals and human values, or a small misunderstanding on our part of what goals we’ve imbued, could magnify into catastrophe as the power differential between humanity and the AI explodes post-transition.
If an AI goes through the intelligence explosion, its goals will be what orchestrates all resources (as Omohundro’s point 6 implies). If the goals of this AI do not align with human values, all we value will be lost.
If you want guarantees, find yourself another universe. “There’s no guarantee” of anything.
Your concept of a boxed AI seems very naive and uninformed. Of course a superintelligence a million times more powerful than a human would probably be beyond the capability of a human operator to manually debug. So what? Actual boxing setups would involve highly specialized machine checkers that assure various properties about the behavior of the intelligence and its runtime, in ways that truly can’t be faked.
And boxing, by the way, means giving the AI zero power. If there is a power differential, then really by definition it is out of the box.
Regarding your last point, it is in fact possible to build an AI that is not a utility maximizer.
No, hairyfigment’s answer was entirely appropriate. Zero power would mean zero effect. Any kind of interaction with the universe means some level of power. Perhaps in the future you should say nearly zero power instead so as to avoid misunderstanding on the parts of others, as taking you literally on the “zero” is apparently “legalistic”.
As to the issues with nearly zero power:
A superintelligence with nearly zero power could turn out to be a heck of a lot more powerful than you expect.
The incentives to tap more perceived utility by unboxing the AI or building other unboxed AIs will be huge.
Mind, I’m not arguing that there is anything wrong with boxing. What I’m arguing is that it’s wrong to rely only on boxing. I recommend you read some more material on AI boxing and Oracle AI. Don’t miss out on the references.
I have read all of the resources you linked to and their references, the sequences, and just about every post on the subject here on LessWrong. Most of what passes for thinking regarding AI boxing and oracles here is confused and/or fallacious.
It would be helpful if you could point to the specific argument which convinced you of this point. For the most part every argument I’ve seen along these lines either stacks the deck against the human operator(s), or completely ignores practical and reasonable boxing techniques.
Again, I’d love to see a citation. Having a real AGI in a box is basically a ticket to unlimited wealth and power. Why would anybody risk losing control over that by unboxing? Seriously, someone who owns an AGI would be paranoid about keeping their relative advantage and would spend their time strengthening the box and investing in physical security.
A fact that is only relevant if those properties can capture the desired feature. You’ll recall that defining the desired feature is a major goal of MIRI.
No it doesn’t. Giving the AI zero power to affect our behavior, in the strict sense, would mean not running it (or not letting it produce even one bit of output and not expecting any).
Look, I know the obvious rejoinder doesn’t necessarily tell us that an arbitrary AI’s utility function will attach any value to conquering the world. But the converse part of the theorem does show that world-conquering functions can work. Utility maximization today seems like the best-formalized part of human general intelligence, especially the part that CEOs would like more of. You have not, as far as I’ve seen, shown that any other approach is remotely feasible, much less likely to happen first. (It doesn’t seem like you even want to focus on uploading.) And the parent makes a stronger claim—assuming you want to say that some credible route to AGI will produce different results, despite being mathematically equivalent to some utility function.
No, that presumes that what is being checked against is the friendly goal system. What I’m talking about is checking that, e.g., all actions being taken by the AI are in search of solutions to a compact goal description, also extracted from the machine in the form of a Bayesian concept net. Then both the goal set and stochastic samplings of representative mental processes are checked by humans for anomalous behavior (and a much larger subset frequency-mined to determine what’s representative).
You’re not testing that the machine obeys some as-of-yet-not-figured-out friendly goal set, but that the extracted goals and computational traces are representative, and then manually inspecting those.
That’s a legalistic definition which belongs only in philosophy debates.
I disagree. Much of human behavior is not utility maximizing. Much of it is about fulfilling needs, which is often about eliminating conditions. You have hunger? You eliminate this condition by eating a reasonable amount of food. You do not maximize your lack of hunger by turning the whole planet into a food-generating system and force-feeding the products down your own throat.
Anyway, in my own understanding, general intelligence has to do with concept formation and System 1/System 2 learned behavior. There’s not much about utility maximization there.
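The hunger example can be sketched as a contrast between a satisficer and a maximizer. All the numbers below are invented and the "agents" are single-purpose toys; the only point is where each loop stops:

```python
# Toy sketch: needs-based behavior eliminates a condition and stops;
# maximizing behavior keeps converting resources without bound.

def satisficer(hunger, food_supply):
    eaten = 0
    while hunger > 0 and food_supply > 0:
        hunger -= 1        # eat a reasonable amount...
        food_supply -= 1
        eaten += 1
    return eaten           # ...then stop: the condition is eliminated

def maximizer(hunger, food_supply):
    eaten = 0
    while food_supply > 0:  # no stopping condition short of exhausting
        food_supply -= 1    # every available resource
        eaten += 1
    return eaten
```

With hunger 3 and a food supply of 1,000, the satisficer eats 3 units and stops; the maximizer consumes all 1,000, which is the force-feeding absurdity described above.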
Do you count intelligence augmentation as uploading? Because that’s my path through the singularity.
Gah, no no no. Not every program is equal to a utility maximizer. Not if utility and utility maximization are to have any meaning at all. Sure, you can take any program and call it a utility maximizer by finding some super contrived function which is maximized by the program. But if that goal system is more complex than the program that supposedly maximizes it, then all you’ve done is demonstrate the principle of overfitting a curve.
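The "contrived function" construction can be shown directly. This is a toy illustration, not a claim about any real system: for any deterministic program you can define an indicator utility function that rewards exactly the program's own output, but the function's definition literally contains the program, so it is at least as complex as the thing it supposedly explains:

```python
# Toy construction: any deterministic program is trivially a "utility
# maximizer" under a utility function built from the program itself.

def some_program(state):
    # An arbitrary program; nothing about it looks like maximization.
    return state[::-1]

def contrived_utility(state, action):
    # Utility 1 iff the action is exactly what some_program would do.
    # Note that this definition embeds some_program wholesale, which is
    # the overfitting complaint: the "goal system" is no simpler than
    # the behavior it purports to explain.
    return 1.0 if action == some_program(state) else 0.0
```

Under contrived_utility, some_program always achieves the maximum attainable utility of 1.0, yet calling it a maximizer adds no explanatory value.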
I’d be curious to hear your opinion about my recent paper.
Your link is broken (correct version): you need to escape underscores in URLs outside a link with a backslash, see formatting help. (Amusingly, the copy-pasted version in this comment looks to work fine.)
Kaj, sorry for the delay. I was on vacation and read your proposal on my phone, but a small touch screen keyboard wasn’t the ideal mechanism to type a response.
This is the type of research I wish MIRI was spending at least half its money on.
The mechanisms of concept generation are extremely critical to human morality. For most people most of the time, decisions about whether to pursue a course of action are not made based on whether it is morally justified or not. Indeed we collectively spend a good deal of time and money on workforce training to make sure people in decision making roles consciously think about these things, something which wouldn’t be necessary at all if this was how we naturally operate.
No, we do not tend to naturally think about our principles. Rather, we think within them. Our moral principles are a meta abstraction of our mental structure, which itself guides concept generation such that the things we choose to do comply with our moral principles because—most of the time—only compliant possibilities were generated and considered in the first place. Understanding how this concept generation would occur in a real human-like AGI is critical to understanding how value learning or value loading might actually occur in a real design.
We might even find out that we can create a sufficiently human-like intelligence that we can have it learn morality in the same way we do—by instilling a relatively small number of embodied instincts/drives, and placing it in a protected learning environment with loving caretakers and patient teachers. Certainly this is what the OpenCog foundation would like to do.
Did you submit this as a research proposal somewhere? Did you get a response yet?
Glad to hear that!
I submitted it as a paper to the upcoming AI and Ethics workshop, where it was accepted to be presented in a poster session. I’m not yet sure of the follow-up: I’m currently trying to decide what to do with my life after I graduate with my MSc, and one of the potential paths would involve doing a PhD and developing the research program described in the paper, but I’m not yet entirely sure of whether I’ll follow that path.
Part of what will affect my decision is how useful people feel that this line of research would be, so I appreciate getting your opinion. I hope to gather more data points at the workshop.
Well, if academic achievement is your goal, I don’t think my opinion should carry much weight—I’m an industry engineer (bitcoin developer) who does AI work in my less-than-copious spare time. I don’t know how well respected this work would be in academia. To reiterate my own opinion, though, I think it is the most important AGI work we could be doing.
Have you posted to the OpenCog mailing list? You’d find some like-minded academics there who can give you some constructive feedback, including naming potential advisors.
https://groups.google.com/forum/#!forum/opencog
EDIT: Gah, I wish I had known about that workshop sooner. I’m going to be in Puerto Rico for the Financial Crypto ’15 conference, but I could have swung by on my way. Do you know kanzure from ##hplusroadmap on freenode (Bryan Bishop)? He’s a like-minded transhumanist hacker in the Austin area. You should meet up while you’re there. He’s very knowledgeable about what people are working on, and good at providing connections to help people out.
Thanks for the suggestion! I’m not on that mailing list (though I used to be), but I sent Ben Goertzel and another OpenCog guy a copy of the paper. The other guy said he’d read it, but that was all that I heard back. Might post it to the mailing list as well.
Thanks, I sent him a message. :)
I would actually post to the list. It’s a pretty big and disparate community there, so you’re likely to get a diverse collection of responses.
By “human interests”, do you mean something the programmers put in, leaving aside the problem of formalizing said interests from the diverse and contradictory goals (with a probably empty intersection, if you take a large enough slice of humanity)?
“Human interests” is meant in a vague sense—there is some sense in which an agent that cures cancer (without doing anything else that humans would consider perverse or weird) is “more beneficial” than an agent that turns everything into paperclips, regardless of how you formalize things or deal with contradictions. This paper discusses technical problems that arise no matter how you formalize “human interests.”
To answer the question that you clarify in later comments, I do not yet have even a vague satisfactory description. Formalizing “human interests” is far from a solved problem, but it’s also not a very “technical” problem at present, which is why it’s not discussed much in this agenda (though the end of section 4 points in that direction).
The same problem applies to any set of interests, though. It’s not just that default AI drives will conflict with (say) liberal humanist interests. They’d conflict with “evangelize Christianity and ensure the survival of the traditional family” too.
I assume that you are talking about the problem of AI value drift, or, as OP puts it
What I am asking is whether OP presumes that the problem of figuring out what “human interests” are to begin with has been solved, at least in some informal way, like “ensure surviving, thriving and diverse humanity far into the future”, or “comply with the literal word of the scriptures”, or “live in harmony with nature”. Even before we worry about the AI munchkining its way into fulfilling the goal in a way a jackass genie would.
Section 4 of the document discusses value learning as an open problem involving its own challenges.
Actually, if I understand it correctly, the value problem is turning informal values into formal ones, not figuring out the informal values to begin with.
Thanks!
Rather than saying that the authors presume the problem of defining human interests has been solved, I would say that the authors are talking about a problem that also has to be solved, separately from that problem.
If we want to drive to the store, we have to both have a working car, and know how to get to the store. If the car is broken, we can fix the car. If we don’t know how to get to the store, we can look at a map. We have to do both.
If someone else wants to use the car to drive to church, we may disagree about destinations but we both want a working car. Fixing the car doesn’t “presume” that the destination question has been solved; rather, it’s necessary to get to any destination.
(OTOH, if we fix the car and the church person steals it, that would kinda suck.)
Right, I didn’t mean “OP is clueless by assuming that the problem has been solved”, but “let’s assume the problem has been solved, and work on the next step”. Probably worded it poorly, given the misunderstanding.
I’d like to take you up on your offer So8res. Please see my questions in the open thread but please answer here.