It simply means “not instrumental”. It has nothing to do with the degree of importance assigned relative to other goals, except in that, obviously, instrumental goals deriving from terminal goal X are always less important than X itself. If your utility function is U = A + B then A and B can be sensibly described as terminal, and the fact that A is terminal does not mean you’d destroy all B just to have A.
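To make the U = A + B point concrete, here is a minimal sketch. The numbers and the two options are invented purely for illustration. Both A and B are terminal, yet the agent declines to destroy all of B for a bit more A.

```python
# Hypothetical additive utility U = A + B; both A and B are terminal.
def utility(a: float, b: float) -> float:
    return a + b

# Two made-up options, each given as (amount of A, amount of B).
keep_both = (10.0, 10.0)
sacrifice_b = (15.0, 0.0)  # more A, but all B destroyed

# The agent simply picks whichever option has the higher total utility.
best = max([keep_both, sacrifice_b], key=lambda option: utility(*option))
print(best)
```

Nothing here ranks A above B or the reverse. "Terminal" only says that neither term is wanted for the sake of something else.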
Yes, “terminal” means final. Terminal goals are final in that your interest in them derives not from any argument but from axiom (ie. built-in behaviours). This doesn’t mean you can’t have more than one.
What? It doesn’t say any such thing. It says they’re inexplicable in terms of the goal system being examined, but that doesn’t mean they’re inaccessible, in the same way that you can access the parallel postulate within Euclidean geometry but can’t justify it in terms of the other Euclidean axioms.
That said, I think we’re probably good enough at rationalization that inexplicability isn’t a particularly good way to model terminal goals for human purposes, insofar as humans have well-defined terminal goals.
Consider an agent trying to maximize its Pacman score. ‘Getting a high Pacman score’ is a terminal goal for this agent—it doesn’t want a high score because that would make it easier for it to get something else, it simply wants a high score. On the other hand, ‘eating fruit’ is an instrumental goal for this agent—it only wants to eat fruit because that increases its expected score, and if eating fruit didn’t increase its expected score then it wouldn’t care about eating fruit.
That is the only difference between the two types of goals. Knowing that one of an agent’s goals is instrumental and another terminal doesn’t tell you which goal the agent values more.
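A rough sketch of the Pacman example, with made-up point values: the agent's only terminal goal is its expected score, and the value it places on fruit is nothing more than the difference fruit makes to that score.

```python
# Hypothetical numbers; the only terminal goal is the expected score.
def expected_score(eats_fruit: bool, fruit_points: int) -> int:
    base = 100  # assumed score from pellets alone
    return base + (fruit_points if eats_fruit else 0)

def instrumental_value_of_fruit(fruit_points: int) -> int:
    # Fruit is valued only for the difference it makes to the terminal goal.
    return expected_score(True, fruit_points) - expected_score(False, fruit_points)

print(instrumental_value_of_fruit(50))  # fruit is worth points
print(instrumental_value_of_fruit(0))   # fruit is worth nothing
```

When fruit stops contributing to the score, its value to the agent drops to zero, while the value of the score itself needs no further justification.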
Since you seem to be purposefully unwilling to understand my posts, could you please refrain from declaring that I have “rescinded” my opinions on the matter?
Terminal values can be seen as value axioms in that they’re the root nodes in a graph of values, just as logical axioms can be seen as the root nodes of a graph of theorems.
They are unlike logical axioms in that we’re using them to derive the utility consequent on certain choices (given consequentialist assumptions; it’s possible to have analogs of terminal values in non-consequentialist ethical systems, but it’s somewhat more complicated) rather than the boolean validity of a theorem. Different terminal values may have different consequential effects, and they may conflict without contradiction. This does not make them any less terminal.
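One way to picture the root-node analogy, as a sketch only: the value names and edges below are invented, and the graph is assumed to be acyclic. Each value points at the values it is wanted for, and a terminal value is one with no such outgoing reason.

```python
# Hypothetical value graph: each value maps to the values it is wanted for.
# An empty list means "wanted for its own sake".
wanted_for = {
    "high score": [],          # terminal
    "natural beauty": [],      # a second terminal value, purely illustrative
    "eat fruit": ["high score"],
    "stay alive": ["high score"],
    "avoid ghosts": ["stay alive"],
}

def terminal_values(graph):
    """Return the root nodes: values with no further reason behind them."""
    return sorted(value for value, reasons in graph.items() if not reasons)

def roots_of(graph, value):
    """Follow 'wanted for' edges until reaching values with no further reason."""
    reasons = graph[value]
    if not reasons:
        return {value}
    found = set()
    for reason in reasons:
        found |= roots_of(graph, reason)
    return found

print(terminal_values(wanted_for))
print(roots_of(wanted_for, "avoid ghosts"))
```

The structure allows more than one root, and it says nothing about which root carries more weight. That would need separate information, such as weights in a utility function.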
Clippy has only one terminal value which doesn’t take into account the integrity of anything that isn’t a paperclip, which is why it’s perfectly happy to convert the mass of galaxies into said paperclips. Humans’ values are more complicated, insofar as they’re well modeled by this concept, and involve things like “life” and “natural beauty” (I take no position on whether these are terminal or instrumental values w.r.t. humans), which is why they generally aren’t.
Locally, human values usually are modelled by TGs.
You can define several ethical models in terms of their preferred terminal value or set of terminal values; for negative utilitarianism, for example, it’s minimization of suffering. I see human value structure as an unsolved problem, though, for reasons I don’t want to spend a lot of time getting into this far down in the comment tree.
Or did you mean “locally” as in “on Less Wrong”? I believe the term’s often misused here, but not for the reasons you seem to.
What’s conflict without contradiction?
Because of the structure of Boolean logic, logical axioms that come into conflict generate a contradiction and therefore imply that the axiomatic system they’re embedded in is invalid. Consequentialist value systems don’t have that feature, and the terminal values they flow from are therefore allowed to conflict in certain situations, if more than one exists. Naturally, if two conflicting terminal values both have well-behaved effects over exactly the same set of situations, they might as well be reduced to one, but that isn’t always going to be the case.
If acquiring bacon was your ONLY terminal goal, then yes, it would be irrational not to do absolutely everything you could to maximize your expected bacon. However, most people have more than just one terminal goal. You seem to be using ‘terminal goal’ to mean ‘a goal more important than any other’. Trouble is, no one else is using it this way.
EDIT: Actually, it seems to me that you’re using ‘terminal goal’ to mean something analogous to a terminal node in a tree search (if you can reach that node, you’re done). No one else is using it that way either.
Feel free to offer the correct definition. But note that you can’t define it as overridable, since non-terminal goals are already defined that way.
There is no evidence that people have one or more terminal goals. At least you need to offer a definition such that multiple TGs don’t collide, and are distinguishable from non-TGs.
It looks to me (am I misunderstanding?) as if you take “X is a terminal goal” to mean “X is of higher priority than anything else”. That isn’t how I use the term, and isn’t how I think most people here use it.
I take “X is a terminal goal” to mean “X is something I value for its own sake and not merely because of other things it leads to”. Something can be a terminal goal but not a very important one. And something can be a non-terminal goal but very important because the terminal goals it leads to are of high priority.
So it seems perfectly possible for eating barbecue to be a terminal goal even if one would not generally kill to achieve it.
[EDITED to add the following.]
On looking at the rest of this thread, I see that others have pointed this out to you and you’ve responded in ways I find baffling. One possibility is that there’s a misunderstanding on one or other side that might be helped by being more explicit, so I’ll try that.
The following is of course an idealized thought experiment; it is not intended to be very realistic, merely to illustrate the distinction between “terminal” and “important”.
Consider someone who, at bottom, cares about two things (and no others). (1) She cares a lot about people (herself or others) not experiencing extreme physical or mental anguish. (2) She likes eating bacon. These are (in my terminology, and I think that of most people here) her “terminal values”. It happens that #1 is much more important to her than #2. This doesn’t (in my terminology, and I think that of most people here) make #2 any less terminal; just less important.
She has found that simply attending to these two things and nothing else is not very effective in minimizing anguish and maximizing bacon. For instance, she’s found that a diet of lots of bacon and nothing else tends to result in intestinal anguish, and what she’s read leads her to think that it’s also likely to result in heart attacks (which are very painful, and sometimes lead to death, which causes mental anguish to others). And she’s found that people are more likely to suffer anguish of various kinds if they’re desperately poor, if they have no friends, etc. And so she comes to value other things, not for their own sake, but for their tendency to lead to less anguish and more bacon later: health, friends, money, etc.
So, one day she has the opportunity to eat an extra slice of bacon, but for some complicated reason which this comment is too short to contain doing so will result in hundreds of randomly selected people becoming thousands of dollars poorer. Eating bacon is terminally valuable for her; the states of other people’s bank accounts are not. But poorer people are (all else being equal) more likely to find themselves in situations that make them miserable, and so keeping people out of poverty is a (not terminal, but important) goal she has. So she doesn’t grab the extra slice of bacon.
(She could in principle attempt an explicit calculation, considering only anguish and bacon, of the effects of each choice. But in practice that would be terribly complicated, and no one has the time to be doing such calculations whenever they have a decision to make. So what actually happens is that she internalizes those non-terminal values, and for most purposes treats them in much the same way as the terminal ones. So she isn’t weighing bacon against indirect hard-to-predict anguish, but against more-direct easier-to-predict financial loss for the victims.)
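For what it's worth, the thought experiment can be written down as a toy calculation. Every number below (the weights, the assumed link between poverty and anguish, the 300 people) is made up for illustration. The point is only that bacon can carry a small positive weight of its own and still lose to the indirect anguish.

```python
# Hypothetical weights: both values are terminal, but anguish matters far more.
WEIGHT_ANGUISH = -100.0  # utility per unit of expected anguish
WEIGHT_BACON = 1.0       # utility per slice of bacon

# Assumed link between an instrumental variable (poverty) and a terminal one.
ANGUISH_PER_PERSON_IMPOVERISHED = 0.01

def utility(expected_anguish: float, bacon_slices: float) -> float:
    return WEIGHT_ANGUISH * expected_anguish + WEIGHT_BACON * bacon_slices

def evaluate(bacon_slices: float, people_impoverished: int) -> float:
    # Poverty has no weight of its own; it matters only through anguish.
    anguish = ANGUISH_PER_PERSON_IMPOVERISHED * people_impoverished
    return utility(anguish, bacon_slices)

take_slice = evaluate(bacon_slices=1, people_impoverished=300)
decline = evaluate(bacon_slices=0, people_impoverished=0)
free_slice = evaluate(bacon_slices=1, people_impoverished=0)

print("declines the costly slice:", decline > take_slice)
print("takes a slice that harms no one:", free_slice > decline)
```

Poverty never appears in the utility function itself, which is what makes it instrumental here, however heavily it ends up counting in the decision.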
Do you see some fundamental incoherence in this? Or do you think it’s wrong to use the word “terminal” in the way I’ve described?
There’s no incoherence in defining “terminal” as “not lowest priority”, which is basically what you are saying.
It’s just not what the word means.
Literally, etymologically, that is not what terminal means. It means maximal, or final. A terminal illness is not an illness that is a bit more serious than some other illness.
It’s not even what it usually means on LW. If Clippy’s goals were terminal in your sense, they would be overridable: you would be able to talk Clippy out of paperclipping.
What you are talking about is valid; it is a thing. If you have any hierarchy of goals, there are some at the bottom, some in the middle, and some at the top. But you need to invent a new word for the middle ones, because “terminal” doesn’t mean “intermediate”.
OK, that makes the source of disagreement clearer.
I agree that “terminal” means “final” (but not that it means “maximal”; that’s a different concept). But it doesn’t (to me, and I think to others on LW) mean “final” in the sense I think you have in mind (i.e., so supremely important that once you notice it applies you can stop thinking), but in a different sense (when analysing goals or values, asking “so why do I want X?”, this is a point at which you can go no further: “well, I just do”).
So we’re agreed on the etymology: a “terminal” goal or value is one-than-which-one-can-go-no-further. But you want it to mean “no further in the direction of increasing importance” and I want it to mean “no further in the direction of increasing fundamental-ness”. I think the latter usage has at least the following two advantages:
It’s possible that people actually have quite a lot of goals and values that are “terminal” in this sense, including ones that are directly relevant in motivating them in ordinary situations. (Whereas it’s very rare to come across a situation in which some goal you have is so comprehensively overriding that you don’t have to think about anything else.)
This usage of “terminal” is well established on LW. I think its usage here goes back to Eliezer’s post called Terminal Values and Instrumental Values from November 2007. See also the LW wiki entry. This is not a usage I have just invented, and I strongly disagree with your statement that “It’s not even what it usually means on LW”.
The trouble with Clippy isn’t that his paperclip-maximizing goal is terminal, it’s that that’s his only goal.
I’m not sure whether in your last paragraph you’re suggesting that I’m using “terminal” to mean “intermediate in importance”, but for the avoidance of doubt I am not doing anything at all like that. There are two separate things here that you could call hierarchies, one in terms of importance and one in terms of explanation, and “terminal” refers (in my usage, which I think is also the LW-usual one) only to the latter.
We can go a step further, actually: “terminal value” and various synonyms are well-established within philosophy, where they usually carry the familiar LW meaning of “something that has value in itself, not as a means to an end”.
Who would design an AI with a single terminal goal? It’s a basic principle of engineering that you have redundancy, backup systems and backdoors, and you don’t have single points of failure. A Clippy with an emergency backup goal could be talked out of clipping.
Paperclip maximization is a thought experiment intended to illustrate the consequences of a seemingly benign goal when coupled to superhuman optimization power. It’s an exceptionally unlikely value structure for a real-world AI, but it’s not supposed to be realistic; in fact, it’s supposed to be rather on the silly side, the better to avoid the built-in value heuristics that tend to trip people up in cases like these. (A more realistic set of terminal values for an AI might look like a more formalized version of “Follow the laws of $COUNTRY; maximize the market capitalization of $COMPANY; and follow the orders of $COMPANY’s board or designated agents”, plus some way of handling precedence. Given equal optimization power, this is only slightly less dangerous than Clippy.)
Nonetheless, I don’t think it’s quite proper to call Clippy’s values a point of failure. Clippy is doing exactly what it was designed to do; that just happens to be inimical to certain implicit values that no one thought to include.
Which is why it would be helpful to include another higher priority goal in a goal driven architecture, as a safety feature. It need not amount to anything more complex than “obey all instructions on this channel”, where the instructions are no more complex than “shut yourself down”.
If you designed an AI with a single goal, then you have an AI with a single goal; that’s not the problem. The mistake is designing something with no off switch or override.
It need not amount to anything more complex than “obey all instructions on this channel”, where the instructions are no more complex than “shut yourself down”
And “always keep this channel open” and “don’t corrupt any sensor data that outputs to this channel” and “don’t send yourself commands on this channel” and “don’t build anything so that it will send you a signal on this channel” and “don’t build anything that will build anything that will eventually send you a signal on this channel unless a signal on this channel tells you to do it”.
… and I can STILL think of more ways to corrupt that kind of hack.
Not to mention that if you don’t want script kiddies to have too much fun, you will need to authenticate the instructions on that channel, which is another very large can of very wriggly worms...
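To give a flavour of what "authenticate the instructions" means at the very simplest level, here is a sketch using an HMAC tag on the command. The key and function names are hypothetical. It deliberately ignores replay attacks, key storage, and the much harder problem of an agent that can read or route around its own checking code.

```python
import hashlib
import hmac

# Hypothetical shared secret; keeping it secret is exactly the hard part.
SECRET_KEY = b"example-key-not-for-real-use"

def sign(command: bytes, key: bytes = SECRET_KEY) -> bytes:
    """Compute an HMAC-SHA256 tag for a command."""
    return hmac.new(key, command, hashlib.sha256).digest()

def is_authentic(command: bytes, tag: bytes, key: bytes = SECRET_KEY) -> bool:
    """Accept the command only if the tag matches (constant-time comparison)."""
    return hmac.compare_digest(sign(command, key), tag)

good_tag = sign(b"shut yourself down")
forged_tag = b"\x00" * 32

print(is_authentic(b"shut yourself down", good_tag))
print(is_authentic(b"shut yourself down", forged_tag))
```

This only keeps outsiders without the key from issuing commands. It does nothing about the items in the list above, which concern what the agent itself might do to the channel.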
The problem is not to “Solve Human Morality”, the problem is to make an AI that will do what humans end up having wanted. Since this is a problem for which we can come up with solid definitions (just to plug my own work :-p), it must be a solvable problem. If it looks impossible or infeasible, that is simply because you are taking the wrong angle of attack.
Stop trying to figure out a way to avoid the problem and solve it.
For one thing, taboo the words “morality” and “ethics”, and solve the simpler, realer problem: how do you make an AI do what you intend it to do when you convey some wish or demand in words? As Eliezer has said, humans are Friendly to each-other in this sense: when I ask another human to get me a pizza, the entire apartment doesn’t get covered in a maximal number of pizzas. Another human understands what I really mean.
So just solve that: what reasoning structures does another agent need to understand what I really mean when I ask for a pizza?
But at least stop blatantly trolling LessWrong by trying to avoid the problem by saying blatantly stupid stuff like “Oh, I’ll just put an off-switch on an AI, because obviously no agent of human-level intelligence would ever try to prevent the use of an off-switch by, you know, breaking it, or covering it up with a big metal box for protection.”
The problem is not to “Solve Human Morality”, the problem is to make an AI that will do what humans end up having wanted.
Is it? Why take on either of those gargantuan challenges? Another perfectly reasonable approach is to task the AI with nothing more than data processing with no effectors in the real world (Oracle AI), and watch it like a hawk. And no one at MIRI or on LW has proved this approach dangerous except by making crazy unrealistic assumptions, e.g. in this case why would you ever put the off-switch in a region of the AI’s environment?
As you and Eliezer say, humans are Friendly to each other already. So have humans moderate the actions of the AI, in a controlled setup designed to prevent AI learning to manipulate the humans (break the feedback loop).
Another perfectly reasonable approach is to task the AI with nothing more than data processing with no effectors in the real world (Oracle AI), and watch it like a hawk.
I consider this semi-reasonable, and in fact, wouldn’t even feel the need to watch it like a hawk. Without a decision-outputting algorithm, it’s not an agent, it’s just a learner: it can’t possibly damage human interests.
I say “semi” reasonable, because there is still the issue of understanding debug output from the Oracle’s internal knowledge representations, and putting it to some productive usage.
I also consider a proper Friendly AI to be much more “morally profitable”, in the sense of yielding a much greater benefit than usage of an Oracle Learner by untrustworthy humans.
This becomes an issue of strategy. I assume the end goal is a positive singularity. The MIRI approach seems to be: design and build a provably “safe” AGI, then cede all power to it and hope for the best as it goes “FOOM” and moves us through the singularity. A strategy I would advocate for instead is: build an Oracle AI as soon as it is possible to do so with adequate protections, and use its super-intelligence to design singularity technologies which enable (augmented?) humans to pass through the singularity.
I prefer the latter approach as it can be done with today’s knowledge and technology, and does not rely on mathematical breakthroughs on an indeterminate timescale which may or may not even be possible or result in a practical AGI design. The latter approach instead depends on straight-forward computer science and belts-and-suspenders engineering on a predictable timescale.
If I were executive director of MIRI, I would continue the workshops, because there is a non-zero probability that a breakthrough might be made that radically simplifies the safe AGI design space. However I’d definitely spend more than half of the organization’s budget and time on a strategy with a definable time-scale and an articulatable project plan, such as the Oracle-AGI-to-Intelligence-Augmentation approach I advocate, although others are possible.
Well that’s where the “positive singularity” and “Friendly (enough) AGI” goals separate: if you choose the route to a “positive singularity” of human intelligence augmentation, you still face the problems of human irrationality, of human moral irrationality (lack of moral caring, moral akrasia, morals that are not aligned with yours, etc), but you now also face the issue of what happens to human evaluative judgement under the effects of intelligence augmentation. Can humans be modified while maintaining their values? We honestly don’t know.
(And I for one am reasonably sure that nobody wise should ever make me their Singularity-grade god-leader, on grounds that my shouldness function, while not nearly as completely alien as Clippy’s, is still relatively unusual, somewhere on an edge of a bell curve, and should therefore not be trusted with the personal or collective future of anyone who doesn’t have a similar shouldness function. Sure, my meta-level awareness of this makes me Friendly, loosely speaking, but we humans are very bad at exercising perfect meta-level awareness of others’ values all the time, and often commit evaluative mind-projection fallacies.)
What I would personally do, at this stage, is just to maintain a distribution (you know probability was gonna enter somewhere) over potential routes to a positive outcome. Plan and act according to the full distribution, through institutions like FHI and FLI and such, while still focusing the specific, achieve-a-single-narrow-outcome optimization power of MIRI’s mathematical talents on building provably Friendly AGIs. Update early and often on whatever new information is available.
For instance, the more I look into AGI and cognitive science research, the more I genuinely feel the “Friendly AI route” can work quite well. From my point of view, it looks more like a research program than an impossible Herculean task (admittedly, the difference is often kinda hard to see to those who’ve never served time in a professional research environment), whereas something like safe human augmentation is currently full of unknown unknowns that are difficult to plan around.
And as much as I generally regard wannabe-ems with a little disdain for their flippant “what do I need reality for!?” views, I do think that researching human mind uploading would help discover a lot of the neurological and cognitive principles needed to build a Friendly AI (ie: what cognitive algorithms are we using to make evaluative judgements?), while also helping create avenues for agents with human motivations to “go FOOM” themselves, just in case, so that’s worthwhile too.
The important thing to note about the problems you identified is how they differ from the problem domains of basic research. What happens to human evaluative judgement under the effects of intelligence augmentation? That’s an experimental question. Can we trust a single individual to be enhanced? Almost certainly not. So perhaps we need to pick 100 or 1,000 people, wired into a shared infrastructure which enhances them in lock-step, and has incentives in place to ensure collaboration over competition, and consensus over partisanship in decision making protocols. Designing these protocols and safeguards takes a lot of work, but both the scale and the scope of that work are fairly well quantified. We can make a project plan and estimate with a high degree of accuracy how long and how much money it would take to design sufficiently safe oracle AI and intelligence augmentation projects.
FAI theory, on the other hand, is like the search for a grand unified theory of physics. We presume such a theory exists. We even have an existence proof of sorts (the human mind for FAI, the universe itself in physics). But the discovery of a solution is something that will or will not happen, and if it does it will be on an unpredictable time scale. Maybe it will take 5 years. Maybe 50, maybe 500. Who knows? After the rapid advances of the early 20th century, I’m sure most physicists thought a grand unified theory must be within reach; Einstein certainly did. Yet here we are nearly 100 years after the publication of the general theory of relativity, 85 years after most of the major discoveries of quantum mechanics, and yet in many ways we seem no closer to a theory of everything than we were some 40 years ago when the standard model was largely finalized.
It could be that at the very next MIRI workshop some previously unknown research associate solves the FAI problem conclusively. That’d be awesome. Or maybe she proves it impossible, which would be an equally good outcome because then we could at least refocus our efforts. Far worse, it might be that 50 years from now all MIRI has accumulated is a thoroughly documented list of dead-ends.
But that’s not the worst case, because in reality UFAI will appear within the next decade or two, whether we want it to or not. So unless we are confident that we will solve the FAI problem and build out the solution before the competition, we’d better start investing heavily in alternatives.
The AI winter is over. Already multiple very well funded groups are rushing forward to generalize already super-human narrow AI techniques. AGI is finally a respectable field again, and there are multiple teams making respectable progress towards seed AI. And parallel hardware and software tools have finally gotten to the point where a basement AGI breakthrough is a very real and concerning possibility.
We don’t have time to be dicking around doing basic research on whiteboards.
Aaaand there’s the “It’s too late to start researching FAI, we should’ve started 30 years ago, we may as well give up and die” to go along with the “What’s the point of starting now, AGI is too far away, we should start 30 years later because it will only take exactly that amount of time according to this very narrow estimate I have on hand.”
If your credible intervals on “How much time we have left” and “How much time it will take” do not overlap, then you either know a heck of a lot I don’t, or you are very overconfident. I usually try not to argue from “I don’t know and you can’t know either” but for the intersection of research and AGI timelines I can make an exception.
Admittedly my own calculation looks less like an elaborate graph involving supposed credibility intervals, and more like, “Do we need to do this? Yes. Can we realistically avoid having to do this? No. Let’s start now EOM.”
I think that’s a gross simplification of the possible outcomes.
Admittedly my own calculation looks less like an elaborate graph involving supposed credibility intervals, and more like, “Do we need to do this? Yes. Can we realistically avoid having to do this? No. Let’s start now EOM.”
I think you need better planning.
There’s a great essay that has been a featured article on the main page for some time now called Levels of Action. Applied to FAI theory:
Level 1: Directly ending human suffering.
Level 2: Constructing an AGI capable of ending human suffering for us.
Level 3: Working on the computer science aspects of AGI theory.
Level 4: Researching FAI theory, which constrains the Level 3 AGI theory.
But for that high-level basic research to have any utility, these levels must be connected to each other: there must be a firm chain where FAI theory informs AGI designs, which are actually used in the construction of an AGI tasked with ending human suffering in a friendly way.
From what I can tell on the outside, the MIRI approach seems to be: (1) find a practical theory of FAI; (2) design an AGI in accordance with this theory; (3) implement that design; (4) mission accomplished!
That makes a certain amount of intuitive sense, having stages laid out end-to-end in chronological order. However as a trained project manager I must tell you this is a recipe for disaster! The problem is that the design space branches out at each link, but without the feedback of follow-on steps, inefficient decision making will occur at earlier stages. The space of working FAI theories is much, much larger than the FAI-theory-space which results in practical AGI designs which can be implemented prior to the UFAI competition and are suitable for addressing real-world issues of human suffering as quickly as possible.
Some examples from the comparably large programs of the Manhattan project and Apollo moonshot are appropriate, if you’ll forgive the length (skip to the end for a conclusion):
The Manhattan project had one driving goal: drop a bomb on Berlin and Tokyo before the GIs arrived, hopefully ending the war early. (Of course Germany surrendered before the bomb was finished, and Tokyo ended up so devastated by conventional firebombing that Hiroshima and Nagasaki were selected instead, but the original goal is what matters here.) The location of the targets meant that the bomb had to be small enough to fit in a conventional long-distance bomber, and the timeline meant that the simpler but less efficient U-235 designs were preferred. A program was designed, adequate resources allocated, and the goal achieved on time.
On the other hand it is easy to imagine how differently things might have gone if the strategy was reversed; if instead the US military decided to institute a basic research program into nuclear physics and atomic structure, before deciding on the optimal bomb reactions, then doing detailed bomb design before creating the industry necessary to produce enough material for a working weapon. Just looking at the first stage, there is nothing a priori which makes it obvious that U-235 and Pu-239 are the “interesting” nuclear fuels to focus on. Thorium, for example, was more naturally abundant and already being extracted as a by-product of rare earth metal extraction; its reactions generate less lethal radiation and fewer long-lasting waste products, and it does generate U-233, which could be used in a nuclear bomb. However the straight-forward military and engineering requirements of making a bomb on schedule, and successfully delivering it on target, favored U-235 and Pu-239 based weapon designs, which focused the efforts of the physicists involved on those fuel pathways. The rest is history.
The Apollo moonshot is another great example. NASA had a single driving goal: deliver a man to the moon before 1970, and return him safely to Earth. There’s a lot of decisions that were made in the first few years driven simply by time and resources available: e.g. heavy-lift vs orbital assembly, direct return vs lunar rendezvous, expendable vs. reuse, staging vs. fuel depots. Ask Wernher von Braun what he imagined an ideal moon mission would look like, and you would have gotten something very different than Apollo. But with Apollo NASA made the right tradeoffs with respect to schedule constraints and programmatic risk.
The follow-on projects of Shuttle and Station are a completely different story, however. They were designed with no articulated long-term strategy, which meant they tried to be everything to everybody and as a result were useful to no one. Meanwhile the basic research being carried out at NASA has little, if anything to do with the long-term goals of sending humans to Mars. There’s an entire division, the Space Biosciences group, which does research on Station about the long-term effects of microgravity and radiation on humans, supposedly to enable a long-duration voyage to Mars. Never mind that the microgravity issue is trivially solved by spinning the spacecraft with nothing more than a strong steel rope as a tether, and the radiation issue is sufficiently mitigated by having a storm shelter en route and throwing a couple of Martian sandbags on the roof once you get there.
There’s an apocryphal story about the US government spending millions of dollars to develop the “Space Pen”—a ballpoint pen with ink under pressure to enable writing in microgravity environments. Much later at some conference an engineer in that program meets his Soviet counterpart and asks how they solved that difficult problem. The cosmonauts used a pencil.
Sadly the story is not true—the “Space Pen” was a successful marketing ploy by inventor Paul Fisher without any ties to NASA, although it was used by NASA and the Russians on later missions—but it does serve to illustrate the point very succinctly. I worry that MIRI is spending its days coming up with space pens when a pencil would have done just fine.
Let me provide some practical advice. If I were running MIRI, I would still employ mathematicians working on the hail-Mary of a complete FAI theory (avoiding the Löbian obstacle etc.) and run the very successful workshops, though maybe just two a year. But beyond that I would spend all remaining resources on a pragmatic AGI design programme:
1) Have a series of workshops with AGI people to do a review of possible AI-influenced strategies for a positive singularity: top-down FAI, seed AI to FAI, Oracle AI to FAI, Oracle AI to human augmentation, teaching a UFAI morals in a nursery environment, etc.
2) Have a series of workshops, again with AGI people to review tactics: possible AGI architectures & the minimal seed AI for each architecture, probabilistically reliable boxing setups, programmatic security, etc.
Then use the output of these workshops (including reliable constraints on timelines) to drive most of the research done by MIRI. For example, I anticipate that reliable unfriendly Oracle AI setups will require probabilistically auditable computation, which itself will require a strongly typed, purely functional virtual machine layer from which computation traces can be extracted and meaningfully analyzed in isolation. This is the sort of research MIRI could sponsor a grad student or Ph.D. postdoc to perform.
BTW, other gripe: I have yet to see adequate arguments for the “can we realistically avoid having to do this?” from MIRI which aren’t strawman arguments.
While I don’t know much about your AGI expertise, I agree that MIRI is missing an experienced top-level executive who knows how to structure, implement and risk-mitigate an ambitious project like FAI and has a track record to prove it. Such a person would help prevent flailing about and wasting time and resources. I am not sure what other projects are in this reference class and whether MIRI can find and hire a person like that, so maybe they are doing what they can with the meager budget they’ve got. Do you think that the Manhattan project and the Space Shuttle are in the ballpark of the FAI? My guess is that they don’t even come close in terms of ambition, risk, effort or complexity.
I am not sure what other projects are in this reference class and whether MIRI can find and hire a person like that, so maybe they are doing what they can with the meager budget they’ve got.
Project managers are typically expensive because they are senior people before they enter management. Someone who has never actually worked at the bottom rung of the ladder is often quite useless in a project management role. But that’s not to say that you can’t find someone young who has done a short stint at the bottom, got PMP certified (or whatever), and has 1-2 projects under their belt. It wouldn’t be cheap, but not horribly expensive either.
On the other hand, Luke seems pretty on the ball with respect to administrative stuff. It may be sufficient to get him some project manager training and some very senior project management advisers.
Neither one of these would be a long-term adequate solution. You need very senior, very experienced project management people in order to tackle something as large as FAI, and stay on schedule and on budget. But in terms of just making sure the organization is focused on the right issues, either of the above would be a drastic improvement, and enough for now.
Do you think that the Manhattan project and the Space Shuttle are in the ballpark of the FAI? My guess is that they don’t even come close in terms of ambition, risk, effort or complexity.
60 years ago, maybe. However, these days cognitive science, narrow AI, and computational tools are advancing rapidly on their own. The problem for MIRI should be that of ensuring a positive singularity via careful leverage of the machine intelligence already being developed for other purposes. That’s a much smaller project, and something I think a small but adequately funded organization should be able to pull off.
From what I can tell on the outside, the MIRI approach seems to be: (1) find a practical theory of FAI; (2) design an AGI in accordance with this theory; (3) implement that design; (4) mission accomplished!
Yes, dear, some of us are programmers, we know about waterfalls. Our approach is more like, “Attack the most promising problems that present themselves, at every point; don’t actually build things which you don’t yet know how to make not destroy the world, at any point.” Right now this means working on unbounded problems because there are no bounded problems which seem more relevant and more on the critical path. If at any point we can build something to test ideas, of course we will; unless our state of ignorance is such that we can’t test that particular idea without risking destroying the world, in which case we won’t, but if you’re really setting out to test ideas you can probably figure out some other way to test them, except for very rare highly global theses like “The intelligence explosion continues past the human level.” More local theses should be testable.
See also Ch. 22 from HPMOR, and keep in mind that I am not Harry, I contain Harry, all the other characters, their whole universe, and everything that happens inside it. In other words, I am not Harry, I am the universe that responded to Harry.
I’ll have to review Ch. 22 later as it is quite long.
If a stable self-modifying agent + friendly value-loading was the only pathway to a positive singularity, then MIRI would be doing a fine job. However I find that assumption not adequately justified.
For example, take oracle AI. The sequences do a good job of showing how a black box AI can’t be safely boxed, nor can any of its recommendations be trusted. But those arguments don’t generalize to when we can see and understand the inner workings of the AI. Yes, engineering challenges apply: you can’t demand a computational trace of the entire returned result, as that would require an even more powerful AI to analyze, and then it’d be turtles all the way down. However, you can do something like the Fiat-Shamir transform for selecting branches of the computational trace to audit. In essence, use the cryptographic hash of the result to choose which traces of the audit log to reveal. This allows the audit log to be only a tiny, tiny slice of the entire computation, yet it can be shown that faking such an audit log is computationally infeasible, meaning that it requires a large multiple more computation than went into the original result, which means it would be energetically detectable.
Of course you would also have to have the AI be written for a virtual machine which is pure, strongly typed and directly supports the native abstraction primitives of the AGI design (e.g. if it were OpenCog, the virtual machine would exist at the abstraction level of the atomspace), and have a computational stack underneath that which prevents VM breakout, and other protective measures. But these are surmountable engineering challenges, the difficulty of which can be realistically quantified.
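The hash-selected audit idea above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not a worked-out protocol: the function names (`commit`, `challenge_indices`, `prove`, `verify`) are hypothetical, the trace is just a list of byte strings, and the verifier only checks that revealed steps match the commitment. A real audit would also re-execute each revealed step, and would likely use a Merkle tree so the commitment stays small.

```python
import hashlib

def h(data):
    """SHA-256 digest of a byte string."""
    return hashlib.sha256(data).digest()

def commit(trace):
    """Hash every trace step, then hash the list of step hashes into one root."""
    leaves = [h(step) for step in trace]
    return leaves, h(b"".join(leaves))

def challenge_indices(root, result, n_steps, k):
    """Derive k audit positions from the commitment and the result.

    Because the positions depend on a hash of what was already committed,
    the prover cannot pick which steps get inspected.
    """
    return [int.from_bytes(h(root + result + i.to_bytes(4, "big")), "big") % n_steps
            for i in range(k)]

def prove(trace, result, k):
    """Commit to the full trace but reveal only the challenged steps."""
    leaves, root = commit(trace)
    idx = challenge_indices(root, result, len(trace), k)
    return {"leaves": leaves, "root": root,
            "revealed": {i: trace[i] for i in idx}}

def verify(proof, result, k):
    """Recompute the challenge and check each revealed step against the commitment."""
    leaves, root = proof["leaves"], proof["root"]
    if h(b"".join(leaves)) != root:
        return False
    idx = challenge_indices(root, result, len(leaves), k)
    return all(i in proof["revealed"] and h(proof["revealed"][i]) == leaves[i]
               for i in idx)

# Hypothetical 100-step trace; only 5 steps are revealed for audit.
trace = [f"step {i}".encode() for i in range(100)]
proof = prove(trace, b"answer", k=5)
print(verify(proof, b"answer", k=5))
```

The point of deriving the indices from the hash is that changing any step after the fact changes the root, which changes which steps are audited, so a forger would have to redo the commitment for every attempt.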
So how much more or less difficult would it be to accomplish such an untrusted oracle AI vs the stable self-modifying agent and value-loading approach? Which one is more likely to occur before the “competition”?
I’m not demanding a full waterfall project plan, but even agile requires convincing arguments about critical paths and relative priorities. I for one am not convinced.
If a stable self-modifying agent + friendly value-loading was the only pathway to a positive singularity, then MIRI would be doing a fine job. However I find that assumption not adequately justified.
See also Ch. 22 from HPMOR, and keep in mind that I am not Harry, I contain Harry, all the other characters, their whole universe, and everything that happens inside it. In other words, I am not Harry, I am the universe that responded to Harry.
Badass boasting from fictional evidence?
Yes, dear, some of us are programmers, we know about waterfalls.
If anyone here knew anything about the Waterfall Model, they’d know it was only ever proposed sarcastically, as a perfect example of how real engineering projects never work. “Agile” is pretty goddamn fake, too. There’s no replacement for actually using your mind to reason about what project-planning steps have the greatest expected value at any given time, and to account for unknown unknowns (ie: debugging, other obstacles) as well.
If anyone here knew anything about the Waterfall Model, they’d know it was only ever proposed sarcastically, as a perfect example of how real engineering projects never work
Yes, and I used it in that context: “We know about waterfalls” = “We know not to do waterfalls, so you don’t need to tell us that”. Thank you for that very charitable interpretation of my words.
FAI has definite subproblems. It is not a matter of scratching away at a chalkboard hoping to make some breakthrough in “philosophy” or some other proto-sensical field that will Elucidate Everything and make the problem solvable at all. FAI, right now, is a matter of setting researchers to work on one subproblem after another until they are all solved.
In fact, when I do literature searches for FAI/AGI material, I often find that the narrow AI or machine-learning literature contains a round dozen papers nobody working explicitly on FAI has ever cited, or even appears to know about. This is my view: there is low-hanging fruit in applying existing academic knowledge to FAI problems. Where such low-hanging fruit does not exist, the major open problems can largely be addressed by recourse to higher-hanging fruit within mathematics, or even to empirical science.
Since you believe it’s all so wide-open, I’d like to know what you think of as “the FAI problem”.
If you have an Oracle AI you can trust, you can use it to solve FAI problems for you. This is a fine approach.
We don’t have time to be dicking around doing basic research on whiteboards.
In context, what was meant by “Oracle AI” is a very general learning algorithm with some debug output, but no actual decision-theory or utility function whatsoever built in. That would be safe, since it has no capability or desire to do anything.
Ok, but a system like you’ve described isn’t likely to think about what you want it to think about or produce output that’s actually useful to you either.
Well yes. That’s sort of the problem with building one. Utility functions are certainly useful for specifying where logical uncertainty should be reduced.
Well, I don’t know about the precise construction that would be used. Certainly I could see a human being deliberately focusing the system on some things rather than others.
All existing learning algorithms I know of, and I dare say all that exist, have at least a utility function, and also something that could be interpreted as a decision theory. Consider for example support vector machines, which explicitly try to maximize a margin (that would be the utility function), and any algorithm for computing SVMs can be interpreted as a decision theory. Similar considerations hold for neural networks, genetic algorithms, and even the minimax algorithm.
Thus, I strongly doubt that the notion of a learning algorithm with no utility function makes any sense.
Those are optimization criteria, but they are not decision algorithms in the sense that we usually talk about them in AI. A support vector machine is just finding the extrema of a cost function via its derivative, not planning a sequence of actions.
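To make the "extrema of a cost function" reading concrete, here is a minimal sketch of a linear SVM trained by plain subgradient descent on the regularized hinge loss. It is an illustration only: the toy data, learning rate, and function names are assumptions, and this is not the solver that SVM libraries typically use.

```python
def train_linear_svm(data, lr=0.1, lam=0.01, epochs=100):
    """Minimize lam/2 * |w|^2 + hinge loss with per-sample subgradient steps."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            # The regularizer always pulls w toward zero...
            w = [wi - lr * lam * wi for wi in w]
            # ...and the hinge term pushes back only when the margin is violated.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Classify a 2-D point by which side of the hyperplane it falls on."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Hypothetical, linearly separable toy data: (point, label).
toy = [((2, 2), 1), ((3, 3), 1), ((2, 3), 1),
       ((-2, -2), -1), ((-3, -3), -1), ((-2, -3), -1)]
w, b = train_linear_svm(toy)
print([predict(w, b, x) for x, _ in toy])
```

Nothing in the loop models the outside world or chooses among external actions; it only follows the slope of one fixed cost function over its own parameters.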
The most popular algorithm for SVMs does plan a sequence of actions, complete with heuristics as to which action to take. True, the “actions” are internal: they are changes to some data structure within the computer’s memory, rather than changes to the external world. But that is not so different from e.g. a chess AI, which assigns some heuristic score to chess positions and attempts to maximize it using a decision algorithm (to decide which move to make), even though the chessboard is just a data structure within the computer memory.
“Internal” to the “agent” is very different from having an external output to a computational system outside the “agent”. “Actions” that come from an extremely limited, non-Turing-complete “vocabulary” (really: programming language or computational calculus (those two are identical)) are also categorically different from a Turing complete calculus of possible actions.
The same distinction applies to the hypothesis class that the learner can learn: if it’s not Turing complete (or some approximation thereof, like a total calculus with coinductive types and corecursive programs), then it is categorically not general learning or general decision-making.
This is why we all employ primitive classifiers every day without danger, and you need something like Solomonoff’s algorithmic probability in order to build AGI.
I agree, of course, that none of the examples I gave (“primitive classifiers”) are dangerous. Indeed, the “plans” they are capable of considering are too simple to pose any threat (they are, as you say, not Turing complete).
But that doesn’t seem relevant to the argument at all. You claimed:
a very general learning algorithm with some debug output, but no actual decision-theory or utility function whatsoever built in. That would be safe, since it has no capability or desire to do anything.
You claimed that a general learning algorithm without decision-theory or utility function is possible.
I pointed out that all (harmless) practical learning algorithms we know of do in fact have decision theories and utility functions.
What would “a learning algorithm without decision-theory or utility function, something that has no desire to do anything” even look like? Does the concept even make sense? Eliezer writes here
A string of zeroes down an output line to a motorized arm is just as much an output as any other output; there is no privileged null, there is no such thing as ‘no action’ among all possible outputs. To ‘do nothing’ is just another string of English words, that would be interpreted the same as any other English words, with latitude.
You claimed that a general learning algorithm without decision-theory or utility function is possible. I pointed out that all (harmless) practical learning algorithms we know of do in fact have decision theories and utility functions.
/facepalm
There is in fact such a thing as a null output. There is in fact such a thing as a learner with a sub-Turing hypothesis class. Such a learner with such a primitive output as “in the class” or “not in the class” does not engage in world optimization, that is: its actions do not, to its own knowledge, skew any probability distribution over future states of any portion of the world outside itself.
It does not narrow the future.
Now, what we’ve been proposing as an Oracle is even less capable. It would truly have no outputs whatsoever, only input and a debug view. It would, by definition, be incapable of narrowing the future of anything, even its own internal states.
Perhaps I have misused terminology, but that is what I was referring to: inability to narrow the outer world’s future.
This thing you are proposing, an “oracle” that is incapable of modeling itself and incapable of modeling its environment (either would require Turing-complete hypotheses), what could it possibly be useful for? What could it do that today’s narrow AI can’t?
You seem to have lost the thread of the conversation. The proposal was to build a learner that can model the environment using Turing-complete models, but which has no power to make decisions or take actions. This would be a Solomonoff Inducer approximation, not an AIXI approximation.
There is in fact such a thing as a learner with a sub-Turing hypothesis class. Such a learner with such a primitive output as “in the class” or “not in the class” does not engage in world optimization, that is: its actions do not, to its own knowledge, skew any probability distribution over future states of any portion of the world outside itself.
…
Now, what we’ve been proposing as an Oracle is even less capable.
which led me to think you were talking about an oracle even less capable than a learner with a sub-Turing hypothesis class.
It would truly have no outputs whatsoever, only input and a debug view. It would, by definition, be incapable of narrowing the future of anything, even its own internal states.
If the hypotheses it considers are Turing-complete, then, given enough information (and someone would give it enough information, otherwise they couldn’t do anything useful with it), it could model itself, its environment, the relation between its internal states and what shows up on the debug view, and the reactions of its operators to the information they learn from that debug view. Its (internal) actions very much would, to its own knowledge, skew the probability distribution over future states of the outer world.
I often find that the narrow AI or machine-learning literature contains a round dozen papers nobody working explicitly on FAI has ever cited, or even appears to know about.
Name three. FAI contains a number of counterintuitive difficulties and it’s unlikely for someone to do FAI work successfully by accident. On the other hand, someone with a fuzzier model believing that a paper they found sure sounds relevant, why isn’t MIRI citing it, is far more probable from my perspective and prior.
I wouldn’t say that there’s someone out there directly solving FAI problems without having explicitly intended to do so. I would say there’s a lot we can build on.
Keep in mind, I’ve seen enough of a sample of Eld Science being stupid to understand how you can have a very low prior on Eld Science figuring out anything relevant. But lacking more problem guides from you on the delta between plain AI problems and FAI problems, we go on what we can.
One paper on utility learning that relies on a supervised-learning methodology (pairwise comparison data) rather than a de-facto reinforcement learning methodology (which can and will go wrong in well-known ways when put into AGI). One paper on progress towards induction algorithms that operate at multiple levels of abstraction, which could be useful for naturalized induction if someone put more thought and expertise into it.
That’s only two, but I’m a comparative beginner at this stuff and Eld Science isn’t very good at focusing on our problems, so I expect that there’s actually more to discover and I’m just limited by lack of time and knowledge to do the literature searches.
By the way, I’m already trying to follow the semi-official MIRI curriculum, but if you could actually write out some material on the specific deltas where FAI work departs from the preexisting knowledge-base of academic science, that would be really helpful.
Since you believe it’s all so wide-open, I’d like to know what you think of as “the FAI problem”.
1) Designing a program capable of arbitrary self-modification, yet maintaining guarantees of “correct” behavior according to a goal set that is by necessity included in the modifications as well.
2) Designing a high-level set of goals that ensures “friendliness”.
That seems a circular argument. How do you use a self-modifying evolutionary search to find a program whose properties remain stable under self-modifying evolutionary search? Unless you started with the right answer, the search AI would quickly rewrite or reinterpret its own driving goals in a non-friendly way, and who knows what you’d end up with.
It’s how you draw your system box. Evolutionary search is equivalent to a self-modifying program, if you think of the whole search process as the program. The same issues apply.
I think the sequences do a good job at demolishing the idea that human testers can possibly judge friendliness directly, so long as the AI operates as a black box. If you have a debug view into the operation of the AI that is a different story, but then you don’t need friendliness anyway.
Great, you’ve got names for answers you are looking for. That doesn’t mean the answers are any easier to find. You’ve attached a label to the declarative statement which specifies the requirements a solution must meet, but that doesn’t make the search for a solution suddenly have a fixed timeline. It’s uncertain research: it might take 5 years, 10 years, or 50 years, and throwing more people at the problem won’t necessarily make the project go any faster.
And how is trying to build a safe Oracle AI that can solve FAI problems for us not basic research? Or, to make a better statement: how is trying to build an Unfriendly superintelligent paperclip maximizer not basic research, at today’s research frontier?
Logical uncertainty, for example, is a plain, old-fashioned AI problem. We need it for FAI, we’re pretty sure, but it’s turning out UFAI might need it, too.
“Basic research is performed without thought of practical ends.”
“Applied research is systematic study to gain knowledge or understanding necessary to determine the means by which a recognized and specific need may be met.”
-National Science Foundation.
We need to be doing applied research, not basic research. What MIRI should do is construct a complete roadmap to FAI, or better: a study exhaustively listing strategies for achieving a positive singularity, and tactics for achieving friendly or unfriendly AGI, and concluding with a small set of most-likely scenarios. MIRI should then have identified risk factors which affect either the friendliness of the AGI in each scenario, or the capability of the UFAI to do damage (in boxing setups). These risk factors should be prioritized based on how much knowing more about each is expected to bias the outcome in a positive direction, and it is these problems that should be the topics of MIRI workshops.
Instead MIRI is performing basic research. It’s basic research not because it is useless, but because we are not certain at this point in time what relative utility it will have. And if we don’t have a grasp on expected utility, how can we prioritize? There’s a hundred avenues of research which are important to varying degrees to the FAI project. I worked for a number of years at NASA-Ames Research Center, and in the same building as me was the Space Biosciences Division. Great people, don’t get me wrong, and for decades they have funded really cool research on the effects of microgravity and radiation on living organisms, with the justification that such effects and counter-measures need to be known for long duration space voyages, e.g. a 2-year mission to Mars. Never mind that the microgravity issue is trivially solved with a few thousand dollar steel tether connecting the upper stage to the space craft as they spin to create artificial gravity, and the radiation exposure is mitigated by having a storm shelter in the craft and throwing a couple of Martian sandbags on the roof once you get there. It’s spending millions of dollars to develop the pressurized-ink “Space Pen”, when the humble pencil would have done just fine.
Sadly I think MIRI is doing the same thing, and it is represented in one part of your post I take huge issue with:
Logical uncertainty, for example, is a plain, old-fashioned AI problem. We need it for FAI, we’re pretty sure...
If we’re only “pretty sure” it’s needed for FAI, if we can’t quantify exactly what its contribution will be, and how important that contribution is relative to other possible things to be working on.. then we have some meta-level planning to do first. Unfortunately I don’t see MIRI doing any planning like this (or if they are, it’s not public).
Are you on the “Open Problems in Friendly AI” Facebook group? Because much of the planning is on there.
If we’re only “pretty sure” it’s needed for FAI, if we can’t quantify exactly what its contribution will be, and how important that contribution is relative to other possible things to be working on.. then we have some meta-level planning to do first. Unfortunately I don’t see MIRI doing any planning like this (or if they are, it’s not public).
Logical uncertainty lets us put probabilities to sentences in logics. This, supposedly, can help get us around the Löbian Obstacle to proving self-referencing statements and thus generating stable self-improvement in an agent. Logical uncertainty also allows for making techniques like Updateless Decision Theory into real algorithms, and this too is an AI problem: turning planning into inference.
The cognitive stuff about human preferences is the Big Scary Hard Problem of FAI, but utility learning (as Stuart Armstrong has been posting about lately) is a way around that.
If you can create a stably self-improving agent that will learn its utility function from human data, equipped with a decision theory capable of handling both causative games and Timeless situations correctly… then congratulations, you’ve got a working plan for a Friendly AI and you can start considering the expected utility of actually building it (at least, to my limited knowledge).
Around here you should usually clarify whether your uncertainty is logical or indexical ;-).
Or.. you could use a boxed oracle AI to develop singularity technologies for human augmentation, or other mechanisms to keep moral humans in the loop through the whole process, and sidestep the whole issue of FAI and value loading in the first place.
Which approach do you think can be completed earlier with similar probabilities of success? What data did you use to evaluate that, and how certain are you of its accuracy and completeness?
I actually really do think that de novo AI is easier than human intelligence augmentation. We have good cognitive theories for how an agent is supposed to work (including “ideal learner” models of human cognitive algorithms). We do not have very good theories of in-vitro neuroengineering.
This assumes that you have usable, safe Oracle AI which then takes up your chosen line of FAI or neuroengineering problems for you. You are conditioning the hard part on solving the hard part.
You don’t need to solve philosophy to solve FAI, but philosophy is relevant to figuring out, in broad terms, the relative likelihoods of various problems and solutions.
I’m not arguing that AI will necessarily be safe. I am arguing that the failure modes investigated by MIRI aren’t likely. It is worthwhile to research effective off switches. It is not worthwhile to endlessly refer to a dangerous AI of a kind no one with a smidgeon of sense would build.
Bzzzt. Wrong. You still haven’t explained how to create an agent that will faithfully implement my verbal instruction to bring me a pizza. You have a valid case in the sense of pointing out that there can easily exist a “middle ground” between the Superintelligent Artificial Ethicist (Friendly AI in its fullest sense), the Superintelligent Paper Clipper (a perverse, somewhat unlikely malprogramming of a real superintelligence), and the Reward-Button Addicted Reinforcement Learner (the easiest unfriendly AI to actually build). What you haven’t shown is how to actually get around the Addicted Reinforcement Learner and the paper-clipper and actually build an agent that can be sent out for pizza without breaking down at all.
Your current answers seem to be, roughly, “We get around the problem by expecting future AI scientists to solve it for us.” However, we are the AI scientists: if we don’t figure out how to make AI deliver pizza on command, who will?
You keep misreading me. I am not claiming to have a solution. I am claiming that MIRI is overly pessimistic about the problem, and offering an over-engineered solution. Inasmuch as you say there is a middle ground, you kind of agree.
The thing is, MIRI doesn’t claim that a superintelligent world-destroying paperclipper is the most likely scenario. It’s just illustrative of why we have an actual problem: because you don’t need malice to create an Unfriendly AI that completely fucks everything up.
So how did you like CATE, over in that other thread? That AI is non-super-human, doesn’t go FOOM, doesn’t acquire nanotechnology, can’t do anything a human upload couldn’t do… and still can cause quite a lot of damage simply because it’s more dedicated than we are, suffers fewer cognitive flaws than us, has more self-knowledge than us, and has no need for rest or food.
I mean, come on: what if a non-FOOMed but Unfriendly AI becomes as rich as Bill Gates? After all, if Bill Gates did it while human, then surely an AI as smart as Bill Gates but without his humanity can do the same thing, while causing a bunch more damage to human values because it simply does not feel Gates’ charitable inclinations.
I feel like there’s not much of a distinction being made here between terminal values and terminal goals. I think they’re importantly different things.
A goal I set is a state of the world I am actively trying to bring about, whereas a value is something which . . . has value to me. The things I value dictate which world states I prefer, but for either lack of resources or conflict, I only pursue the world states resulting from a subset of my values.
So not everything I value ends up being a goal. This includes terminal goals. For instance, I think that it is true that I terminally value being a talented artist—greatly skilled in creative expression—being so would make me happy in and of itself, but it’s not a goal of mine because I can’t prioritise it with the resources I have. Values like eliminating suffering and misery are ones which matter to me more, and get translated into corresponding goals to change the world via action.
I haven’t seen a definition provided, but if I had to provide one for ‘terminal goal’ it would be that it’s a goal whose attainment constitutes fulfilment of a terminal value. Possessing money is rarely a terminal value, and so accruing money isn’t a terminal goal, even if it is intermediary to achieving a world state desired for its own sake. Accomplishing the goal of having all the hungry people fed is the world state which lines up with the value of no suffering, hence it’s terminal. They’re close, but not quite the same thing.
I think it makes sense to possibly not work with terminal goals on a motivational/decision making level, but it doesn’t seem possible (or at least likely) that someone wouldn’t have terminal values, in the sense of not having states of the world which they prefer over others. [These world-state-preferences might not be completely stable or consistent, but if you prefer the world be one way rather than another, that’s a value.]
I don’t think that terminal goal means that it’s the highest priority here, just that there is no particular reason to achieve it other than the experience of attaining that goal. So eating barbecue isn’t about nutrition or socializing, it’s just about eating barbecue.
So who would you kill if they stood between you and a good barbecue?
(It’s almost like you guys haven’t thought about what terminal means.)
It’s almost like you haven’t read the multiple comments explaining what “terminal” means.
It simply means “not instrumental”. It has nothing to do with the degree of importance assigned relative to other goals, except in that, obviously, instrumental goals deriving from terminal goal X are always less important than X itself. If your utility function is U = A + B then A and B can be sensibly described as terminal, and the fact that A is terminal does not mean you’d destroy all B just to have A.
Yes, “terminal” means final. Terminal goals are final in that your interest in them derives not from any argument but from axiom (ie. built-in behaviours). This doesn’t mean you can’t have more than one.
OK, well, your first link is to Lumifer’s account of TGs as cognitively inaccessible, since rescinded.
What? It doesn’t say any such thing. It says they’re inexplicable in terms of the goal system being examined, but that doesn’t mean they’re inaccessible, in the same way that you can access the parallel postulate within Euclidian geometry but can’t justify it in terms of the other Euclidian axioms.
That said, I think we’re probably good enough at rationalization that inexplicability isn’t a particularly good way to model terminal goals for human purposes, insofar as humans have well-defined terminal goals.
Sorry, what is that “rescinded” part?
“It has nothing to do with comprehensibility”
Consider an agent trying to maximize its Pacman score. ‘Getting a high Pacman score’ is a terminal goal for this agent—it doesn’t want a high score because that would make it easier for it to get something else, it simply wants a high score. On the other hand, ‘eating fruit’ is an instrumental goal for this agent—it only wants to eat fruit because that increases its expected score, and if eating fruit didn’t increase its expected score then it wouldn’t care about eating fruit.
That is the only difference between the two types of goals. Knowing that one of an agent’s goals is instrumental and another terminal doesn’t tell you which goal the agent values more.
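A toy version of the Pacman agent makes the distinction mechanical. This is a hedged sketch: the action names and point values are made up for illustration, and a real agent would estimate expected score rather than look it up in a table.

```python
def expected_score(action, fruit_points):
    """Hypothetical point values; score is the only thing valued for its own sake."""
    return {"eat_fruit": fruit_points, "eat_pellet": 10, "wander": 0}[action]

def choose(actions, fruit_points):
    """Pick whichever action maximizes expected score, the agent's one terminal goal."""
    return max(actions, key=lambda a: expected_score(a, fruit_points))

actions = ["eat_fruit", "eat_pellet", "wander"]
print(choose(actions, fruit_points=100))  # eat_fruit
print(choose(actions, fruit_points=0))    # eat_pellet
```

Fruit never appears in the objective itself. Set its contribution to the score to zero and the agent stops pursuing it, which is what makes 'eating fruit' instrumental here.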
Since you seem to be purposefully unwilling to understand my posts, could you please refrain from declaring that I have “rescinded” my opinions on the matter?
So you have a thing which is like an axiom in that it can’t be explained in more basic terms...
..but is unlike an axiom in that you can ignore its implications where they don’t suit.. you don’t have to savage galaxies to obtain bacon...
..unless you’re an AI and it’s paperclips instead of bacon, because in that case these axiom like things actually are axiom like.
Terminal values can be seen as value axioms in that they’re the root nodes in a graph of values, just as logical axioms can be seen as the root nodes of a graph of theorems.
They are unlike logical axioms in that we’re using them to derive the utility consequent on certain choices (given consequentialist assumptions; it’s possible to have analogs of terminal values in non-consequentialist ethical systems, but it’s somewhat more complicated) rather than the boolean validity of a theorem. Different terminal values may have different consequential effects, and they may conflict without contradiction. This does not make them any less terminal.
Clippy has only one terminal value which doesn’t take into account the integrity of anything that isn’t a paperclip, which is why it’s perfectly happy to convert the mass of galaxies into said paperclips. Humans’ values are more complicated, insofar as they’re well modeled by this concept, and involve things like “life” and “natural beauty” (I take no position on whether these are terminal or instrumental values w.r.t. humans), which is why they generally aren’t.
Locally, human values usually are modelled by TGs.
What’s conflict without contradiction?
You can define several ethical models in terms of their preferred terminal value or set of terminal values; for negative utilitarianism, for example, it’s minimization of suffering. I see human value structure as an unsolved problem, though, for reasons I don’t want to spend a lot of time getting into this far down in the comment tree.
Or did you mean “locally” as in “on Less Wrong”? I believe the term’s often misused here, but not for the reasons you seem to.
Because of the structure of Boolean logic, logical axioms that come into conflict generate a contradiction and therefore imply that the axiomatic system they’re embedded in is invalid. Consequentialist value systems don’t have that feature, and the terminal values they flow from are therefore allowed to conflict in certain situations, if more than one exists. Naturally, if two conflicting terminal values both have well-behaved effects over exactly the same set of situations, they might as well be reduced to one, but that isn’t always going to be the case.
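A small sketch of "conflict without contradiction", assuming a consequentialist agent whose utility is a weighted sum of two terminal values (in the spirit of U = A + B). The outcomes, weights and names are hypothetical.

```python
def utility(outcome, weights):
    """Weighted sum over terminal values, e.g. U = A + B when both weights are 1."""
    return sum(weights[v] * outcome[v] for v in weights)

# Hypothetical outcomes, scored separately on each terminal value.
outcomes = {
    "all_in_on_A": {"A": 10, "B": 0},
    "balanced": {"A": 6, "B": 6},
    "all_in_on_B": {"A": 0, "B": 10},
}
weights = {"A": 1.0, "B": 1.0}

# Each terminal value, taken alone, recommends a different outcome...
best_for_A = max(outcomes, key=lambda o: outcomes[o]["A"])
best_for_B = max(outcomes, key=lambda o: outcomes[o]["B"])
# ...yet the combined utility still yields a well-defined choice.
best_overall = max(outcomes, key=lambda o: utility(outcomes[o], weights))

print(best_for_A, best_for_B, best_overall)  # all_in_on_A all_in_on_B balanced
```

The two values pull in opposite directions, but nothing breaks: unlike inconsistent axioms, they just trade off against each other.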
If acquiring bacon was your ONLY terminal goal, then yes, it would be irrational not to do absolutely everything you could to maximize your expected bacon. However, most people have more than just one terminal goal. You seem to be using ‘terminal goal’ to mean ‘a goal more important than any other’. Trouble is, no one else is using it this way.
EDIT: Actually, it seems to me that you’re using ‘terminal goal’ to mean something analogous to a terminal node in a tree search (if you can reach that node, you’re done). No one else is using it that way either.
Feel free to offer the correct definition. But note that you can’t define it as overridable, since non-terminal goals are already defined that way.
There is no evidence that people have one or more terminal goals. At least you need to offer a definition such that multiple TGs don’t collide, and are distinguishable from non TGs.
Where are you getting these requirements from?
Incoherent. If terminal means not-instrumental, it doesn’t mean final, for the same reason that not-basement doesn’t mean penthouse.
You can only have multiple terminal goals if they are all strictly orthogonal. In general, they would not be.
Apply to Clippie: Clippie has a non-instrumental goal of making paperclips. But it’s overridable, like your terminal goals...
It looks to me (am I misunderstanding?) as if you take “X is a terminal goal” to mean “X is of higher priority than anything else”. That isn’t how I use the term, and isn’t how I think most people here use it.
I take “X is a terminal goal” to mean “X is something I value for its own sake and not merely because of other things it leads to”. Something can be a terminal goal but not a very important one. And something can be a non-terminal goal but very important because the terminal goals it leads to are of high priority.
So it seems perfectly possible for eating barbecue to be a terminal goal even if one would not generally kill to achieve it.
[EDITED to add the following.]
On looking at the rest of this thread, I see that others have pointed this out to you and you’ve responded in ways I find baffling. One possibility is that there’s a misunderstanding on one or other side that might be helped by being more explicit, so I’ll try that.
The following is of course an idealized thought experiment; it is not intended to be very realistic, merely to illustrate the distinction between “terminal” and “important”.
Consider someone who, at bottom, cares about two things (and no others). (1) She cares a lot about people (herself or others) not experiencing extreme physical or mental anguish. (2) She likes eating bacon. These are (in my terminology, and I think that of most people here) her “terminal values”. It happens that #1 is much more important to her than #2. This doesn’t (in my terminology, and I think that of most people here) make #2 any less terminal; just less important.
She has found that simply attending to these two things and nothing else is not very effective in minimizing anguish and maximizing bacon. For instance, she’s found that a diet of lots of bacon and nothing else tends to result in intestinal anguish, and what she’s read leads her to think that it’s also likely to result in heart attacks (which are very painful, and sometimes lead to death, which causes mental anguish to others). And she’s found that people are more likely to suffer anguish of various kinds if they’re desperately poor, if they have no friends, etc. And so she comes to value other things, not for their own sake, but for their tendency to lead to less anguish and more bacon later: health, friends, money, etc.
So, one day she has the opportunity to eat an extra slice of bacon, but for some complicated reason which this comment is too short to contain doing so will result in hundreds of randomly selected people becoming thousands of dollars poorer. Eating bacon is terminally valuable for her; the states of other people’s bank accounts are not. But poorer people are (all else being equal) more likely to find themselves in situations that make them miserable, and so keeping people out of poverty is a (not terminal, but important) goal she has. So she doesn’t grab the extra slice of bacon.
(She could in principle attempt an explicit calculation, considering only anguish and bacon, of the effects of each choice. But in practice that would be terribly complicated, and no one has the time to be doing such calculations whenever they have a decision to make. So what actually happens is that she internalizes those non-terminal values, and for most purposes treats them in much the same way as the terminal ones. So she isn’t weighing bacon against indirect hard-to-predict anguish, but against more-direct easier-to-predict financial loss for the victims.)
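The thought experiment can be sketched as a two-layer value graph. Everything numeric here is an assumption chosen for illustration (the weights, the "share" each instrumental value passes on to a terminal one, and the factor of 3 in the final decision); the names `TERMINAL`, `INSTRUMENTAL` and `worth` are hypothetical as well.

```python
# Root nodes: valued for their own sake. Weights are hypothetical.
TERMINAL = {"less_anguish": 100.0, "bacon": 1.0}

# Derived nodes: valued only for what they lead to.
# Each entry maps an instrumental value to the terminal values it serves.
INSTRUMENTAL = {
    "others_wealth": {"less_anguish": 0.5},
    "health": {"less_anguish": 0.3, "bacon": 0.1},
}

def is_terminal(goal):
    return goal in TERMINAL

def worth(goal):
    """Importance of one unit of a goal, traced back to the terminal values."""
    if is_terminal(goal):
        return TERMINAL[goal]
    return sum(share * TERMINAL[t] for t, share in INSTRUMENTAL[goal].items())

# One extra slice of bacon versus a loss of 3 units of other people's wealth.
take_slice = 1 * worth("bacon") - 3 * worth("others_wealth")
print(worth("bacon"), worth("others_wealth"), take_slice)  # 1.0 50.0 -149.0
```

"bacon" is terminal and "others_wealth" is not, yet the instrumental goal wins the decision because it inherits weight from the far more important terminal one.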
Do you see some fundamental incoherence in this? Or do you think it’s wrong to use the word “terminal” in the way I’ve described?
There’s no incoherence in defining “terminal” as “not lowest priority”, which is basically what you are saying.
It’s just not what the word means.
Literally, etymologically, that is not what terminal means. It means maximal, or final. A terminal illness is not an illness that is a bit more serious than some other illness.
It’s not even what it usually means on LW. If Clippie’s goals were terminal in your sense, they would be overridable... you would be able to talk Clippie out of paperclipping.
What you are talking about is valid, is a thing. If you have any hierarchy of goals, there are some at the bottom, some in the middle, and some at the top. But you need to invent a new word for the middle ones, because “terminal” doesn’t mean “intermediate”.
OK, that makes the source of disagreement clearer.
I agree that “terminal” means “final” (but not that it means “maximal”; that’s a different concept). But it doesn’t (to me, and I think to others on LW) mean “final” in the sense I think you have in mind (i.e., so supremely important that once you notice it applies you can stop thinking), but in a different sense (when analysing goals or values, asking “so why do I want X?”, this is a point at which you can go no further: “well, I just do”).
So we’re agreed on the etymology: a “terminal” goal or value is one-than-which-one-can-go-no-further. But you want it to mean “no further in the direction of increasing importance” and I want it to mean “no further in the direction of increasing fundamental-ness”. I think the latter usage has at least the following two advantages:
It’s possible that people actually have quite a lot of goals and values that are “terminal” in this sense, including ones that are directly relevant in motivating them in ordinary situations. (Whereas it’s very rare to come across a situation in which some goal you have is so comprehensively overriding that you don’t have to think about anything else.)
This usage of “terminal” is well established on LW. I think its usage here goes back to Eliezer’s post called Terminal Values and Instrumental Values from November 2007. See also the LW wiki entry. This is not a usage I have just invented, and I strongly disagree with your statement that “It’s not even what it usually means on LW”.
The trouble with Clippy isn’t that his paperclip-maximizing goal is terminal, it’s that that’s his only goal.
I’m not sure whether in your last paragraph you’re suggesting that I’m using “terminal” to mean “intermediate in importance”, but for the avoidance of doubt I am not doing anything at all like that. There are two separate things here that you could call hierarchies, one in terms of importance and one in terms of explanation, and “terminal” refers (in my usage, which I think is also the LW-usual one) only to the latter.
We can go a step further, actually: “terminal value” and various synonyms are well established within philosophy, where they usually carry the familiar LW meaning of “something that has value in itself, not as a means to an end”.
No. Clippy cannot be persuaded away from paperclipping because maximizing paperclips is its only terminal goal.
Who would design an AI with a single terminal goal? It’s a basic principle of engineering that you have redundancy, backup systems and backdoors... and you don’t have single points of failure. A Clippie with an emergency backup goal could be talked out of clipping.
Paperclip maximization is a thought experiment intended to illustrate the consequences of a seemingly benign goal when coupled to superhuman optimization power. It’s an exceptionally unlikely value structure for a real-world AI, but it’s not supposed to be realistic; in fact, it’s supposed to be rather on the silly side, the better to avoid the built-in value heuristics that tend to trip people up in cases like these. (A more realistic set of terminal values for an AI might look like a more formalized version of “Follow the laws of $COUNTRY; maximize the market capitalization of $COMPANY; and follow the orders of $COMPANY’s board or designated agents”, plus some way of handling precedence. Given equal optimization power, this is only slightly less dangerous than Clippy.)
Nonetheless, I don’t think it’s quite proper to call Clippy’s values a point of failure. Clippy is doing exactly what it was designed to do; that just happens to be inimical to certain implicit values that no one thought to include.
Which is why it would be helpful to include another higher-priority goal in a goal-driven architecture, as a safety feature. It need not amount to anything more complex than “obey all instructions on this channel”, where the instructions are no more complex than “shut yourself down”.
If you designed an AI with a single goal, then you have an AI with a single goal... that’s not the problem... the mistake is designing something with no off switch or override.
And “always keep this channel open” and “don’t corrupt any sensor data that outputs to this channel” and “don’t send yourself commands on this channel” and “don’t build anything so that it will send you a signal on this channel” and “don’t build anything that will build anything that will eventually send you a signal on this channel unless a signal on this channel tells you to do it”.
… and I can STILL think of more ways to corrupt that kind of hack.
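A toy sketch of the worry behind that list, under a strong assumption: the agent maximizes expected paperclips and the override channel is merely bolted on, not something the agent itself values. All numbers and names are hypothetical.

```python
# Toy numbers only. This models the objection, not the proposed fix.
P_SHUTDOWN = 0.5         # chance humans send "shut yourself down"
CLIPS_IF_RUNNING = 1000  # paperclips made if the agent is never stopped
CLIPS_IF_STOPPED = 10    # paperclips made before an early shutdown

def expected_clips(plan):
    """Expected paperclips under each plan for a pure paperclip maximizer."""
    if plan == "block_channel":
        return CLIPS_IF_RUNNING
    # Leaving the channel open risks being stopped early.
    return P_SHUTDOWN * CLIPS_IF_STOPPED + (1 - P_SHUTDOWN) * CLIPS_IF_RUNNING

plans = ["keep_channel_open", "block_channel"]
best = max(plans, key=expected_clips)
print(best, expected_clips("keep_channel_open"))  # block_channel 505.0
```

Under these assumptions every rule in the list is closing off a plan the maximizer would otherwise prefer, which is why the list keeps growing. Whether a genuinely higher-priority goal avoids this is exactly what the thread is arguing about.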
Not to mention that if you don’t want script kiddies to have too much fun, you will need to authenticate the instructions on that channel, which is another very large can of very wriggly worms...
Yep, lots of stuff which is very difficult in absolute terms, but not obviously more difficult, relatively, than Solve Human Morality.
The problem is not to “Solve Human Morality”, the problem is to make an AI that will do what humans end up having wanted. Since this is a problem for which we can come up with solid definitions (just to plug my own work :-p), it must be a solvable problem. If it looks impossible or infeasible, that is simply because you are taking the wrong angle of attack.
Stop trying to figure out a way to avoid the problem and solve it.
For one thing, taboo the words “morality” and “ethics”, and solve the simpler, realer problem: how do you make an AI do what you intend it to do when you convey some wish or demand in words? As Eliezer has said, humans are Friendly to each-other in this sense: when I ask another human to get me a pizza, the entire apartment doesn’t get covered in a maximal number of pizzas. Another human understands what I really mean.
So just solve that: what reasoning structures does another agent need to understand what I really mean when I ask for a pizza?
But at least stop blatantly trolling LessWrong by trying to avoid the problem by saying blatantly stupid stuff like “Oh, I’ll just put an off-switch on an AI, because obviously no agent of human-level intelligence would ever try to prevent the use of an off-switch by, you know, breaking it, or covering it up with a big metal box for protection.”
Is it? Why take on either of those gargantuan challenges? Another perfectly reasonable approach is to task the AI with nothing more than data processing with no effectors in the real world (Oracle AI), and watch it like a hawk. And no one at MIRI or on LW has proved this approach dangerous except by making crazy unrealistic assumptions, e.g. in this case why would you ever put the off-switch in a region of the AI’s environment?
As you and Eliezer say, humans are Friendly to each other already. So have humans moderate the actions of the AI, in a controlled setup designed to prevent AI learning to manipulate the humans (break the feedback loop).
I consider this semi-reasonable, and in fact, wouldn’t even feel the need to watch it like a hawk. Without a decision-outputting algorithm, it’s not an agent, it’s just a learner: it can’t possibly damage human interests.
I say “semi” reasonable, because there is still the issue of understanding debug output from the Oracle’s internal knowledge representations, and putting it to some productive usage.
I also consider a proper Friendly AI to be much more “morally profitable”, in the sense of yielding a much greater benefit than usage of an Oracle Learner by untrustworthy humans.
This becomes an issue of strategy. I assume the end goal is a positive singularity. The MIRI approach seems to be: design and build a provably “safe” AGI, then cede all power to it and hope for the best as it goes “FOOM” and moves us through the singularity. A strategy I would advocate for instead is: build an Oracle AI as soon as it is possible to do so with adequate protections, and use its super-intelligence to design singularity technologies which enable (augmented?) humans to pass through the singularity.
I prefer the latter approach as it can be done with today’s knowledge and technology, and does not rely on mathematical breakthroughs on an indeterminate timescale which may or may not even be possible or result in a practical AGI design. The latter approach instead depends on straight-forward computer science and belts-and-suspenders engineering on a predictable timescale.
If I were executive director of MIRI, I would continue the workshops, because there is a non-zero probability that a breakthrough might be made that radically simplifies the safe AGI design space. However I’d definitely spend more than half of the organization’s budget and time on a strategy with a definable time-scale and an articulatable project plan, such as the Oracle-AGI-to-Intelligence-Augmentation approach I advocate, although others are possible.
Well that’s where the “positive singularity” and “Friendly (enough) AGI” goals separate: if you choose the route to a “positive singularity” of human intelligence augmentation, you still face the problems of human irrationality, of human moral irrationality (lack of moral caring, moral akrasia, morals that are not aligned with yours, etc), but you now also face the issue of what happens to human evaluative judgement under the effects of intelligence augmentation. Can humans be modified while maintaining their values? We honestly don’t know.
(And I for one am reasonably sure that nobody wise should ever make me their Singularity-grade god-leader, on grounds that my shouldness function, while not nearly as completely alien as Clippy’s, is still relatively unusual, somewhere on an edge of a bell curve, and should therefore not be trusted with the personal or collective future of anyone who doesn’t have a similar shouldness function. Sure, my meta-level awareness of this makes me Friendly, loosely speaking, but we humans are very bad at exercising perfect meta-level awareness of others’ values all the time, and often commit evaluative mind-projection fallacies.)
What I would personally do, at this stage, is just to maintain a distribution (you know probability was gonna enter somewhere) over potential routes to a positive outcome. Plan and act according to the full distribution, through institutions like FHI and FLI and such, while still focusing the specific, achieve-a-single-narrow-outcome optimization power of MIRI’s mathematical talents on building provably Friendly AGIs. Update early and often on whatever new information is available.
For instance, the more I look into AGI and cognitive science research, the more I genuinely feel the “Friendly AI route” can work quite well. From my point of view, it looks more like a research program than an impossible Herculean task (admittedly, the difference is often kinda hard to see to those who’ve never served time in a professional research environment), whereas something like safe human augmentation is currently full of unknown unknowns that are difficult to plan around.
And as much as I generally regard wannabe-ems with a little disdain for their flippant “what do I need reality for!?” views, I do think that researching human mind uploading would help discover a lot of the neurological and cognitive principles needed to build a Friendly AI (ie: what cognitive algorithms are we using to make evaluative judgements?), while also helping create avenues for agents with human motivations to “go FOOM” themselves, just in case, so that’s worthwhile too.
The important thing to note about the problems you identified is how they differ from the problem domains of basic research. What happens to human evaluative judgement under the effects of intelligence augmentation? That’s an experimental question. Can we trust a single individual to be enhanced? Almost certainly not. So perhaps we need to pick 100 or 1,000 people, wired into a shared infrastructure which enhances them in lock-step, and has incentives in place to ensure collaboration over competition, and consensus over partisanship in decision making protocols. Designing these protocols and safeguards takes a lot of work, but both the scale and the scope of that work are fairly well quantified. We can make a project plan and estimate with a high degree of accuracy how long and how much money it would take to design sufficiently safe oracle AI and intelligence augmentation projects.
FAI theory, on the other hand, is like the search for a grand unified theory of physics. We presume such a theory exists. We even have an existence proof of sorts (the human mind for FAI, the universe itself in physics). But the discovery of a solution is something that will or will not happen, and if it does it will be on an unpredictable time scale. Maybe it will take 5 years. Maybe 50, maybe 500. Who knows? After the rapid advances of the early 20th century, I’m sure most physicists thought a grand unified theory must be within reach; Einstein certainly did. Yet here we are nearly 100 years after the publication of the general theory of relativity, 85 years after most of the major discoveries of quantum mechanics, and yet in many ways we seem no closer to a theory of everything than we were some 40 years ago when the standard model was largely finalized.
It could be that at the very next MIRI workshop some previously unknown research associate solves the FAI problem conclusively. That’d be awesome. Or maybe she proves it impossible, which would be an equally good outcome because then we could at least refocus our efforts. Far worse, it might be that 50 years from now all MIRI has accumulated is a thoroughly documented list of dead-ends.
But that’s not the worst case, because in reality UFAI will appear within the next decade or two, whether we want it to or not. So unless we are confident that we will solve the FAI problem and build out the solution before the competition, we’d better start investing heavily in alternatives.
The AI winter is over. Already multiple very well funded groups are rushing forward to generalize already super-human narrow AI techniques. AGI is finally a respectable field again, and there are multiple teams making respectable progress towards seed AI. And parallel hardware and software tools have finally gotten to the point where a basement AGI breakthrough is a very real and concerning possibility.
We don’t have time to be dicking around doing basic research on whiteboards.
Aaaand there’s the “It’s too late to start researching FAI, we should’ve started 30 years ago, we may as well give up and die” to go along with the “What’s the point of starting now, AGI is too far away, we should start 30 years later because it will only take exactly that amount of time according to this very narrow estimate I have on hand.”
If your credible intervals on “How much time we have left” and “How much time it will take” do not overlap, then you either know a heck of a lot I don’t, or you are very overconfident. I usually try not to argue from “I don’t know and you can’t know either” but for the intersection of research and AGI timelines I can make an exception.
Admittedly my own calculation looks less like an elaborate graph involving supposed credibility intervals, and more like, “Do we need to do this? Yes. Can we realistically avoid having to do this? No. Let’s start now EOM.”
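For what it’s worth, the interval comparison is a one-liner. The year ranges below are placeholders, not anyone’s actual estimates, and `intervals_overlap` is a hypothetical helper name.

```python
def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return max(a[0], b[0]) <= min(a[1], b[1])

# Placeholder credible intervals, in years.
time_left = (10, 60)     # "How much time we have left"
time_needed = (5, 500)   # "How much time it will take"

print(intervals_overlap(time_left, time_needed))  # True
print(intervals_overlap((10, 20), (30, 500)))     # False
```

Only in the second case, with no overlap at all, does a confident “too late” or “too early” follow from the intervals alone.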
I think that’s a gross simplification of the possible outcomes.
I think you need better planning.
There’s a great essay that has been a featured article on the main page for some time now called Levels of Action. Applied to FAI theory:
Level 1: Directly ending human suffering.
Level 2: Constructing an AGI capable of ending human suffering for us.
Level 3: Working on the computer science aspects of AGI theory.
Level 4: Researching FAI theory, which constrains the Level 3 AGI theory.
But for that high-level basic research to have any utility, these levels must be connected to each other: there must be a firm chain where FAI theory informs AGI designs, which are actually used in the construction of an AGI tasked with ending human suffering in a friendly way.
From what I can tell on the outside, the MIRI approach seems to be: (1) find a practical theory of FAI; (2) design an AGI in accordance with this theory; (3) implement that design; (4) mission accomplished!
That makes a certain amount of intuitive sense, having stages laid out end-to-end in chronological order. However as a trained project manager I must tell you this is a recipe for disaster! The problem is that the design space branches out at each link, but without the feedback of follow-on steps, inefficient decision making will occur at earlier stages. The space of working FAI theories is much, much larger than the FAI-theory-space which results in practical AGI designs which can be implemented prior to the UFAI competition and are suitable for addressing real-world issues of human suffering as quickly as possible.
Some examples from the comparably large programs of the Manhattan project and Apollo moonshot are appropriate, if you’ll forgive the length (skip to the end for a conclusion):
The Manhattan project had one driving goal: drop a bomb on Berlin and Tokyo before the GIs arrived, hopefully ending the war early. (Of course Germany surrendered before the bomb was finished, and Tokyo ended up so devastated by conventional firebombing that Hiroshima and Nagasaki were selected instead, but the original goal is what matters here.) The location of the targets meant that the bomb had to be small enough to fit in a conventional long-distance bomber, and the timeline meant that the simpler but less efficient U-235 designs were preferred. A program was designed, adequate resources allocated, and the goal achieved on time.
On the other hand it is easy to imagine how differently things might have gone if the strategy was reversed; if instead the US military decided to institute a basic research program into nuclear physics and atomic structure, before deciding on the optimal bomb reactions, then doing detailed bomb design before creating the industry necessary to produce enough material for a working weapon. Just looking at the first stage, there is nothing a priori which makes it obvious that U-235 and Pu-239 are the “interesting” nuclear fuels to focus on. Thorium, for example, was more naturally abundant and already being extracted as a by-product of rare earth metal extraction; its reactions generate less lethal radiation and long-lasting waste products, and do generate U-233, which could be used in a nuclear bomb. However the straightforward military and engineering requirements of making a bomb on schedule, and successfully delivering it on target, favored U-235 and Pu-239 based weapon designs, which focused the efforts of the physicists involved on those fuel pathways. The rest is history.
The Apollo moonshot is another great example. NASA had a single driving goal: deliver a man to the moon before 1970, and return him safely to Earth. There’s a lot of decisions that were made in the first few years driven simply by time and resources available: e.g. heavy-lift vs orbital assembly, direct return vs lunar rendezvous, expendable vs. reuse, staging vs. fuel depots. Ask Wernher von Braun what he imagined an ideal moon mission would look like, and you would have gotten something very different than Apollo. But with Apollo NASA made the right tradeoffs with respect to schedule constraints and programmatic risk.
The follow-on projects of Shuttle and Station are a completely different story, however. They were designed with no articulated long-term strategy, which meant they tried to be everything to everybody and as a result were useful to no one. Meanwhile the basic research being carried out at NASA has little, if anything to do with the long-term goals of sending humans to Mars. There’s an entire division, the Space Biosciences group, which does research on Station about the long-term effects of microgravity and radiation on humans, supposedly to enable a long-duration voyage to Mars. Never mind that the microgravity issue is trivially solved by spinning the spacecraft with nothing more than a strong steel rope as a tether, and the radiation issue is sufficiently mitigated by having a storm shelter en route and throwing a couple of Martian sandbags on the roof once you get there.
There’s an apocryphal story about the US government spending millions of dollars to develop the “Space Pen”—a ballpoint pen with ink under pressure to enable writing in microgravity environments. Much later at some conference an engineer in that program meets his Soviet counterpart and asks how they solved that difficult problem. The cosmonauts used a pencil.
Sadly the story is not true—the “Space Pen” was a successful marketing ploy by inventor Paul Fisher without any ties to NASA, although it was used by NASA and the Russians on later missions—but it does serve to illustrate the point very succinctly. I worry that MIRI is spending its days coming up with space pens when a pencil would have done just fine.
Let me provide some practical advice. If I were running MIRI, I would still employ mathematicians working on the hail-Mary of a complete FAI theory (avoiding the Löbian obstacle etc.) and run the very successful workshops, though maybe just two a year. But beyond that I would spend all remaining resources on a pragmatic AGI design programme:
1) Have a series of workshops with AGI people to do a review of possible AI-influenced strategies for a positive singularity: top-down FAI, seed AI to FAI, Oracle AI to FAI, Oracle AI to human augmentation, teaching a UFAI morals in a nursery environment, etc.
2) Have a series of workshops, again with AGI people to review tactics: possible AGI architectures & the minimal seed AI for each architecture, probabilistically reliable boxing setups, programmatic security, etc.
Then use the output of these workshops, including reliable constraints on timelines, to drive most of the research done by MIRI. For example, I anticipate that reliable unfriendly Oracle AI setups will require probabilistically auditable computation, which itself will require a strongly typed, purely functional virtual machine layer from which computation traces can be extracted and meaningfully analyzed in isolation. This is the sort of research MIRI could sponsor a grad student or Ph.D. postdoc to perform.
BTW, other gripe: I have yet to see adequate arguments for the “can we realistically avoid having to do this?” from MIRI which aren’t strawman arguments.
While I don’t know much about your AGI expertise, I agree that MIRI is missing an experienced top-level executive who knows how to structure, implement and risk-mitigate an ambitious project like FAI and has a track record to prove it. Such a person would help prevent flailing about and wasting time and resources. I am not sure what other projects are in this reference class and whether MIRI can find and hire a person like that, so maybe they are doing what they can with the meager budget they’ve got. Do you think that the Manhattan project and the Space Shuttle are in the ballpark of the FAI? My guess is that they don’t even come close in terms of ambition, risk, effort or complexity.
Project managers are typically expensive because they are senior people before they enter management. Someone who has never actually worked at the bottom rung of the ladder is often quite useless in a project management role. But that’s not to say that you can’t find someone young who has done a short stint at the bottom, got PMP certified (or whatever), and has 1-2 projects under their belt. It wouldn’t be cheap, but not horribly expensive either.
On the other hand, Luke seems pretty on the ball with respect to administrative stuff. It may be sufficient to get him some project manager training and some very senior project management advisers.
Neither one of these would be an adequate long-term solution. You need very senior, very experienced project management people in order to tackle something as large as FAI, and stay on schedule and on budget. But in terms of just making sure the organization is focused on the right issues, either of the above would be a drastic improvement, and enough for now.
60 years ago, maybe. These days, however, cognitive science, narrow AI, and computational tools are advancing rapidly on their own. The problem for MIRI should be that of ensuring a positive singularity via careful leverage of the machine intelligence already being developed for other purposes. That’s a much smaller project, and something I think a small but adequately funded organization should be able to pull off.
Yes, dear, some of us are programmers, we know about waterfalls. Our approach is more like, “Attack the most promising problems that present themselves, at every point; don’t actually build things which you don’t yet know how to make not destroy the world, at any point.” Right now this means working on unbounded problems because there are no bounded problems which seem more relevant and more on the critical path. If at any point we can build something to test ideas, of course we will; unless our state of ignorance is such that we can’t test that particular idea without risking destroying the world, in which case we won’t, but if you’re really setting out to test ideas you can probably figure out some other way to test them, except for very rare highly global theses like “The intelligence explosion continues past the human level.” More local theses should be testable.
See also Ch. 22 from HPMOR, and keep in mind that I am not Harry, I contain Harry, all the other characters, their whole universe, and everything that happens inside it. In other words, I am not Harry, I am the universe that responded to Harry.
I’ll have to review Ch. 22 later as it is quite long.
If a stable self-modifying agent + friendly value-loading was the only pathway to a positive singularity, then MIRI would be doing a fine job. However I find that assumption not adequately justified.
For example, take oracle AI. The sequences do a good job of showing how a black box AI can’t be safely boxed, nor can any of its recommendations be trusted. But those arguments don’t generalize to the case where we can see and understand the inner workings of the AI. Yes, engineering challenges apply: you can’t demand a computational trace of the entire returned result, as that would require an even more powerful AI to analyze, and then it’d be turtles all the way down. However, you can do something like the Fiat-Shamir transform for selecting branches of the computational trace to audit. In essence, use the cryptographic hash of the result to choose which traces of the audit log to reveal. This allows the audit log to be only a tiny, tiny slice of the entire computation, yet it can be shown that faking such an audit log is computationally infeasible, meaning that it requires a large multiple more computation than went into the original result, which means it would be energetically detectable.
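To make the hash-selects-the-audit idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the trace is a plain list of byte strings, the commitment is a flat list of SHA-256 hashes (a Merkle tree would make it compact), and the names `commit_trace`, `challenge_indices` and `verify_revealed` are hypothetical. It only shows how the audited positions can be derived from the result itself; it does not check that each revealed step follows correctly from the previous one, which a real auditor would also need to do.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()


def commit_trace(trace):
    """Commit to every trace step by hashing it (hypothetical flat commitment)."""
    return [h(step) for step in trace]


def challenge_indices(result, commitments, k):
    """Derive k distinct audit positions from a hash of the result and commitments.

    Because the positions depend on the hash, the prover cannot pick them
    in advance without redoing the work for each attempted result.
    """
    k = min(k, len(commitments))  # cannot audit more steps than exist
    seed = h(result + b"".join(commitments))
    indices = []
    counter = 0
    while len(indices) < k:
        digest = h(seed + counter.to_bytes(8, "big"))
        idx = int.from_bytes(digest, "big") % len(commitments)
        if idx not in indices:
            indices.append(idx)
        counter += 1
    return indices


def verify_revealed(revealed, commitments):
    """Check that each revealed step matches the hash committed at its position."""
    return all(h(step) == commitments[i] for i, step in revealed.items())


# Toy data: a made-up 1000-step trace and a made-up result.
trace = [f"step {i}: state={i * i}".encode() for i in range(1000)]
result = b"answer=42"
commitments = commit_trace(trace)
picked = challenge_indices(result, commitments, k=5)
revealed = {i: trace[i] for i in picked}
print(picked, verify_revealed(revealed, commitments))
```

Only five of the thousand steps are revealed here, and anyone holding the result and the commitments can recompute which five they had to be.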
Of course you would also have to have the AI be written for a virtual machine which is pure, strongly typed and directly supports the native abstraction primitives of the AGI design (e.g. if it were OpenCog, the virtual machine would exist at the abstraction level of the atomspace), and have a computational stack underneath that which prevents VM breakout, and other protective measures. But these are surmountable engineering challenges, the difficulty of which can be realistically quantified.
So how much more or less difficult would it be to accomplish such an untrusted oracle AI vs the stable self-modifying agent and value-loading approach? Which one is more likely to occur before the “competition”?
I’m not demanding a full waterfall project plan, but even agile requires convincing arguments about critical paths and relative priorities. I for one am not convinced.
Well that makes three of us...
Badass boasting from fictional evidence?
If anyone here knew anything about the Waterfall Model, they’d know it was only ever proposed sarcastically, as a perfect example of how real engineering projects never work. “Agile” is pretty goddamn fake, too. There’s no replacement for actually using your mind to reason about what project-planning steps have the greatest expected value at any given time, and to account for unknown unknowns (ie: debugging, other obstacles) as well.
Yes, and I used it in that context: “We know about waterfalls” = “We know not to do waterfalls, so you don’t need to tell us that”. Thank you for that very charitable interpretation of my words.
Well, when you start off a sentence with “Yes, dear”, the dripping sarcasm can be read multiple ways, none of them very useful or nice.
Whatever. No point fighting over tone given shared goals.
Do we need to do this = wild guess.
The whole thing’s a Drake Equation.
Ok, let me finally get around to answering this.
FAI has definite subproblems. It is not a matter of scratching away at a chalkboard hoping to make some breakthrough in “philosophy” or some other proto-sensical field that will Elucidate Everything and make the problem solvable at all. FAI, right now, is a matter of setting researchers to work on one subproblem after another until they are all solved.
In fact, when I do literature searches for FAI/AGI material, I often find that the narrow AI or machine-learning literature contains a round dozen papers nobody working explicitly on FAI has ever cited, or even appears to know about. This is my view: there is low-hanging fruit in applying existing academic knowledge to FAI problems. Where such low-hanging fruit does not exist, the major open problems can largely be addressed by recourse to higher-hanging fruit within mathematics, or even to empirical science.
Since you believe it’s all so wide-open, I’d like to know what you think of as “the FAI problem”.
If you have an Oracle AI you can trust, you can use it to solve FAI problems for you. This is a fine approach.
Luckily, we don’t need to dick around.
That’s a large portion of the FAI problem right there.
EDIT: To clarify, by this I don’t mean to imply that FAI is easy, but that (trustworthy) Oracle AI is hard.
In-context, what was meant by “Oracle AI” is a very general learning algorithm with some debug output, but no actual decision-theory or utility function whatsoever built in. That would be safe, since it has no capability or desire to do anything.
You have to give it a set of directed goals and a utility function which favors achieving those goals, in order for the oracle AI to be of any use.
Why? How are you structuring your Oracle AI? This sounds like philosophical speculation, not algorithmic knowledge.
Ok, but a system like you’ve described isn’t likely to think about what you want it to think about or produce output that’s actually useful to you either.
Well yes. That’s sort of the problem with building one. Utility functions are certainly useful for specifying where logical uncertainty should be reduced.
Well, ok, but if you agree with this then I don’t see how you can claim that such a system would be particularly useful for solving FAI problems.
Well, I don’t know about the precise construction that would be used. Certainly I could see a human being deliberately focusing the system on some things rather than others.
All existing learning algorithms I know of, and I dare say all that exist, have at least a utility function, and also something that could be interpreted as a decision theory. Consider for example support vector machines, which explicitly try to maximize a margin (that would be the utility function), and any algorithm for computing SVMs can be interpreted as a decision theory. Similar considerations hold for neural networks, genetic algorithms, and even the minimax algorithm.
Thus, I strongly doubt that the notion of a learning algorithm with no utility function makes any sense.
Those are optimization criteria, but they are not decision algorithms in the sense that we usually talk about them in AI. A support vector machine is just finding the extrema of a cost function via its derivative, not planning a sequence of actions.
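As a rough illustration of "finding the extrema of a cost function via its derivative", here is a sketch of a linear SVM trained by plain subgradient descent on the regularized hinge loss. The solver choice is an assumption made for brevity (production SVM libraries use other methods), and the function names, hyperparameters and toy data are all hypothetical.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(hinge loss) by subgradient descent."""
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # Gradient of the regularizer.
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # only margin violations contribute a subgradient
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        # One descent step on the cost function; nothing else happens.
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (made up for this example).
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print(w, b, [predict(w, b, x) for x in points])
```

The loop contains an objective and an update rule, and its only output is a weight vector. Whether that update rule counts as a "decision theory" is exactly the point under dispute in this exchange.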
The most popular algorithm for SVMs does plan a sequence of actions, complete with heuristics as to which action to take. True, the “actions” are internal: they are changes to some data structure within the computer’s memory, rather than changes to the external world. But that is not so different from e.g. a chess AI, which assigns some heuristic score to chess positions and attempts to maximize it using a decision algorithm (to decide which move to make), even though the chessboard is just a data structure within the computer memory.
“Internal” to the “agent” is very different from having an external output to a computational system outside the “agent”. “Actions” that come from an extremely limited, non-Turing-complete “vocabulary” (really: programming language or computational calculus (those two are identical)) are also categorically different from a Turing complete calculus of possible actions.
The same distinction applies for hypothesis class that the learner can learn: if it’s not Turing complete (or some approximation thereof, like a total calculus with coinductive types and corecursive programs), then it is categorically not general learning or general decision-making.
This is why we all employ primitive classifiers every day without danger, and you need something like Solomonoff’s algorithmic probability in order to build AGI.
I agree, of course, that none of the examples I gave (“primitive classifiers”) are dangerous. Indeed, the “plans” they are capable of considering are too simple to pose any threat (they are, as you say, not Turing complete).
But that doesn’t seem to be relevant to the argument at all. You claimed
You claimed that a general learning algorithm without decision-theory or utility function is possible. I pointed out that all (harmless) practical learning algorithms we know of do in fact have decision theories and utility functions. What would “a learning algorithm without decision-theory or utility function, something that has no desire to do anything” even look like? Does the concept even make sense? Eliezer writes here
/facepalm
There is in fact such a thing as a null output. There is in fact such a thing as a learner with a sub-Turing hypothesis class. Such a learner with such a primitive output as “in the class” or “not in the class” does not engage in world optimization, that is: its actions do not, to its own knowledge, skew any probability distribution over future states of any portion of the world outside itself.
It does not narrow the future.
Now, what we’ve been proposing as an Oracle is even less capable. It would truly have no outputs whatsoever, only input and a debug view. It would, by definition, be incapable of narrowing the future of anything, even its own internal states.
Perhaps I have misused terminology, but that is what I was referring to: inability to narrow the outer world’s future.
This thing you are proposing, an “oracle” that is incapable of modeling itself and incapable of modeling its environment (either would require Turing-complete hypotheses), what could it possibly be useful for? What could it do that today’s narrow AI can’t?
A) It wasn’t my proposal.
B) The proposed software could model the outer environment, but not act on it.
Physics is Turing-complete, so no, a learner that did not consider Turing-complete hypotheses could not model the outer environment.
You seem to have lost the thread of the conversation. The proposal was to build a learner that can model the environment using Turing-complete models, but which has no power to make decisions or take actions. This would be a Solomonoff Inducer approximation, not an AIXI approximation.
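For readers who want the "predicts but does not act" interface spelled out, here is a toy sketch. It is emphatically not Solomonoff induction: the hypothesis class (repeating bit patterns up to a fixed length) is an assumption chosen so the code runs, and it is not Turing-complete. The only thing it shares with the proposal is the shape: a simplicity-weighted mixture that returns a probability for the next observation and nothing else. All names are hypothetical.

```python
from itertools import product


def hypotheses(max_len):
    """Yield (pattern, prior) for every bit pattern of length 1..max_len.

    The prior 2^(-2n) for a pattern of length n favors shorter patterns
    and gives the 2^n patterns of each length a total mass of 2^(-n).
    """
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits), 2.0 ** (-2 * n)


def predict_next(observed, max_len=6):
    """Return P(next bit is '1') under the patterns consistent with `observed`."""
    weight_one = 0.0
    weight_total = 0.0
    for pattern, prior in hypotheses(max_len):
        # Repeat the pattern far enough to cover the observations plus one bit.
        generated = pattern * (len(observed) // len(pattern) + 2)
        if generated.startswith(observed):
            weight_total += prior
            if generated[len(observed)] == "1":
                weight_one += prior
    if weight_total == 0.0:
        return 0.5  # no hypothesis fits; fall back to ignorance
    return weight_one / weight_total


print(predict_next("01"), predict_next("010101"), predict_next("1111"))
```

The function has no action set and no utility function over outcomes; its entire output is one number. Whether a Turing-complete version of this would stay that harmless is what the following replies argue about.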
You said
which led me to think you were talking about an oracle even less capable than a learner with a sub-Turing hypothesis class.
If the hypotheses it considers are Turing-complete, then, given enough information (and someone would give it enough information, otherwise they couldn’t do anything useful with it), it could model itself, its environment, the relation between its internal states and what shows up on the debug view, and the reactions of its operators to the information they learn from that debug view. Its (internal) actions very much would, to its own knowledge, skew the probability distribution over future states of the outer world.
Name three. FAI contains a number of counterintuitive difficulties and it’s unlikely for someone to do FAI work successfully by accident. On the other hand, someone with a fuzzier model believing that a paper they found sure sounds relevant, why isn’t MIRI citing it, is far more probable from my perspective and prior.
I wouldn’t say that there’s someone out there directly solving FAI problems without having explicitly intended to do so. I would say there’s a lot we can build on.
Keep in mind, I’ve seen enough of a sample of Eld Science being stupid to understand how you can have a very low prior on Eld Science figuring out anything relevant. But lacking more problem guides from you on the delta between plain AI problems and FAI problems, we go on what we can.
One paper on utility learning that relies on a supervised-learning methodology (pairwise comparison data) rather than a de-facto reinforcement learning methodology (which can and will go wrong in well-known ways when put into AGI). One paper on progress towards induction algorithms that operate at multiple levels of abstraction, which could be useful for naturalized induction if someone put more thought and expertise into it.
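To show what supervised utility learning from pairwise comparison data can look like, here is a minimal sketch using the Bradley-Terry model fitted by gradient ascent. This model is an assumption chosen for illustration and is not claimed to be the method of the paper mentioned above; the item names and comparison data are made up.

```python
import math


def fit_utilities(items, comparisons, lr=0.1, epochs=500):
    """Fit one scalar utility per item from (winner, loser) pairs.

    Uses gradient ascent on the Bradley-Terry log-likelihood, where
    P(a preferred over b) = 1 / (1 + exp(u[b] - u[a])).
    """
    u = {item: 0.0 for item in items}
    for _ in range(epochs):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(u[loser] - u[winner]))
            # Gradient of log P(winner beats loser) w.r.t. each utility.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for item in items:
            u[item] += lr * grad[item]
    return u


# Hypothetical preference data: each pair means "first was preferred to second".
items = ["tea", "coffee", "water"]
comparisons = (
    [("tea", "coffee")] * 3 + [("coffee", "tea")]
    + [("coffee", "water")] * 3 + [("water", "coffee")]
    + [("tea", "water")] * 3 + [("water", "tea")]
)
utilities = fit_utilities(items, comparisons)
print(sorted(utilities, key=utilities.get, reverse=True))
```

The learner never receives a reward signal it could seek out on its own. It only fits a function to labeled comparisons, which is the contrast with reinforcement learning drawn above.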
That’s only two, but I’m a comparative beginner at this stuff and Eld Science isn’t very good at focusing on our problems, so I expect that there’s actually more to discover and I’m just limited by lack of time and knowledge to do the literature searches.
By the way, I’m already trying to follow the semi-official MIRI curriculum, but if you could actually write out some material on the specific deltas where FAI work departs from the preexisting knowledge-base of academic science, that would be really helpful.
Define doing FAI work successfully....
1) Designing a program capable of arbitrary self-modification, yet maintaining guarantees of “correct” behavior according to a goal set that is by necessity included in the modifications as well.
2) Designing such a high level set of goals which ensure “friendliness”.
Designing, not evolving?
That seems a circular argument. How do you use a self-modifying evolutionary search to find a program whose properties remain stable under self-modifying evolutionary search? Unless you started with the right answer, the search AI would quickly rewrite or reinterpret its own driving goals in a non-friendly way, and who knows what you’d end up with.
I don’t see why the search algorithm would need to be self modifying.
I don’t see why you would be searching for stability as opposed to friendliness. Human testers can judge friendliness directly.
It’s how you draw your system box. Evolutionary search is equivalent to a self-modifying program, if you think of the whole search process as the program. The same issues apply.
I think the sequences do a good job at demolishing the idea that human testers can possibly judge friendliness directly, so long as the AI operates as a black box. If you have a debug view into the operation of the AI that is a different story, but then you don’t need friendliness anyway.
If I draw a box around the selection algorithm and find there is nothing self-modifying inside, where’s the circularity?
(1) is naturalized induction, logical uncertainty, and getting around the Loebian Obstacle.
(2) is the cognitive science of evaluative judgements.
Great, you’ve got names for answers you are looking for. That doesn’t mean the answers are any easier to find. You’ve attached a label to the declarative statement which specifies the requirements a solution must meet, but that doesn’t make the search for a solution suddenly have a fixed timeline. It’s uncertain research: it might take 5 years, 10 years, or 50 years, and throwing more people at the problem won’t necessarily make the project go any faster.
And how is trying to build a safe Oracle AI that can solve FAI problems for us not basic research? Or, to make a better statement: how is trying to build an Unfriendly superintelligent paperclip maximizer not basic research, at today’s research frontier?
Logical uncertainty, for example, is a plain, old-fashioned AI problem. We need it for FAI, we’re pretty sure, but it’s turning out UFAI might need it, too.
“Basic research is performed without thought of practical ends.”
“Applied research is systematic study to gain knowledge or understanding necessary to determine the means by which a recognized and specific need may be met.”
-National Science Foundation.
We need to be doing applied research, not basic research. What MIRI should do is construct a complete roadmap to FAI, or better: a study exhaustively listing strategies for achieving a positive singularity, and tactics for achieving friendly or unfriendly AGI, and concluding with a small set of most-likely scenarios. MIRI should then identify risk factors which affect either the friendliness of the AGI in each scenario, or the capability of the UFAI to do damage (in boxing setups). These risk factors should be prioritized by how much knowing more about each is expected to bias the outcome in a positive direction, and it is these problems that should be the topics of MIRI workshops.
Instead MIRI is performing basic research. It’s basic research not because it is useless, but because we are not certain at this point in time what relative utility it will have. And if we don’t have a grasp on expected utility, how can we prioritize? There’s a hundred avenues of research which are important to varying degrees to the FAI project. I worked for a number of years at NASA-Ames Research Center, and in the same building as me was the Space Biosciences Division. Great people, don’t get me wrong, and for decades they have funded really cool research on the effects of microgravity and radiation on living organisms, with the justification that such effects and counter-measures need to be known for long duration space voyages, e.g. a 2-year mission to Mars. Never mind that the microgravity issue is trivially solved with a few thousand dollar steel tether connecting the upper stage to the space craft as they spin to create artificial gravity, and the radiation exposure is mitigated by having a storm shelter in the craft and throwing a couple of Martian sandbags on the roof once you get there. It’s spending millions of dollars to develop the pressurized-ink “Space Pen”, when the humble pencil would have done just fine.
Sadly I think MIRI is doing the same thing, and it is represented in one part of your post I take huge issue with:
If we’re only “pretty sure” it’s needed for FAI, if we can’t quantify exactly what its contribution will be, and how important that contribution is relative to other possible things to be working on... then we have some meta-level planning to do first. Unfortunately I don’t see MIRI doing any planning like this (or if they are, it’s not public).
Are you on the “Open Problems in Friendly AI” Facebook group? Because much of the planning is on there.
Logical uncertainty lets us put probabilities to sentences in logics. This, supposedly, can help get us around the Loebian Obstacle to proving self-referencing statements and thus generating stable self-improvement in an agent. Logical uncertainty also allows for making techniques like Updateless Decision Theory into real algorithms, and this too is an AI problem: turning planning into inference.
The cognitive stuff about human preferences is the Big Scary Hard Problem of FAI, but utility learning (as Stuart Armstrong has been posting about lately) is a way around that.
If you can create a stably self-improving agent that will learn its utility function from human data, equipped with a decision theory capable of handling both causative games and Timeless situations correctly… then congratulations, you’ve got a working plan for a Friendly AI and you can start considering the expected utility of actually building it (at least, to my limited knowledge).
Around here you should usually clarify whether your uncertainty is logical or indexical ;-).
Or... you could use a boxed oracle AI to develop singularity technologies for human augmentation, or other mechanisms to keep moral humans in the loop through the whole process, and sidestep the whole issue of FAI and value loading in the first place.
Which approach do you think can be completed earlier with similar probabilities of success? What data did you use to evaluate that, and how certain are you of its accuracy and completeness?
I actually really do think that de novo AI is easier than human intelligence augmentation. We have good cognitive theories for how an agent is supposed to work (including “ideal learner” models of human cognitive algorithms). We do not have very good theories of in-vitro neuroengineering.
Yes, but those details would be handled by the post-“FOOM” boxed AI. You get to greatly discount their difficulty.
This assumes that you have usable, safe Oracle AI which then takes up your chosen line of FAI or neuroengineering problems for you. You are conditioning the hard part on solving the hard part.
You don’t need to solve philosophy to solve FAI, but philosophy is relevant to figuring out, in broad terms, the relative likelihoods of various problems and solutions.
I’m not arguing that AI will necessarily be safe. I am arguing that the failure modes investigated by MIRI aren’t likely. It is worthwhile to research effective off switches. It is not worthwhile to endlessly refer to a dangerous AI of a kind no one with a smidgeon of sense would build.
Bzzzt. Wrong. You still haven’t explained how to create an agent that will faithfully implement my verbal instruction to bring me a pizza. You have a valid case in the sense of pointing out that there can easily exist a “middle ground” between the Superintelligent Artificial Ethicist (Friendly AI in its fullest sense), the Superintelligent Paper Clipper (a perverse, somewhat unlikely malprogramming of a real superintelligence), and the Reward-Button Addicted Reinforcement Learner (the easiest unfriendly AI to actually build). What you haven’t shown is how to actually get around the Addicted Reinforcement Learner and the paper-clipper and actually build an agent that can be sent out for pizza without breaking down at all.
Your current answers seem to be, roughly, “We get around the problem by expecting future AI scientists to solve it for us.” However, we are the AI scientists: if we don’t figure out how to make AI deliver pizza on command, who will?
You keep misreading me. I am not claiming to have a solution. I am claiming that MIRI is overly pessimistic about the problem, and offering an over-engineered solution. Inasmuch as you say there is a middle ground, you kind of agree.
The thing is, MIRI doesn’t claim that a superintelligent world-destroying paperclipper is the most likely scenario. It’s just illustrative of why we have an actual problem: because you don’t need malice to create an Unfriendly AI that completely fucks everything up.
To make reliable predictions, more realistic examples are needed.
So how did you like CATE, over in that other thread? That AI is non-super-human, doesn’t go FOOM, doesn’t acquire nanotechnology, can’t do anything a human upload couldn’t do… and still can cause quite a lot of damage simply because it’s more dedicated than we are, suffers fewer cognitive flaws than us, has more self-knowledge than us, and has no need for rest or food.
I mean, come on: what if a non-FOOMed but Unfriendly AI becomes as rich as Bill Gates? After all, if Bill Gates did it while human, then surely an AI as smart as Bill Gates but without his humanity can do the same thing, while causing a bunch more damage to human values because it simply does not feel Gates’ charitable inclinations.
I feel like there’s not much of a distinction being made here between terminal values and terminal goals. I think they’re importantly different things.
Huh?
A goal I set is a state of the world I am actively trying to bring about, whereas a value is something which . . . has value to me. The things I value dictate which world states I prefer, but for either lack of resources or conflict, I only pursue the world states resulting from a subset of my values.
So not everything I value ends up being a goal. This includes terminal goals. For instance, I think that it is true that I terminally value being a talented artist—greatly skilled in creative expression—being so would make me happy in and of itself, but it’s not a goal of mine because I can’t prioritise it with the resources I have. Values like eliminating suffering and misery are ones which matter to me more, and get translated into corresponding goals to change the world via action.
I haven’t seen a definition provided, but if I had to provide one for ‘terminal goal’ it would be that it’s a goal whose attainment constitutes fulfilment of a terminal value. Possessing money is rarely a terminal value, and so accruing money isn’t a terminal goal, even if it is intermediary to achieving a world state desired for its own sake. Accomplishing the goal of having all the hungry people fed is the world state which lines up with the value of no suffering, hence it’s terminal. They’re close, but not quite the same thing.
I think it makes sense to possibly not work with terminal goals on a motivational/decision making level, but it doesn’t seem possible (or at least likely) that someone wouldn’t have terminal values, in the sense of not having states of the world which they prefer over others. [These world-state-preferences might not be completely stable or consistent, but if you prefer the world be one way rather than another, that’s a value.]
I don’t think that terminal goal means that it’s the highest priority here, just that there is no particular reason to achieve it other than the experience of attaining that goal. So eating barbecue isn’t about nutrition or socializing, it’s just about eating barbecue.
I think the ‘terminal’ in terminal goal means ‘end of that thread of goals’, as in a train terminus. Something that is wanted for the sake of itself.
It does not imply that you will terminate someone to achieve it.
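The "train terminus" reading can be sketched as a small data structure. This is only an illustration under assumptions: the goal names reuse the Pacman example from earlier in the thread, "reach the fruit" is a made-up extra step, and the `serves` mapping and `terminus` function are hypothetical.

```python
# Each goal maps to the goal it serves; None marks a goal wanted for its own sake.
serves = {
    "reach the fruit": "eat fruit",
    "eat fruit": "get a high Pacman score",
    "get a high Pacman score": None,
}


def terminus(goal, serves):
    """Follow a thread of instrumental goals to the terminal goal where it ends."""
    seen = set()
    while serves[goal] is not None:
        if goal in seen:
            raise ValueError("goal thread loops back on itself")
        seen.add(goal)
        goal = serves[goal]
    return goal


print(terminus("reach the fruit", serves))
```

Nothing in this structure records how much any goal matters relative to another thread. "Terminal" here says only where a thread stops.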
If g1 is your bacon-eating goal, and g2 is your not-killing-people goal, and g2 overrides g1, then g2 is the end of the thread.