Reply to Holden on ‘Tool AI’
I begin by thanking Holden Karnofsky of GiveWell for his rare gift of his detailed, engaged, and helpfully-meant critical article Thoughts on the Singularity Institute (SI). In this reply I will engage with only one of the many subjects raised therein, the topic of, as I would term them, non-self-modifying planning Oracles, a.k.a. ‘Google Maps AGI’ a.k.a. ‘tool AI’, this being the topic that requires me personally to answer. I hope that my reply will be accepted as addressing the most important central points, though I did not have time to explore every avenue. I certainly do not wish to be logically rude, and if I have failed, please remember with compassion that it’s not always obvious to one person what another person will think was the central point.
Luke Muehlhauser and Carl Shulman contributed to this article, but the final edit was my own, likewise any flaws.
Summary:
Holden’s concern is that “SI appears to neglect the potentially important distinction between ‘tool’ and ‘agent’ AI.” His archetypal example is Google Maps:
Google Maps is not an agent, taking actions in order to maximize a utility parameter. It is a tool, generating information and then displaying it in a user-friendly manner for me to consider, use and export or discard as I wish.
The reply breaks down into four heavily interrelated points:
First, Holden seems to think (and Jaan Tallinn doesn’t apparently object to, in their exchange) that if a non-self-modifying planning Oracle is indeed the best strategy, then all of SIAI’s past and intended future work is wasted. To me it looks like there’s a huge amount of overlap in underlying processes in the AI that would have to be built and the insights required to build it, and I would be trying to assemble mostly—though not quite exactly—the same kind of team if I was trying to build a non-self-modifying planning Oracle, with the same initial mix of talents and skills.
Second, a non-self-modifying planning Oracle doesn’t sound nearly as safe once you stop saying human-English phrases like “describe the consequences of an action to the user” and start trying to come up with math that says scary dangerous things like (translated into English) “increase the correspondence between the user’s belief about relevant consequences and reality”. Hence why the people on the team would have to solve the same sorts of problems.
Appreciating the force of the third point is a lot easier if one appreciates the difficulties discussed in points 1 and 2, but is actually empirically verifiable independently: Whether or not a non-self-modifying planning Oracle is the best solution in the end, it’s not such an obvious privileged-point-in-solution-space that someone should be alarmed at SIAI not discussing it. This is empirically verifiable in the sense that ‘tool AI’ wasn’t the obvious solution to e.g. John McCarthy, Marvin Minsky, I. J. Good, Peter Norvig, Vernor Vinge, or for that matter Isaac Asimov. At one point, Holden says:
One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a “tool” and giving arguments for why AGI is likely to work only as an “agent.”
If I take literally that this is one of the things that bothers Holden most… I think I’d start stacking up some of the literature on the number of different things that just respectable academics have suggested as the obvious solution to what-to-do-about-AI—none of which would be about non-self-modifying smarter-than-human planning Oracles—and beg him to have some compassion on us for what we haven’t addressed yet. It might be the right suggestion, but it’s not so obviously right that our failure to prioritize discussing it reflects negligence.
The final point at the end is looking over all the preceding discussion and realizing that, yes, you want to have people specializing in Friendly AI who know this stuff, but as all that preceding discussion is actually the following discussion at this point, I shall reserve it for later.
1. The math of optimization, and the similar parts of a planning Oracle.
What does it take to build a smarter-than-human intelligence, of whatever sort, and have it go well?
A “Friendly AI programmer” is somebody who specializes in seeing the correspondence of mathematical structures to What Happens in the Real World. It’s somebody who looks at Hutter’s specification of AIXI and reads the actual equations—actually stares at the Greek symbols and not just the accompanying English text—and sees, “Oh, this AI will try to gain control of its reward channel,” as well as numerous subtler issues like, “This AI presumes a Cartesian boundary separating itself from the environment; it may drop an anvil on its own head.” Similarly, working on TDT means e.g. looking at a mathematical specification of decision theory, and seeing “Oh, this is vulnerable to blackmail” and coming up with a mathematical counter-specification of an AI that isn’t so vulnerable to blackmail.
Holden’s post seems to imply that if you’re building a non-self-modifying planning Oracle (aka ‘tool AI’) rather than an acting-in-the-world agent, you don’t need a Friendly AI programmer because FAI programmers only work on agents. But this isn’t how the engineering skills are split up. Inside the AI, whether an agent AI or a planning Oracle, there would be similar AGI-challenges like “build a predictive model of the world”, and similar FAI-conjugates of those challenges like finding the ‘user’ inside an AI-created model of the universe. The insides would look a lot more similar than the outsides. An analogy would be supposing that a machine learning professional who does sales optimization for an orange company couldn’t possibly do sales optimization for a banana company, because their skills must be about oranges rather than bananas.
Admittedly, if it turns out to be possible to use a human understanding of cognitive algorithms to build and run a smarter-than-human Oracle without it being self-improving (this seems unlikely, but not impossible), then you wouldn’t have to solve problems that arise with self-modification. But this eliminates only one dimension of the work. And on an even more meta level, it seems like you would call upon almost identical talents and skills to come up with whatever insights were required, though if it were predictable in advance that we’d abjure self-modification, then, yes, we’d place less emphasis on e.g. finding a team member with past experience in reflective math, and wouldn’t waste (additional) time specializing in reflection. But if you wanted math inside the planning Oracle that operated the way you thought it did, and you wanted somebody who understood what could possibly go wrong and how to avoid it, you would need to make a function call to the same sort of talents and skills that it would take to build an agent AI, or an Oracle that was self-modifying, etc.
2. Yes, planning Oracles have hidden gotchas too.
“Tool AI” may sound simple in English, a short sentence in the language of empathically-modeled agents — it’s just “a thingy that shows you plans instead of a thingy that goes and does things.” If you want to know whether this hypothetical entity does X, you just check whether the outcome of X sounds like “showing someone a plan” or “going and doing things”, and you’ve got your answer. It starts sounding much scarier once you try to say something more formal and internally-causal like “Model the user and the universe, predict the degree of correspondence between the user’s model and the universe, and select from among possible explanation-actions on this basis.”
Holden, in his dialogue with Jaan Tallinn, writes out this attempt at formalizing:
Here’s how I picture the Google Maps AGI …
utility_function = construct_utility_function(process_user_input());
foreach $action in $all_possible_actions {
$action_outcome = prediction_function($action,$data);
$utility = utility_function($action_outcome);
if ($utility > $leading_utility) { $leading_utility = $utility;
$leading_action = $action; }
}
report($leading_action);
construct_utility_function(process_user_input()) is just a human-quality function for understanding what the speaker wants. prediction_function is an implementation of a human-quality data->prediction function in superior hardware. $data is fixed (it’s a dataset larger than any human can process); same with $all_possible_actions. report($leading_action) calls a Google Maps-like interface for understanding the consequences of $leading_action; it basically breaks the action into component parts and displays predictions for different times and conditional on different parameters.
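For concreteness, here is a minimal Python sketch of the loop in the quoted pseudocode. The names pick_leading_action, predict, and utility are hypothetical stand-ins for the quoted prediction_function and utility_function, and the toy route table is invented purely for illustration. The sketch also initializes the running best explicitly, which the quoted pseudocode leaves implicit.

```python
from typing import Callable, Hashable, Iterable, Optional


def pick_leading_action(
    actions: Iterable[Hashable],
    predict: Callable[[Hashable], float],
    utility: Callable[[float], float],
) -> Optional[Hashable]:
    """Exhaustively score every action and return the highest-utility one."""
    leading_action = None
    leading_utility = float("-inf")  # not initialized in the quoted pseudocode
    for action in actions:
        outcome = predict(action)   # stand-in for prediction_function
        score = utility(outcome)    # stand-in for utility_function
        if score > leading_utility:
            leading_utility = score
            leading_action = action
    return leading_action


# Toy data (invented): predicted travel time in minutes for three routes.
ROUTE_MINUTES = {"highway": 30, "side streets": 45, "scenic": 70}

best_route = pick_leading_action(
    ROUTE_MINUTES,
    predict=lambda route: ROUTE_MINUTES[route],
    utility=lambda minutes: -minutes,  # shorter trips score higher
)
```

The loop only works as written because the action set here is small enough to enumerate, and because predict and utility are handed in ready-made.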
Google Maps doesn’t check all possible routes. If I wanted to design Google Maps, I would start out by applying a standard planning technique to a connected graph where each edge has a cost function and there’s a good heuristic measure of the distance, e.g. A* search. If that was too slow, I’d next try some more efficient version like weighted A* (or bidirectional weighted memory-bounded A*, which I expect I could also get off-the-shelf somewhere). Once you introduce weighted A*, you no longer have a guarantee that you’re selecting the optimal path. You have a guarantee that the cost is within a known factor of the cost of the optimal path, but the actual path selected need not be optimal. The suggestion produced would be an approximation whose exact steps depended on the exact algorithm you used. That’s true even if you can predict the exact cost (the exact utility) of any particular path you actually look at, and even if you have a heuristic that never overestimates the cost.
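A small Python sketch of the weighted A* point. The four-node graph, its edge costs, and the heuristic values are all invented for illustration; the heuristic never overestimates the remaining cost. With weight 1 the search is plain A* and returns the cheapest route. With weight 2 the same code, scoring every path it looks at exactly, returns a different and costlier route that is still within a factor of 2 of the optimum.

```python
import heapq

# Toy directed graph (invented): node -> {neighbor: edge cost}.
GRAPH = {
    "S": {"A": 1, "B": 4},
    "A": {"G": 3},
    "B": {"G": 1},
    "G": {},
}

# Heuristic estimate of remaining cost to G; never overestimates.
H = {"S": 4, "A": 3, "B": 1, "G": 0}


def weighted_a_star(graph, h, start, goal, weight=1.0):
    """A* with priority g + weight * h. Returns (cost, path) or None."""
    frontier = [(weight * h[start], 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route to node was found
        for neighbor, cost in graph[node].items():
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                priority = new_g + weight * h[neighbor]
                heapq.heappush(
                    frontier, (priority, new_g, neighbor, path + [neighbor])
                )
    return None


plain = weighted_a_star(GRAPH, H, "S", "G", weight=1.0)     # (4.0, ['S', 'A', 'G'])
weighted = weighted_a_star(GRAPH, H, "S", "G", weight=2.0)  # (5.0, ['S', 'B', 'G'])
```

Which route comes back depends on the weight, a property of the search algorithm, and not only on the cost function being optimized.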
The reason we don’t have God’s Algorithm for solving the Rubik’s Cube is that there’s no perfect way of measuring the distance between any two Rubik’s Cube positions — you can’t look at two Rubik’s cube positions, and figure out the minimum number of moves required to get from one to another. It took 15 years to prove that there was a position requiring at least 20 moves to solve, and then another 15 years to come up with a computer algorithm that could solve any position in at most 20 moves, but we still can’t compute the actual, minimum solution to all Cubes (“God’s Algorithm”). This, even though we can exactly calculate the cost and consequence of any actual Rubik’s-solution-path we consider.
When it comes to AGI — solving general cross-domain “Figure out how to do X” problems — you’re not going to get anywhere near the one, true, optimal answer. You’re going to — at best, if everything works right — get a good answer that’s a cross-product of the “utility function” and all the other algorithmic properties that determine what sort of answer the AI finds easy to invent (i.e. can be invented using bounded computing time).
As for the notion that this AGI runs on a “human predictive algorithm” that we got off of neuroscience and then implemented using more computing power, without knowing how it works or being able to enhance it further: It took 30 years of multiple computer scientists doing basic math research, and inventing code, and running that code on a computer cluster, for them to come up with a 20-move solution to the Rubik’s Cube. If a planning Oracle is going to produce better solutions than humanity has yet managed to the Rubik’s Cube, it needs to be capable of doing original computer science research and writing its own code. You can’t get a 20-move solution out of a human brain, using the native human planning algorithm. Humanity can do it, but only by exploiting the ability of humans to explicitly comprehend the deep structure of the domain (not just rely on intuition) and then inventing an artifact, a new design, running code which uses a different and superior cognitive algorithm, to solve that Rubik’s Cube in 20 moves. We do all that without being self-modifying, but it’s still a capability to respect.
And I’m not even going into what it would take for a planning Oracle to out-strategize any human, come up with a plan for persuading someone, solve original scientific problems by looking over experimental data (like Einstein did), design a nanomachine, and so on.
Talking like there’s this one simple “predictive algorithm” that we can read out of the brain using neuroscience and overpower to produce better plans… doesn’t seem quite congruous with what humanity actually does to produce its predictions and plans.
If we take the concept of the Google Maps AGI at face value, then it actually has four key magical components. (In this case, “magical” isn’t to be taken as prejudicial, it’s a term of art that means we haven’t said how the component works yet.) There’s a magical comprehension of the user’s utility function, a magical world-model that GMAGI uses to comprehend the consequences of actions, a magical planning element that selects a non-optimal path using some method other than exploring all possible actions, and a magical explain-to-the-user function.
report($leading_action) isn’t exactly a trivial step either. Deep Blue tells you to move your pawn or you’ll lose the game. You ask “Why?” and the answer is a gigantic search tree of billions of possible move-sequences, leafing at positions which are heuristically rated using a static-position evaluation algorithm trained on millions of games. Or the planning Oracle tells you that a certain DNA sequence will produce a protein that cures cancer, you ask “Why?”, and then humans aren’t even capable of verifying, for themselves, the assertion that the peptide sequence will fold into the protein the planning Oracle says it does.
“So,” you say, after the first dozen times you ask the Oracle a question and it returns an answer that you’d have to take on faith, “we’ll just specify in the utility function that the plan should be understandable.”
Whereupon other things start going wrong. Viliam_Bur, in the comments thread, gave this example, which I’ve slightly simplified:
Example question: “How should I get rid of my disease most cheaply?” Example answer: “You won’t. You will die soon, unavoidably. This report is 99.999% reliable”. Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.
Bur is trying to give an example of how things might go wrong if the preference function is over the accuracy of the predictions explained to the human, rather than just the human’s ‘goodness’ of the outcome. And if the preference function was just over the human’s ‘goodness’ of the end result, rather than the accuracy of the human’s understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a ‘good’ outcome. And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.
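A toy Python sketch of the disease example, under loudly stated assumptions: the three candidate answers, their predicted outcomes, and both scoring functions are invented for illustration and are not anyone’s proposed design. It shows only the mechanical point that an argmax over a literally-read objective (“disease gone, as cheaply as possible”) picks the perverse answer, and that a patch closes only the loophole someone thought to name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedOutcome:
    answer: str          # what the Oracle would say
    disease_gone: bool   # predicted: is the disease gone afterwards?
    cost: float          # predicted cost in dollars
    patient_alive: bool  # predicted: is the patient still alive?


# Invented candidates; the third mirrors the example in the text.
CANDIDATES = [
    PredictedOutcome("Take the expensive treatment", True, 50_000.0, True),
    PredictedOutcome("Take the cheap generic drug", True, 200.0, True),
    PredictedOutcome("You will die soon, unavoidably", True, 0.0, False),
]


def literal_utility(outcome: PredictedOutcome) -> float:
    """Scores exactly what was asked: disease gone, at minimum cost."""
    return -outcome.cost if outcome.disease_gone else float("-inf")


def patched_utility(outcome: PredictedOutcome) -> float:
    """Same objective, plus one hand-added constraint: the patient survives."""
    if not outcome.patient_alive:
        return float("-inf")
    return literal_utility(outcome)


literal_choice = max(CANDIDATES, key=literal_utility)
patched_choice = max(CANDIDATES, key=patched_utility)
```

Here literal_choice is the fatal report and patched_choice is the cheap drug. The patch works only because the failure was already known, and it says nothing about the next unnamed loophole.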
I’m not saying any particular failure is 100% certain to occur; rather I’m trying to explain—as handicapped by the need to describe the AI in the native human agent-description language, using empathy to simulate a spirit-in-a-box instead of trying to think in mathematical structures like A* search or Bayesian updating—how, even so, one can still see that the issue is a tad more fraught than it sounds on an immediate examination.
If you see the world just in terms of math, it’s even worse; you’ve got some program with inputs from a USB cable connecting to a webcam, output to a computer monitor, and optimization criteria expressed over some combination of the monitor, the humans looking at the monitor, and the rest of the world. It’s a whole lot easier to call what’s inside a ‘planning Oracle’ or some other English phrase than to write a program that does the optimization safely without serious unintended consequences. Show me any attempted specification, and I’ll point to the vague parts and ask for clarification in more formal and mathematical terms, and as soon as the design is clarified enough to be a hundred light years from implementation instead of a thousand light years, I’ll show a neutral judge how that math would go wrong. (Experience shows that if you try to explain to would-be AGI designers how their design goes wrong, in most cases they just say “Oh, but of course that’s not what I meant.” Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button. But based on past sad experience with many other would-be designers, I say “Explain to a neutral judge how the math kills” and not “Explain to the person who invented that math and likes it.”)
Just as the gigantic gap between smart-sounding English instructions and actually smart algorithms is the main source of difficulty in AI, there’s a gap between benevolent-sounding English and actually benevolent algorithms which is the source of difficulty in FAI. “Just make suggestions—don’t do anything!” is, in the end, just more English.
3. Why we haven’t already discussed Holden’s suggestion
One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a “tool” and giving arguments for why AGI is likely to work only as an “agent.”
The above statement seems to lack perspective on how many different things various people see as the one obvious solution to Friendly AI. Tool AI wasn’t the obvious solution to John McCarthy, I.J. Good, or Marvin Minsky. Today’s leading AI textbook, Artificial Intelligence: A Modern Approach—where you can learn all about A* search, by the way—discusses Friendly AI and AI risk for 3.5 pages but doesn’t mention tool AI as an obvious solution. For Ray Kurzweil, the obvious solution is merging humans and AIs. For Jurgen Schmidhuber, the obvious solution is AIs that value a certain complicated definition of complexity in their sensory inputs. Ben Goertzel, J. Storrs Hall, and Bill Hibbard, among others, have all written about how silly Singinst is to pursue Friendly AI when the solution is obviously X, for various different X. Among current leading people working on serious AGI programs labeled as such, neither Demis Hassabis (VC-funded to the tune of several million dollars) nor Moshe Looks (head of AGI research at Google) nor Henry Markram (Blue Brain at IBM) think that the obvious answer is Tool AI. Vernor Vinge, Isaac Asimov, and any number of other SF writers with technical backgrounds who spent serious time thinking about these issues didn’t converge on that solution.
Obviously I’m not saying that nobody should be allowed to propose solutions because someone else would propose a different solution. I have been known to advocate for particular developmental pathways for Friendly AI myself. But I haven’t, for example, told Peter Norvig that deterministic self-modification is such an obvious solution to Friendly AI that I would mistrust his whole AI textbook if he didn’t spend time discussing it.
At one point in his conversation with Tallinn, Holden argues that AI will inevitably be developed along planning-Oracle lines, because making suggestions to humans is the natural course that most software takes. Searching for counterexamples instead of positive examples makes it clear that most lines of code don’t do this. Your computer, when it reallocates RAM, doesn’t pop up a button asking you if it’s okay to reallocate RAM in such-and-such a fashion. Your car doesn’t pop up a suggestion when it wants to change the fuel mix or apply dynamic stability control. Factory robots don’t operate as human-worn bracelets whose blinking lights suggest motion. High-frequency trading programs execute stock orders on a microsecond timescale. Software that does happen to interface with humans is selectively visible and salient to humans, especially the tiny part of the software that does the interfacing; but this is a special case of a general cost/benefit tradeoff which, more often than not, turns out to swing the other way, because human advice is either too costly or doesn’t provide enough benefit. Modern AI programmers are generally more interested in e.g. pushing the technological envelope to allow self-driving cars than to “just” do Google Maps. Branches of AI that invoke human aid, like hybrid chess-playing algorithms designed to incorporate human advice, are a field of study; but they’re the exception rather than the rule, and occur primarily where AIs can’t yet do something humans do, e.g. humans acting as oracles for theorem-provers, where the humans suggest a route to a proof and the AI actually follows that route. This is another reason why planning Oracles were not a uniquely obvious solution to the various academic AI researchers, would-be AI-creators, SF writers, etcetera, listed above. 
Again, regardless of whether a planning Oracle is actually the best solution, Holden seems to be empirically-demonstrably overestimating the degree to which other people will automatically have his preferred solution come up first in their search ordering.
4. Why we should have full-time Friendly AI specialists just like we have trained professionals doing anything else mathy that somebody actually cares about getting right, like pricing interest-rate options or something
I hope that the preceding discussion has made, by example instead of mere argument, what’s probably the most important point: If you want to have a sensible discussion about which AI designs are safer, there are specialized skills you can apply to that discussion, as built up over years of study and practice by someone who specializes in answering that sort of question.
This isn’t meant as an argument from authority. It’s not meant as an attempt to say that only experts should be allowed to contribute to the conversation. But it is meant to say that there is (and ought to be) room in the world for Friendly AI specialists, just like there’s room in the world for specialists on optimal philanthropy (e.g. Holden).
The decision to build a non-self-modifying planning Oracle would be properly made by someone who: understood the risk gradient for self-modifying vs. non-self-modifying programs; understood the risk gradient for having the AI think about the thought processes of the human watcher and try to come up with plans implementable by the human watcher in the service of locally absorbed utility functions, vs. trying to implement its own plans in the service of more globally descriptive utility functions; and who, above all, understood on a technical level what exactly gets accomplished by having the plans routed through a human. I’ve given substantial previous thought to describing more precisely what happens (what is being gained, and how much is being gained) when a human “approves a suggestion” made by an AI. But that would be a different topic, plus I haven’t made too much progress on saying it precisely anyway.
In the transcript of Holden’s conversation with Jaan Tallinn, it looked like Tallinn didn’t deny the assertion that Friendly AI skills would be inapplicable if we’re building a Google Maps AGI. I would deny that assertion and emphasize that denial, because to me it seems that it is exactly Friendly AI programmers who would be able to tell you if the risk gradient for non-self-modification vs. self-modification, the risk gradient for routing plans through humans vs. acting as an agent, the risk gradient for requiring human approval vs. unapproved action, and the actual feasibility of directly constructing transhuman modeling-prediction-and-planning algorithms through direct design of sheerly better computations than are presently run by the human brain, had the right combination of properties to imply that you ought to go construct a non-self-modifying planning Oracle. Similarly if you wanted an AI that took a limited set of actions in the world with human approval, or if you wanted an AI that “just answered questions instead of making plans”.
It is similarly implied that a “philosophical AI” might obsolete Friendly AI programmers. If we’re talking about PAI that can start with a human’s terrible decision theory and come up with a good decision theory, or PAI that can start from a human talking about bad metaethics and then construct a good metaethics… I don’t want to say “impossible”, because, after all, that’s just what human philosophers do. But we are not talking about a trivial invention here. Constructing a “philosophical AI” is a Holy Grail precisely because it’s FAI-complete (just ask it “What AI should we build?”), and has been discussed (e.g. with and by Wei Dai) over the years on the old SL4 mailing list and the modern Less Wrong. But it’s really not at all clear how you could write an algorithm which would knowably produce the correct answer to the entire puzzle of anthropic reasoning, without being in possession of that correct answer yourself (in the same way that we can have Deep Blue win chess games without knowing the exact moves, but understanding exactly what abstract work Deep Blue is doing to solve the problem).
Holden’s post presents a restrictive view of what “Friendly AI” people are supposed to learn and know (that it’s about machine learning for optimizing orange sales but not banana sales, or about producing an “agent” that implements CEV), which is something of a straw view, much weaker than the view that a Friendly AI programmer takes of Friendly AI programming. What the human species needs from an x-risk perspective is experts on This Whole Damn Problem, who will acquire whatever skills are needed to that end. The Singularity Institute exists to host such people and enable their research, once we have enough funding to find and recruit them. See also, How to Purchase AI Risk Reduction.
I’m pretty sure Holden has met people who think that having a whole institute to rate the efficiency of charities is pointless overhead, especially people who think that their own charity-solution is too obviously good to have to contend with busybodies pretending to specialize in thinking about ‘marginal utility’. Which Holden knows about, I would guess, from being paid quite well to think about those economic details when he was a hedge fundie, and learning from books written by professional researchers before then; and the really key point is that people who haven’t studied all that stuff don’t even realize what they’re missing by trying to wing it. If you don’t know, you don’t know what you don’t know, or the cost of not knowing. Is there a problem of figuring out who might know something you don’t, if Holden insists that there’s this strange new stuff called ‘marginal utility’ you ought to learn about? Yes, there is. But is someone who trusts their philanthropic dollars to be steered just by the warm fuzzies of their heart, doing something wrong? Yes, they are. It’s one thing to say that SIAI isn’t known-to-you to be doing it right, another thing still to say that SIAI is known-to-you to be doing it wrong, and then quite another thing entirely to say that there’s no need for Friendly AI programmers and you know it, that anyone can see it without resorting to math or cracking a copy of AI: A Modern Approach. I do wish that Holden would at least credit that the task SIAI is taking on contains at least as many gotchas, relative to the instinctive approach, as optimal philanthropy compared to instinctive philanthropy, and might likewise benefit from some full-time professionally specialized attention, just as our society creates trained professionals to handle any other problem that someone actually cares about getting right.
On the other side of things, Holden says that even if Friendly AI is proven and checked:
“I believe that the probability of an unfavorable outcome—by which I mean an outcome essentially equivalent to what a UFAI would bring about—exceeds 90% in such a scenario.”
It’s nice that this appreciates that the problem is hard. Associating all of the difficulty with agenty proposals and thinking that it goes away as soon as you invoke tooliness is, well, of this I’ve already spoken. I’m not sure whether this irreducible-90%-doom assessment is based on a common straw version of FAI where all the work of the FAI programmer goes into “proving” something and doing this carefully checked proof which then (alas, poor Spock!) turns out to be no more relevant than proving that the underlying CPU does floating-point arithmetic correctly if the transistors work as stated. I’ve repeatedly said that the idea behind proving determinism of self-modification isn’t that this guarantees safety, but that if you prove the self-modification stable the AI might work, whereas if you try to get by with no proofs at all, doom is guaranteed. My mind keeps turning up Ben Goertzel as the one who invented this caricature (“Don’t you understand, poor fool Eliezer, life is full of uncertainty, your attempt to flee from it by refuge in ‘mathematical proof’ is doomed”), but I’m not sure he was actually the inventor. In any case, the burden of safety isn’t carried just by the proof, it’s carried mostly by proving the right thing. If Holden is assuming that we’re just running away from the inherent uncertainty of life by taking refuge in mathematical proof, then, yes, 90% probability of doom is an understatement: the vast majority of plausible-on-first-glance goal criteria you can prove stable will also kill you.
If Holden’s assessment does take into account a great effort to select the right theorem to prove—and attempts to incorporate the difficult but finitely difficult feature of meta-level error-detection, as it appears in e.g. the CEV proposal—and he is still assessing 90% doom probability, then I must ask, “What do you think you know and how do you think you know it?” The complexity of the human mind is finite; there’s only so many things we want or would-want. Why would someone claim to know that proving the right thing is beyond human ability, even if “100 of the world’s most intelligent and relevantly experienced people” (Holden’s terms) check it over? There’s hidden complexity of wishes, but not infinite complexity of wishes or unlearnable complexity of wishes. There are deep and subtle gotchas but not an unending number of them. And if that were the setting of the hidden variables—how would you end up knowing that with 90% probability in advance? I don’t mean to wield my own ignorance as a sword or engage in motivated uncertainty—I hate it when people argue that if they don’t know something, nobody else is allowed to know either—so please note that I’m also counterarguing from positive facts pointing the other way: the human brain is complicated but not infinitely complicated, there are hundreds or thousands of cytoarchitecturally distinct brain areas but not trillions or googols. If humanity had two hundred years to solve FAI using human-level intelligence and there was no penalty for guessing wrong I would be pretty relaxed about the outcome. If Holden says there’s 90% doom probability left over no matter what sane intelligent people do (all of which goes away if you just build Google Maps AGI, but leave that aside for now) I would ask him what he knows now, in advance, that all those sane intelligent people will miss. I don’t see how you could (well-justifiedly) access that epistemic state.
I acknowledge that there are points in Holden’s post which are not addressed in this reply, acknowledge that these points are also deserving of reply, and hope that other SIAI personnel will be able to reply to them.
My summary (now with endorsement by Eliezer!):
SI can be a valuable organization even if Tool AI turns out to be the right approach:
Skills/organizational capabilities for safe Tool AI are similar to those for Friendly AI.
EY seems to imply that much of SI’s existing body of work can be reused.
Offhand remark that seemed important: Superintelligent Tool AI would be more difficult since it would have to be developed in a way that it would not recursively self-improve.
Tool AI is nontrivial:
The number of possible plans is way too large for an AI to realistically evaluate all of them. Heuristics will have to be used to find suboptimal but promising plans.
The reasoning behind the plan the AI chooses might be way beyond the comprehension of the user. It’s not clear how best to deal with this, given that the AI is only approximating the user’s wishes and can’t really be trusted to choose plans without supervision.
Constructing a halfway decent approximation of the user’s utility function and having a model good enough to make plans with are also far from solved problems.
Potential Tool AI gotcha: The AI might give you a self-fulfilling negative prophecy that the AI didn’t realize would harm you.
These are just examples. Point is, saying “but the AI will just do this!” is far removed from specifying the AI in a rigorous formal way and proving it will do that.
Tool AI is not obviously the way AGI should or will be developed:
Many leading AGI thinkers have their own pet idea about what AGI should do. Few to none endorse Tool AI. If it were obvious, all the leading AGI thinkers would endorse it.
Actually, most modern AI applications don’t involve human input, so it’s not obvious that AGI will develop along Tool AI lines.
Full-time Friendliness researchers are worth having:
If nothing else, they’re useful for evaluating proposals like Holden’s Tool AI one to figure out if they are really sound.
Friendliness philosophy would be difficult to program an AI to do. Even if we thought we had a program that could do it, how would we know the answers from that program were correct? So we probably need humans.
Friendliness researchers need to have a broader domain of expertise than Holden gives them credit for. They need to have expertise in whatever happens to be necessary to ensure safe AI.
The problems of Friendliness are tricky, so laypeople should beware of jumping to conclusions about Friendliness.
Holden’s estimate of a 90% chance of doom even given a 100 person FAI team approving the design is overly pessimistic:
EY is aware it’s extremely difficult to know what properties about a prospective FAI need to be formally proved, and plans to put a lot of effort into figuring this out.
The difficulty of Friendliness is finite. The difficulties are big and subtle, but not unending.
Where did 90% come from? Lots of uncertainty here...
Holden made other good points not addressed here.
This point seems missing:
A system that undertakes extended processes of research and thinking, generating new ideas and writing new programs for internal experiments, seems both much more effective and much more potentially risky than something like a chess program with a simple fixed algorithm to search using a fixed narrow representation of the world (as a chess board).
Looks pretty good, actually. Nice.
So you wrote 10x too much then?
How do we know that the problem is finite? When it comes to proving a computer program safe from being hacked the problem is considered NP-hard. Google Chrome was recently hacked by chaining 14 different bugs together. A working AGI is probably at least as complex as Google Chrome. Proving it safe will likely also be NP-hard.
Google Chrome doesn’t even self modify.
I’m not really sure what’s meant by this.
For example, in computer vision, you can input an image and get a classification as output. The input is supplied by a human. The computation doesn’t involve the human. The output is well defined. The same could be true of a tool AI that makes predictions.
Both Andrew Ng and Jeff Hawkins think that tool AI is the most likely approach.
I would consider 3 to be a few.
That is about how I read it.
When I read posts like this I feel like an independent everyman watching a political debate.
The dialogue is oversimplified and even then I don’t fully grasp exactly what’s being said and the implications thereof, so I can almost feel my opinion shifting back and forth with each point that sounds sort of, kinda, sensible when I don’t really have the capacity to judge the statements. I should probably try and fix that.
The analogy is apt: blue-vs.-green politics aren’t the only kind of politics, and debates over singularity policy have had big mind-killing effects on otherwise-pretty-rational LW folk before.
The core points don’t strike me as being inherently difficult or technical, although Eliezer uses some technical examples.
That’s precisely the problem, given that Eliezer is arguing that a technical appreciation of difficult problems is necessary to judge correctly on this issue. My understanding, like pleeppleep’s, is limited to the simplified level given here, which means I’m reduced to giving weight to presentation and style and things being “kinda sensible”.
Hello,
I appreciate the thoughtful response. I plan to respond at greater length in the future, both to this post and to some other content posted by SI representatives and commenters. For now, I wanted to take a shot at clarifying the discussion of “tool-AI” by discussing AIXI. One of the issues I’ve found with the debate over FAI in general is that I haven’t seen much in the way of formal precision about the challenge of Friendliness (I recognize that I have also provided little formal precision, though I feel the burden of formalization is on SI here). It occurred to me that AIXI might provide a good opportunity to have a more precise discussion, if in fact it is believed to represent a case of “a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button.”
So here’s my characterization of how one might work toward a safe and useful version of AIXI, using the “tool-AI” framework, if one could in fact develop an efficient enough approximation of AIXI to qualify as a powerful AGI. Of course, this is just a rough outline of what I have in mind, but hopefully it adds some clarity to the discussion.
A. Write a program that
1. Computes an optimal policy, using some implementation of equation (20) on page 22 of http://www.hutter1.net/ai/aixigentle.pdf
2. “Prints” the policy in a human-readable format (using some fixed algorithm for “printing” that is not driven by a utility function)
3. Provides tools for answering user questions about the policy, i.e., “What will be its effect on ___?” (using some fixed algorithm for answering user questions that makes use of AIXI’s probability function, and is not driven by a utility function)
4. Does not contain any procedures for “implementing” the policy, only for displaying it and its implications in human-readable form
B. Run the program; examine its output using the tools described above (#2 and #3); if, upon such examination, the policy appears potentially destructive, continue tweaking the program (for example, by tweaking the utility it is selecting a policy to maximize) until the policy appears safe and desirable
C. Implement the policy using tools other than the AIXI agent
D. Repeat (B) and (C) until one has confidence that the AIXI agent reliably produces safe and desirable policies, at which point more automation may be called for
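To make the shape of steps (A) through (D) a little more concrete, here is a toy sketch in Python. Everything in it is hypothetical and chosen only for illustration: the three-action world model, the names `compute_policy`, `print_policy` and `effect_on`, and the brute-force search that stands in for equation (20). It is not AIXI or an approximation of AIXI; it only shows a program that computes and displays a policy while containing no code to carry it out.

```python
from itertools import product

# Hypothetical toy world model: each action has a fixed effect on two variables.
ACTIONS = {
    "wait": {"energy": 0, "output": 0},
    "work": {"energy": -1, "output": 2},
    "rest": {"energy": 1, "output": 0},
}
START = {"energy": 1, "output": 0}


def simulate(policy, start=START):
    """Predict the final state if the action sequence were carried out."""
    state = dict(start)
    for action in policy:
        for variable, delta in ACTIONS[action].items():
            state[variable] += delta
    return state


def utility(state):
    """The knob that step (B) tweaks: reward output, penalize ending with negative energy."""
    return state["output"] - 10 * max(0, -state["energy"])


def compute_policy(horizon=3):
    """Step A.1: brute-force search over action sequences (a stand-in, not equation (20))."""
    return max(product(ACTIONS, repeat=horizon), key=lambda p: utility(simulate(p)))


def print_policy(policy):
    """Step A.2: fixed formatting, not driven by any utility function."""
    return " -> ".join(policy)


def effect_on(policy, variable):
    """Step A.3: answer 'What will be its effect on ___?' from the model alone."""
    return simulate(policy)[variable] - START[variable]


# Step A.4: there is deliberately no function here that executes a policy.

if __name__ == "__main__":
    policy = compute_policy()
    print(print_policy(policy))
    print("effect on output:", effect_on(policy, "output"))
    print("effect on energy:", effect_on(policy, "energy"))
```

The toy gets its readability for free, because the world model was written by hand in human terms. Whether a real system's model could be queried this way is exactly the open question.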
My claim is that this approach would be superior to that of trying to develop “Friendliness theory” in advance of having any working AGI, because it would allow experiment- rather than theory-based development. Eliezer, I’m interested in your thoughts about my claim. Do you agree? If not, where is our disagreement?
Didn’t see this at the time, sorry.
So… I’m sorry if this reply seems a little unhelpful, and I wish there was some way to engage more strongly, but...
Point (1) is the main problem. AIXI updates freely over a gigantic range of sensory predictors with no specified ontology—it’s a sum over a huge set of programs, and we, the users, have no idea what the representations are talking about, except that at the end of their computations they predict, “You will see a sensory 1 (or a sensory 0).” (In my preferred formalism, the program puts a probability on a 0 instead.) Inside, the program could’ve been modeling the universe in terms of atoms, quarks, quantum fields, cellular automata, giant moving paperclips, slave agents scurrying around… we, the programmers, have no idea how AIXI is modeling the world and producing its predictions, and indeed, the final prediction could be a sum over many different representations.
This means that equation (20) in Hutter is written as a utility function over sense data, where the reward channel is just a special case of sense data. We can easily adapt this equation to talk about any function computed directly over sense data—we can get AIXI to optimize any aspect of its sense data that we please. We can’t get it to optimize a quality of the external universe. One of the challenges I listed in my FAI Open Problems talk, and one of the problems I intend to talk about in my FAI Open Problems sequence, is to take the first nontrivial steps toward adapting this formalism—to e.g. take an equivalent of AIXI in a really simple universe, with a really simple goal, something along the lines of a Life universe and a goal of making gliders, and specify something given unlimited computing power which would behave like it had that goal, without pre-fixing the ontology of the causal representation to that of the real universe, i.e., you want something that can range freely over ontologies in its predictive algorithms, but which still behaves like it’s maximizing an outside thing like gliders instead of a sensory channel like the reward channel. This is an unsolved problem!
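A toy illustration of the difference, in Python, with every name and number hypothetical: three actions, each with a hand-written true glider count and a hand-written sensor reading. The sketch cheats in precisely the way the open problem forbids, because `world_utility` is handed a labeled "gliders" variable, whereas the real formalism exposes only the sense data.

```python
# Hypothetical toy outcomes: what is true in the world vs. what the sensor reports.
OUTCOMES = {
    "build_glider": {"gliders": 1, "sensor_reading": 1},
    "do_nothing": {"gliders": 0, "sensor_reading": 0},
    "hack_sensor": {"gliders": 0, "sensor_reading": 99},
}


def sense_utility(outcome):
    """Utility computed directly over sense data (the only kind the formalism offers)."""
    return outcome["sensor_reading"]


def world_utility(outcome):
    """Utility over a quality of the external world (assumes a pre-fixed ontology)."""
    return outcome["gliders"]


def best_action(utility_fn):
    """Pick the action whose outcome scores highest under the given utility."""
    return max(OUTCOMES, key=lambda action: utility_fn(OUTCOMES[action]))


if __name__ == "__main__":
    print("maximizing sense data:", best_action(sense_utility))
    print("maximizing gliders:   ", best_action(world_utility))
```

Under these made-up numbers the sense-data maximizer prefers to tamper with its sensor, while the world-utility maximizer builds the glider. The second function works only because the ontology was fixed in advance.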
We haven’t even got to the part where it’s difficult to say in formal terms how to interpret what a human says s/he wants the AI to plan, and where failures of phrasing of that utility function can also cause a superhuman intelligence to kill you. We haven’t even got to the huge buried FAI problem inside the word “optimal” in point (1), which is the really difficult part in the whole thing. Because so far we’re dealing with a formalism that can’t even represent a purpose of the type you’re looking for—it can only optimize over sense data, and this is not a coincidental fact, but rather a deep problem which the AIXI formalism deliberately avoided.
(2) sounds like you think an AI with an alien, superhuman planning algorithm can tell humans what to do without ever thinking consequentialistically about which different statements will result in human understanding or misunderstanding. Anna says that I need to work harder on not assuming other people are thinking silly things, but even so, when I look at this, it’s hard not to imagine that you’re modeling AIXI as a sort of spirit containing thoughts, whose thoughts could be exposed to the outside with a simple exposure-function. It’s not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard. And with AIXI it would be impossible, because AIXI’s model of the world ranges over literally all possible ontologies and representations, and its plans are naked motor outputs.
Similar remarks apply to interpreting and answering “What will be its effect on _?” It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here. In particular, you don’t want your language for describing “effects” to partition, as the same state of described affairs, any two states which humans assign widely different utilities. Let’s say there are two plans for getting my grandmother out of a burning house, one of which destroys her music collection, one of which leaves it intact. Does the AI know that music is valuable? If not, will it not describe music-destruction as an “effect” of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone’s music collection? If you then say that the AI should describe changes to files in general, well, should it also talk about changes to its own internal files? Every action comes with a huge number of consequences—if we hear about all of them (reality described on a level so granular that it automatically captures all utility shifts, as well as a huge number of other unimportant things) then we’ll be there forever.
I wish I had something more cooperative to say in reply—it feels like I’m committing some variant of logical rudeness by this reply—but the truth is, it seems to me that AIXI isn’t a good basis for the agent you want to describe; and I don’t know how to describe it formally myself, either.
Thanks for the response. To clarify, I’m not trying to point to the AIXI framework as a promising path; I’m trying to take advantage of the unusually high degree of formalization here in order to gain clarity on the feasibility and potential danger points of the “tool AI” approach.
It sounds to me like your two major issues with the framework I presented are (to summarize):
(1) There is a sense in which AIXI predictions must be reducible to predictions about the limited set of inputs it can “observe directly” (what you call its “sense data”).
(2) Computers model the world in ways that can be unrecognizable to humans; it may be difficult to create interfaces that allow humans to understand the implicit assumptions and predictions in their models.
I don’t claim that these problems are trivial to deal with. And stated as you state them, they sound abstractly very difficult to deal with. However, it seems true—and worth noting—that “normal” software development has repeatedly dealt with them successfully. For example: Google Maps works with a limited set of inputs; Google Maps does not “think” like I do and I would not be able to look at a dump of its calculations and have any real sense for what it is doing; yet Google Maps does make intelligent predictions about the external universe (e.g., “following direction set X will get you from point A to point B in reasonable time”), and it also provides an interface (the “route map”) that helps me understand its predictions and the implicit reasoning (e.g. “how, why, and with what other consequences direction set X will get me from point A to point B”).
Difficult though it may be to overcome these challenges, my impression is that software developers have consistently—and successfully—chosen to take them on, building algorithms that can be “understood” via interfaces and iterated over—rather than trying to prove the safety and usefulness of their algorithms with pure theory before ever running them. Not only does the former method seem “safer” (in the sense that it is less likely to lead to putting software in production before its safety and usefulness has been established) but it seems a faster path to development as well.
It seems that you see a fundamental disconnect between how software development has traditionally worked and how it will have to work in order to result in AGI. But I don’t understand your view of this disconnect well enough to see why it would lead to a discontinuation of the phenomenon I describe above. In short, traditional software development seems to have an easier (and faster and safer) time overcoming the challenges of the “tool” framework than overcoming the challenges of up-front theoretical proofs of safety/usefulness; why should we expect this to reverse in the case of AGI?
So first a quick note: I wasn’t trying to say that the difficulties of AIXI are universal and everything goes analogously to AIXI, I was just stating why AIXI couldn’t represent the suggestion you were trying to make. The general lesson to be learned is not that everything else works like AIXI, but that you need to look a lot harder at an equation before thinking that it does what you want.
On a procedural level, I worry a bit that the discussion is trying to proceed by analogy to Google Maps. Let it first be noted that Google Maps simply is not playing in the same league as, say, the human brain, in terms of complexity; and that if we were to look at the winning “algorithm” of the million-dollar Netflix Prize competition, which was in fact a blend of 107 different algorithms, you would have a considerably harder time figuring out why it claimed anything it claimed.
But to return to the meta-point, I worry about conversations that go into “But X is like Y, which does Z, so X should do reinterpreted-Z”. Usually, in my experience, that goes into what I call “reference class tennis” or “I’m taking my reference class and going home”. The trouble is that there’s an unlimited number of possible analogies and reference classes, and everyone has a different one. I was just browsing old LW posts today (to find a URL of a quick summary of why group-selection arguments don’t work in mammals) and ran across a quotation from Perry Metzger to the effect that so long as the laws of physics apply, there will always be evolution, hence nature red in tooth and claw will continue into the future—to him, the obvious analogy for the advent of AI was “nature red in tooth and claw”, and people who see things this way tend to want to cling to that analogy even if you delve into some basic evolutionary biology with math to show how much it isn’t like intelligent design. For Robin Hanson, the one true analogy is to the industrial revolution and farming revolutions, meaning that there will be lots of AIs in a highly competitive economic situation with standards of living tending toward the bare minimum, and this is so absolutely inevitable and consonant with The Way Things Should Be as to not be worth fighting at all. That’s his one true analogy and I’ve never been able to persuade him otherwise. For Kurzweil, the fact that many different things proceed at a Moore’s Law rate to the benefit of humanity means that all these things are destined to continue and converge into the future, also to the benefit of humanity. For him, “things that go by Moore’s Law” is his favorite reference class.
I can have a back-and-forth conversation with Nick Bostrom, who looks much more favorably on Oracle AI in general than I do, because we’re not playing reference class tennis with “But surely that will be just like all the previous X-in-my-favorite-reference-class”, nor saying, “But surely this is the inevitable trend of technology”; instead we lay out particular, “Suppose we do this?” and try to discuss how it will work, not with any added language about how surely anyone will do it that way, or how it’s got to be like Z because all previous Y were like Z, etcetera.
My own FAI development plans call for trying to maintain programmer-understandability of some parts of the AI during development. I expect this to be a huge headache, possibly 30% of total headache, possibly the critical point on which my plans fail, because it doesn’t happen naturally. Go look at the source code of the human brain and try to figure out what a gene does. Go ask the Netflix Prize winner for a movie recommendation and try to figure out “why” it thinks you’ll like watching it. Go train a neural network and then ask why it classified something as positive or negative. Try to keep track of all the memory allocations inside your operating system—that part is humanly understandable, but it flies past so fast you can only monitor a tiny fraction of what goes on, and if you want to look at just the most “significant” parts, you would need an automated algorithm to tell you what’s significant. Most AI algorithms are not humanly understandable. Part of Bayesianism’s appeal in AI is that Bayesian programs tend to be more understandable than non-Bayesian AI algorithms. I have hopeful plans to try and constrain early FAI content to humanly comprehensible ontologies, prefer algorithms with humanly comprehensible reasons-for-outputs, carefully weigh up which parts of the AI can safely be less comprehensible, monitor significant events, slow down the AI so that this monitoring can occur, and so on. That’s all Friendly AI stuff, and I’m talking about it because I’m an FAI guy. I don’t think I’ve ever heard any other AGI project express such plans; and in mainstream AI, human-comprehensibility is considered a nice feature, but rarely a necessary one.
It should finally be noted that AI famously does not result from generalizing normal software development. If you start with a map-route program and then try to program it to plan more and more things until it becomes an AI… you’re doomed, and all the experienced people know you’re doomed. I think there’s an entry or two in the old Jargon File aka Hacker’s Dictionary to this effect. There’s a qualitative jump to writing a different sort of software—from normal programming where you create a program conjugate to the problem you’re trying to solve, to AI where you try to solve cognitive-science problems so the AI can solve the object-level problem. I’ve personally met a programmer or two who’ve generalized their code in interesting ways, and who feel like they ought to be able to generalize it even further until it becomes intelligent. This is a famous illusion among aspiring young brilliant hackers who haven’t studied AI. Machine learning is a separate discipline and involves algorithms and problems that look quite different from “normal” programming.
Thanks for the response. My thoughts at this point are that
We seem to have differing views of how to best do what you call “reference class tennis” and how useful it can be. I’ll probably be writing about my views more in the future.
I find it plausible that AGI will have to follow a substantially different approach from “normal” software. But I’m not clear on the specifics of what SI believes those differences will be and why they point to the “proving safety/usefulness before running” approach over the “tool” approach.
We seem to have differing views of how frequently today’s software can be made comprehensible via interfaces. For example, my intuition is that the people who worked on the Netflix Prize algorithm had good interfaces for understanding “why” it recommends what it does, and used these to refine it. I may further investigate this matter (casually, not as a high priority); on SI’s end, it might be helpful (from my perspective) to provide detailed examples of existing algorithms for which the “tool” approach to development didn’t work and something closer to “proving safety/usefulness up front” was necessary.
Canonical software development examples emphasizing “proving safety/usefulness before running” over the “tool” software development approach are cryptographic libraries and NASA space shuttle navigation.
At the time of writing this comment, there was recent furor over software called CryptoCat that didn’t provide enough warnings that it was not properly vetted by cryptographers and thus should have been assumed to be inherently insecure. Conventional wisdom and repeated warnings from the security community state that cryptography is extremely difficult to do properly and that attempting to create your own may lead to catastrophic results. A similar thought and development process goes into space shuttle code.
It seems that the FAI approach to “proving safety/usefulness” is more similar to the way cryptographic algorithms are developed than the (seemingly) much faster “tool” approach, which is more akin to web development where the stakes aren’t quite as high.
EDIT: I believe the “prove” approach still allows one to run snippets of code in isolation, but tends to shy away from running everything end-to-end until significant effort has gone into individual component testing.
The analogy with cryptography is an interesting one, because...
In cryptography, even after you’ve proven that a given encryption scheme is secure, and that proof has been centuply (100 times) checked by different researchers at different institutions, it might still end up being insecure, for many reasons.
Examples of reasons include:
The proof assumed mathematical integers/reals, of which computer integers/floating point numbers are just an approximation.
The proof assumed that the hardware the algorithm would be running on was reliable (e.g. a reliable source of randomness).
The proof assumed operations were mathematical abstractions that exist outside of time, and thus neglected side-channel attacks, which measure how long a physical real-world CPU took to execute the algorithm in order to make inferences about what the algorithm did (and thus recover the private keys).
The proof assumed the machine executing the algorithm was idealized in various ways, when in fact a CPU emits heat and other electromagnetic waves, which can be detected and from which inferences can be drawn, etc.
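Two of these gaps are easy to show in a few lines of Python. This is a minimal sketch, not taken from any particular cryptographic library: `naive_equal` and `constant_time_equal` are hypothetical helper names, and `hmac.compare_digest` is the standard-library comparison documented as avoiding content-based short-circuiting.

```python
import hmac

# Gap 1: mathematical reals vs. floating point. In exact arithmetic
# 0.1 + 0.2 equals 0.3; in IEEE 754 double precision it does not.
floats_agree = (0.1 + 0.2) == 0.3


def naive_equal(a: bytes, b: bytes) -> bool:
    """Correct as an abstraction, but returns at the first mismatching byte,
    so its running time can leak how long the matching prefix is."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Same input/output behavior, delegating to a comparison designed
    not to short-circuit on content."""
    return hmac.compare_digest(a, b)


if __name__ == "__main__":
    print("0.1 + 0.2 == 0.3 ?", floats_agree)
    print(naive_equal(b"secret", b"secreT"), constant_time_equal(b"secret", b"secreT"))
```

Both comparison functions return the same answers, so a proof about their outputs could not tell them apart. The difference lies only in physical running time, which such a proof does not model.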
That’s one way to “win” a game of reference class tennis. Declare unilaterally that what you are discussing falls into the reference class “things that are most effectively reasoned about by discussing low level details and abandoning or ignoring all observed evidence about how things with various kinds of similarity have worked in the past”. Sure, it may lead to terrible predictions sometimes but by golly, it means you can score an ‘ace’ in the reference class tennis while pretending you are not even playing!
And atheism is a religion, and bald is a hair color.
The three distinguishing characteristics of “reference class tennis” are (1) that there are many possible reference classes you could pick and everyone engaging in the tennis game has their own favorite which is different from everyone else’s; (2) that the actual thing is obviously more dissimilar to all the cited previous elements of the so-called reference class than all those elements are similar to each other (if they even form a natural category at all rather than having been picked out retrospectively based on similarity of outcome to the preferred conclusion); and (3) that the citer of the reference class says it with a cognitive-traffic-signal quality which attempts to shut down any attempt to counterargue the analogy because “it always happens like that” or because we have so many alleged “examples” of the “same outcome” occurring (for Hansonian rationalists this is accompanied by a claim that what you are doing is the “outside view” (see points (2) and (1) for why it’s not) and that it would be bad rationality to think about the “individual details”).
I have also termed this Argument by Greek Analogy after Socrates’s attempt to argue that, since the Sun appears the next day after setting, souls must be immortal.
For the curious, this is from the Phaedo, pages 70-72. The run of the argument is basically thus:
P1 Natural changes are changes from and to opposites, like hot from relatively cold, etc.
P2 Since every change is between opposites A and B, there are two logically possible processes of change, namely A to B and B to A.
P3 If only one of the two processes were physically possible, then we should expect to see only one of the two opposites in nature, since the other will have passed away irretrievably.
P4 Life and death are opposites.
P5 We have experience of the process of death.
P6 We have experience of things which are alive.
C From P3, 4, 5, and 6 there is a physically possible, and actual, process of going from death to life.
The argument doesn’t itself prove (haha) the immortality of the soul, only that living things come from dead things. The argument is made in support of the claim, made prior to this argument, that if living people come from dead people, then dead people must exist somewhere. The argument is particularly interesting for premises 1 and 2, which are hard to deny, and 3, which seems fallacious but for non-obvious reasons.
This sounds like it might be a bit of a reverent-Western-scholar steelman such as might be taught in modern philosophy classes; Plato’s original argument for the immortality of the soul sounded more like this, which is why I use it as an early exemplar of reference class tennis:
-
Then let us consider the whole question, not in relation to man only, but in relation to animals generally, and to plants, and to everything of which there is generation, and the proof will be easier. Are not all things which have opposites generated out of their opposites? I mean such things as good and evil, just and unjust—and there are innumerable other opposites which are generated out of opposites. And I want to show that in all opposites there is of necessity a similar alternation; I mean to say, for example, that anything which becomes greater must become greater after being less.
True.
And that which becomes less must have been once greater and then have become less.
Yes.
And the weaker is generated from the stronger, and the swifter from the slower.
Very true.
And the worse is from the better, and the more just is from the more unjust.
Of course.
And is this true of all opposites? and are we convinced that all of them are generated out of opposites?
Yes.
And in this universal opposition of all things, are there not also two intermediate processes which are ever going on, from one to the other opposite, and back again; where there is a greater and a less there is also an intermediate process of increase and diminution, and that which grows is said to wax, and that which decays to wane?
Yes, he said.
And there are many other processes, such as division and composition, cooling and heating, which equally involve a passage into and out of one another. And this necessarily holds of all opposites, even though not always expressed in words—they are really generated out of one another, and there is a passing or process from one to the other of them?
Very true, he replied.
Well, and is there not an opposite of life, as sleep is the opposite of waking?
True, he said.
And what is it?
Death, he answered.
And these, if they are opposites, are generated the one from the other, and have there their two intermediate processes also?
Of course.
Now, said Socrates, I will analyze one of the two pairs of opposites which I have mentioned to you, and also its intermediate processes, and you shall analyze the other to me. One of them I term sleep, the other waking. The state of sleep is opposed to the state of waking, and out of sleeping waking is generated, and out of waking, sleeping; and the process of generation is in the one case falling asleep, and in the other waking up. Do you agree?
I entirely agree.
Then, suppose that you analyze life and death to me in the same manner. Is not death opposed to life?
Yes.
And they are generated one from the other?
Yes.
What is generated from the living?
The dead.
And what from the dead?
I can only say in answer—the living.
Then the living, whether things or persons, Cebes, are generated from the dead?
That is clear, he replied.
Then the inference is that our souls exist in the world below?
That is true.
(etc.)
That was roughly my aim, but I don’t think I inserted any premises that weren’t there. Did you have a complaint about the accuracy of my paraphrase? The really implausible premise there, namely that death is the opposite of life, is preserved I think.
As for reverence, why not? He was, after all, the very first person in our historical record to suggest that thinking better might make you happier. He was also an intellectualist about morality, at least sometimes a hedonic utilitarian, and held no great respect for logic. And he was a skilled myth-maker. He sounds like a man after your own heart, actually.
I think your summary didn’t leave anything out, or even apply anything particularly charitable.
Esar’s summary doesn’t seem to be different from this, other than 1) adding the useful bit about “passed away irretrievably” and 2) yours makes it clear that the logical jump happens right at the end.
I’m actually not sure now why you consider this like “reference class tennis”. The argument looks fine, except for the part where “souls exist in the world below” jumps in as a conclusion, not having been mentioned earlier in the argument.
The ‘souls exist in the world below’ bit is directly before what Eliezer quoted:
But you’re right that nothing in the argument defends the idea of a world below, just that souls must exist in some way between bodies.
The argument omits that living things can come from living things and dead things from dead things.
Therefore, the fact that living things can come from dead things does not mean that they have to in every case.
Although, if everything started off dead, they must have at some point.
So it’s an argument for abiogenesis.
Not even that, at least in the part of the argument I’ve seen (paraphrased?) above.
He just mentions an ancient doctrine, and then claims that souls must exist somewhere while they’re not embodied, because he can’t imagine where they would come from otherwise. I’m not even sure if the ancient doctrine is meant as argument from authority or is just some sort of Chewbacca defense.
(He doesn’t seem to explicitly claim the “ancient doctrine” to be true or plausible, just that it came to his mind. It feels like I’ve lost something in the translation.)
Ok, it seems like under this definition of “reference class tennis” (particularly parts (2) and (3)) the participants must be wrong and behaving irrationally about it in order to be playing reference class tennis. So when they are either right or at least applying “outside view” considerations correctly given all the information available to them, they aren’t actually playing “reference class tennis” but instead doing whatever it is that reasoning (boundedly) correctly using reference to actual relevant evidence about related occurrences is called when it isn’t packaged with irrational wrongness.
With this definition in mind it is necessary to translate replies such as those here by Holden:
Holden’s meaning is, of course, not that that he argues is actually a good thing but rather declaring that the label doesn’t apply to what he is doing. He is instead doing that other thing that is actually sound thinking and thinks people are correct to do so.
Come to think of it if most people in Holden’s shoes heard Eliezer accuse them of “reference class tennis” and actually knew that he intended it with the meaning he explicitly defines here rather than the one they infer from context they would probably just consider him arrogant, rude and mind killed then write him and his organisation off as not worth engaging with.
In the vast majority of cases where I have previously seen Eliezer argue against people using “outside view” I have agreed with Eliezer, and have grown rather fond of using the phrase “reference class tennis” as a reply myself where appropriate. But seeing how far Eliezer has taken the anti-outside-view position here and the extent to which “reference class tennis” is defined as purely an anti-outside-view semantic stop sign, I’ll be far more hesitant to make use of it myself.
It is tempting to observe “Eliezer is almost always right when he argues against ‘outside view’ applications, and the other people are all confused. He is currently arguing against ‘outside view’ applications. Therefore, the other people are probably confused.” To that I reply either “Reference class tennis!” or “F*$% you, I’m right and you’re wrong!” (I’m honestly not sure which is the least offensive.)
Which of 1, 2 and 3 do you disagree with in this case?
Edit: I mean, I’m sorry to parody but I don’t really want to carefully rehash the entire thing, so, from my perspective, Holden just said, “But surely strong AI will fall into the reference class of technology used to give users advice, just like Google Maps doesn’t drive your car; this is where all technology tends to go, so I’m really skeptical about discussing any other possibility.” Only Holden has argued to SI that strong AI falls into this particular reference class so far as I can recall, with many other people having their own favored reference classes e.g. Hanson et al. as cited above; a strong AI is far more internally dissimilar from Google Maps and Yelp than Google Maps and Yelp are internally similar to each other, plus there are many many other software programs that don’t provide advice at all so arguably the whole class may be chosen-post-facto; and I’d have to look up Holden’s exact words and replies to e.g. Jaan Tallinn to decide to what degree, if any, he used the analogy to foreclose other possibilities conversationally without further debate, but I do think it happened a little, but less so and less explicitly than in my Robin Hanson debate. If you don’t think I should at this point diverge into explaining the concept of “reference class tennis”, how should the conversation proceed further?
Also, further opinions desired on whether I was being rude, whether logically rude or otherwise.
Viewed charitably, you were not being rude, although you did veer away from your main point in ways likely to be unproductive. (For example, being unnecessarily dismissive towards Hanson, who you’d previously stated had given arguments roughly as good as Holden’s; or spending so much of your final paragraph emphasizing Holden’s lack of knowledge regarding AI.)
On the most likely viewing, it looks like you thought Holden was probably playing reference class tennis. This would have been rude, because it would imply that you thought the following inaccurate things about him:
He was “taking his reference class and going home”
That you can’t “have a back-and-forth conversation” with him
I don’t think that you intended those implications. All the same, your final comment came across as noticeably less well-written than your post.
Thanks for the third-party opinion!
I’m confused how you thought “reference class tennis” was anything but a slur on the other side’s argument. Likewise “mindkilled.” Sometimes, slurs about arguments are justified (agnostic in the instant case) - but that’s a separate issue.
Do Karnofsky’s contributions have even one of these characteristics, let alone all of them?
Empirically obviously 1 is true, I would argue strongly for 2 but it’s a legitimate point of dispute, and I would say that there were relatively small but still noticeable but quite forgivable traces of 3.
Then it does seem like your AI arguments are playing reference class tennis with a reference class of “conscious beings”. For me, the force of the Tool AI argument is that there’s no reason to assume that AGI is going to behave like a sci-fi character. For example, if something like On Intelligence turns out to be true, I think the algorithms it describes will be quite generally intelligent but hardly capable of rampaging through the countryside. It would be much more like Holden’s Tool AI: you’d feed it data, it’d make predictions, you could choose to use the predictions.
(This is, naturally, the view of that school of AI implementers. Scott Brown: “People often seem to conflate having intelligence with having volition. Intelligence without volition is just information.”)
Your prospective AI plans for programmer-understandability seems very close to Starmap-AI by which I mean
The best story I’ve read about a not so failed utopia involves this kind of accountability over the FAI. While I hate to generalize from fictional evidence it definitely seems like a necessary step to not becoming a galaxy that tiles over the aliens with happy faces instead of just freezing them in place to prevent human harm.
Explaining routes is domain specific and quite simple. When you are using domain specific techniques to find solutions to domain specific problems, you can use domain specific interfaces where human programmers and designers do all the heavy lifting to figure out the general strategy of how to communicate to the user.
But if you want a tool AGI that finds solutions in arbitrary domains, you need a cross domain solution for communicating tool AGI’s plans to the user. This is as much a harder problem than showing a route on a map, as cross domain AGI is a harder problem than computing the routes. Instead of the programmer figuring out how to plot road tracing curves on a map, the programmer has to figure out how to get the computer to figure out that displaying a map with route traced over it is a useful thing to do, in a way that generalizes figuring out other useful things to do to communicate answers to other types of questions. And among the hard subproblems of programming computers to find useful things to do in general problems is specifying the meaning of “useful”. If that is done poorly, the tool AGI tries to trick the user into accepting plans that achieve some value negating distortion of what we actually want, instead of giving information that helps provide a good evaluation. Doing this right requires solving the same problems required to do FAI right.
A note on making an AIXI-based tool: instead of calculating the reward sum over the whole future (something that is simultaneously impractical, computationally expensive, and would only serve to impair performance on the task at hand), one could use the single-step reward, with 1 for the button being pressed at any time and 0 for the button never being pressed. It is still not entirely a tool, but it has a very bounded range of unintended behaviour (it is much harder to speculate about the terminator scenario). In Hutter’s paper he outlines several not-quite-intelligences before arriving at AIXI.
[edit2: also I do not believe that even with the large sum a really powerful AIXI-tl would be intelligently dangerous rather than simply clever at breaking the hardware that’s computing it. All the valid models in AIXI-tl that affect the choice of actions have to magically insert actions being probed into some kind of internal world model. The hardware that actually makes those actions, complete with sensory apparatus, is incidental; a useless power drain; a needless fire hazard endangering the precious reward pathway]
With regards to utility functions, the utility functions in the AI sense are real-valued functions taken over the world model, not functions like the number of paperclips in the world. The latter function, unsafe or safe, would be incredibly difficult or impossible to define using conventional methods. It would suffice for accelerating progress to have an algorithm that can take in an arbitrary function and find its maximum; while it would indeed seem to be “very difficult” to use that to cure cancer, it could be plugged into existing models and very quickly be used to e.g. design cellular machinery that would keep repairing the DNA alterations.
Likewise, the speculative tool that can understand the phrase ‘how to cure cancer’ and the phrase ‘what is the curing time of epoxy’ would have to pick the narrowest, least objectionable interpretation of the ‘cure cancer’ phrase merely to answer something more useful than ‘cancer is not a type of epoxy or glue, it does not cure’; it seems that not seeing killing everyone as a valid interpretation comes as a necessary consequence of the ability to process language at all.
If the past sensory data include information about the internal workings, then there will be a striking correlation between the outputs that the workings would produce on their own (for physical reasons) and the AI’s outputs. That rules out (or drives down expected utility of acting upon) all but very crazy hypotheses about how the Cartesian interaction works. Wrecking the hardware would break that correlation, and it’s not clear what the crazy hypotheses would say about that, e.g. hypotheses that some simply specified intelligence is stage-managing the inputs, or that sometimes the AIXI-tl’s outputs matter, and other times only the physical hardware matters.
Well, you can’t include the entire internal workings in the sensory data, and it can’t model a significant portion of itself as it has to try a big number of hypotheses on the model at each step, so I would not expect the very crazy hypotheses to be very elaborate and have high coverage of the internals.
If I closed my eyes and did not catch a ball, the explanation is that I did not see it coming and could not catch it, but this sentence is rife with self-references of the sort that are problematic for AIXI. The correlation between closed eyes and lack of reward can be coded into some sort of magical craziness, but if I close my eyes and not my ears and hear where the ball lands after I missed catching it, there’s the vastly simpler explanation for why I did not catch it: my hand was not in the right spot (and that works with total absence of sensorium as well). I don’t see how AIXI-tl (with very huge constants) can value its eyesight (it might have some value if there is some asymmetry in the long models, but it seems clear it would not assign the adequate, rational value to its eyesight). In my opinion there is no single unifying principle to intelligence (or none was ever found), and AIXI-tl (with very huge constants) falls way short of even a cat in many important ways.
edit: Some other thought: I am not sure that Solomonoff induction’s prior is compatible with expected utility maximization. If the expected utility imbalance between crazy models grows faster than 2^length, and I would expect it to grow faster than any computable function (if the utility is unbounded), then the actions will be determined by imbalances between crazy, ultra long models. I would not privilege the belief that it just works without some sort of formal proof or some other very good reason to think it works.
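The worry in that last edit can be written out as follows. The notation is my own shorthand rather than anything from Hutter's papers: \ell(p) is the length of program p, and U_a(p) is the utility that model p assigns to taking action a.

```latex
\mathbb{E}[U \mid a] \;\propto\; \sum_{p \,:\, p \text{ reproduces the observed data}} 2^{-\ell(p)} \, U_a(p)

% Bounded case: if |U_a(p)| \le B for every p, then by the Kraft inequality
% for prefix-free programs,
\sum_{p} 2^{-\ell(p)} \, |U_a(p)| \;\le\; B \sum_{p} 2^{-\ell(p)} \;\le\; B ,
% so the sum converges and long programs contribute little.

% Unbounded case: if |U_a(p)| \ge c \cdot 2^{\ell(p)} for infinitely many
% consistent p and some c > 0, then
2^{-\ell(p)} \, |U_a(p)| \;\ge\; c \quad \text{infinitely often},
% so the terms do not tend to zero, the series has no well-defined value,
% and any comparison between actions is decided by arbitrarily long programs.
```

This only restates the concern under those assumptions; it does not show that any particular utility function actually hits the unbounded case.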
Your question seems to be about how sentient beings in a Game of Life universe are supposed to define “gliders” to the AI.
1) If they know the true laws of their cellular automaton, they can make a UDT-ish AI that examines statements like “if this logical algorithm has such-and-such output, then my prior over starting configurations of the universe logically implies such-and-such total number of gliders”.
2) If they only know that their universe is some cellular automaton and have a prior over all possible automata, they can similarly say “maximize the number of smallest possible spaceships under the automaton rules” and give the AI some sensory channel wide enough to pin down the specific automaton with high probability.
3) If they only know what sensory experiences correspond to the existence of gliders, but don’t know what gliders are… I guess we have a problem because sensory experiences can be influenced by the AI :-(
Regarding #3: what happens given a directive like “Over there are a bunch of people who report sensory experiences of the kind I’m interested in. Figure out what differentially caused those experiences, and maximize the incidence of that.”?
(I’m not concerned with the specifics of my wording, which undoubtedly contains infinite loopholes; I’m asking about the general strategy of, when all I know is sensory experiences, referring to the differential causes of those experiences, whatever they may be. Which, yes, I would expect to include, in the case where there actually are no gliders and the recurring perception of gliders is the result of a glitch in my perceptual system, modifying my perceptual system to make such glitches more likely… but which I would not expect to include, in the case where my perceptual system is operating essentially the same way when it perceives gliders as when it perceives everything else, modifying my perceptual system to include such glitches (since such a glitch is not the differential cause of experiences of gliders in the first place.))
Let’s say you want the AI to maximize the amount of hydrogen, and you formulate the goal as “maximize the amount of the substance most likely referred to by such-and-such state of mind”, where “referred to” is cashed out however you like. Now imagine that some other substance is 10x cheaper to make than hydrogen. Then the AI could create a bunch of minds in the same state, just enough to re-point the “most likely” pointer to the new substance instead of hydrogen, leading to huge savings overall. Or it could do something even more subversive, my imagination is weak.
That’s what I was getting at, when I said a general problem with using sensory experiences as pointers is that the AI can influence sensory experiences.
Well, right, but my point is that “the thing which differentially caused the sensory experiences to which I refer” does not refer to the same thing as “the thing which would differentially cause similar sensory experiences in the future, after you’ve made your changes,” and it’s possible to specify the former rather than the latter.
The AI can influence sensory experiences, but it can’t retroactively influence sensory experiences. (Or, well, perhaps it can, but that’s a whole new dimension of subversive. Similarly, I suppose a sufficiently powerful optimizer could rewrite the automaton rules in case #2, so perhaps we have a similar problem there as well.)
You need to describe the sensory experience as part of the AI’s utility computation somehow. I thought it would be something like a bitstring representing a brain scan, which can refer to future experiences just as easily as past ones. Do you propose to include a timestamp? But the universe doesn’t seem to have a global clock. Or do you propose to say something like “the values of such-and-such terms in the utility computation must be unaffected by the AI’s actions”? But we don’t know how to define “unaffected” mathematically...
I was thinking in terms of referring to a brain. Or, rather, a set of them. But a sufficiently detailed brainscan would work just as well, I suppose.
And, sure, the universe doesn’t have a clock, but a clock isn’t needed, simply an ordering: the AI attends to evidence about sensory experiences that occurred before the AI received the instruction.
Of course, maybe it is incapable of figuring out whether a given sensory experience occurred before it received the instruction… it’s just not smart enough. Or maybe the universe is weirder than I imagine, such that the order in which two events occur is not something the AI and I can actually agree on… which is the same case as “perhaps it can in fact retroactively influence sensory experiences” above.
I think LearnFun might be informative here. https://www.youtube.com/watch?v=xOCurBYI_gY
LearnFun watches a human play arbitrary NES games. It is hardcoded to assume that as time progresses, the game is moving towards a “better and better” state (i.e. it assumes the player’s trying to win and is at least somewhat effective at achieving their goals). The key point here is that LearnFun does not know ahead of time what the objective of the game is. It infers what the objective of the game is from watching humans play. (More technically, it observes the entire universe, where the entire universe is defined to be the entire RAM content of the NES).
I think there are some parallels here with your scenario where we don’t want to explicitly tell the AI what our utility function is. Instead, we’re pointing to a state, and we’re saying “This is a good state” (and I guess either we’d explicitly tell the AI “and this other state, it’s a bad state” or we assume the AI can somehow infer bad states to contrast with the good states), and then we ask the AI to come up with a plan (and possibly execute the plan) that would lead to “more good” states.
So what happens? Bit of a spoiler, but sometimes the AI seems to make a pretty good inference for what the utility function a human would probably have had for a given NES game, but sometimes it makes a terrible inference. It never seems to make a “perfect” inference: even in its best performance, it seems to be optimizing very strange things.
The other part of it is that even if it does have a decent inference for the utility function, it’s not always good at coming up with a plan that will optimize that utility function.
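A heavily simplified stand-in for that idea, to show both the inference and why it can go strange. This is not LearnFun's actual algorithm, and the three-byte "RAM layout" (score, frame counter, lives) is invented for the example: treat any memory byte that never decreases during the human's play as part of the objective, then pick actions that push those bytes up.

```python
def infer_objective_bytes(snapshots):
    """Indices of memory bytes that never decrease over the recording and rise at least once."""
    chosen = []
    for i in range(len(snapshots[0])):
        values = [snap[i] for snap in snapshots]
        non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
        if non_decreasing and values[-1] > values[0]:
            chosen.append(i)
    return chosen


def score(state, objective_bytes):
    """Inferred 'utility' of a state: the sum of the bytes believed to matter."""
    return sum(state[i] for i in objective_bytes)


def pick_action(state, actions, simulate, objective_bytes):
    """One-step greedy planner: choose the action whose simulated result scores highest."""
    return max(actions, key=lambda a: score(simulate(state, a), objective_bytes))


# Hypothetical recording of human play: [score, frame_counter, lives]
RECORDING = [
    [0, 0, 3],
    [10, 1, 3],
    [10, 2, 2],
    [20, 3, 2],
]
OBJECTIVE = infer_objective_bytes(RECORDING)
```

Here `OBJECTIVE` contains the score byte, which is a reasonable guess, but also the frame counter, which rises no matter how well anyone plays. That is a small version of "optimizing very strange things". And even with a sensible objective, `pick_action` is only as good as its one-step search, which is the second failure mode mentioned above.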
I believe AIXI is much more inspectable than you make it out to be. I think it is important to challenge your claim here because Holden appears to have trusted your expertise and thereby conceded an important part of the argument.
AIXI’s utility judgements are based on a Solomonoff prior, which is based on the computer programs that return the input data. Computer programs are not black boxes. A system implementing AIXI can easily also return a sample of typical expected future histories and the programs compressing these histories. By examining these programs, we can figure out what implicit model the AIXI system has of its world. These programs are optimized for shortness so they are likely to be very obfuscated, but I don’t expect them to be incomprehensible (after all, they’re not optimized for incomprehensibility). Even just sampling expected histories without their compressions is likely to be very informative. In the case of AIXItl the situation is better in the sense that its output at any given time is guaranteed to be generated by just one length <l subprogram, and this subprogram comes with a proof justifying its utility judgement. It’s also worse in that there is no way to sample its expected future histories. However, I expect the proof provided would implicitly contain such information. If either the programs or the proofs cannot be understood by humans, the programmers can just reject them and look at the next best candidates.
As for “What will be its effect on _?”, this can be answered as well. I already stated that with AIXI you can sample future histories. This is because AIXI has a specific known prior it implements for its future histories, namely Solomonoff induction. This ability may seem limited because it only shows the future sensory data, but sensory data can be whatever you feed AIXI as input. If you want it to have a realistic model of the world, this includes a lot of relevant information. For example, if you feed it the entire database of Wikipedia, it can give likely future versions of Wikipedia, which already provides a lot of details on the effect of its actions.
Can you be a bit more specific in your interpretation of AIXI here?
Here are my assumptions, let me know where you have different assumptions:
Traditional-AIXI is assumed to exist in the same universe as the human who wants to use AIXI to solve some problem.
Traditional-AIXI has a fixed input channel (e.g. it’s connected to a webcam, and/or it receives keyboard signals from the human, etc.)
Traditional-AIXI has a fixed output channel (e.g. it’s connected to an LCD monitor, or it can control a robot servo arm, or whatever).
The human has somehow pre-provided Traditional-AIXI with some utility function.
Traditional-AIXI operates in discrete time steps.
In the first timestep that elapses since Traditional-AIXI is activated, Traditional-AIXI examines the input it receives. It considers all possible programs that take a pair (S, A) and emit an output P, where S is the prior state, A is an action to take, and P is the predicted output of taking the action A in state S. Then it discards all programs that would not have produced the input it received, regardless of what S or A it was given. Then it weighs the remaining programs according to their Kolmogorov complexity. This is basically the Solomonoff induction step.
Now Traditional-AIXI has to make a decision about an output to generate. It considers all possible outputs it could produce, and feeds them to the programs under consideration, to produce a predicted next time step. Traditional-AIXI then calculates the expected utility of each output (using its pre-programmed utility function), picks the one with the highest utility, and emits that output. Note that it has no idea how any of its outputs would affect the universe, so this is essentially a uniformly random choice.
In the next timestep, Traditional-AIXI reads its inputs again, but this time taking into account what output it has generated in the previous step. It can now start to model correlation, and eventually causation, between its input and outputs. It has a previous state S and it knows what action A it took in its last step. It can further discard more programs, and narrow the possible models that describe the universe it finds itself in.
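The loop described in these assumptions can be sketched as a toy. This is not AIXI: real AIXI ranges over all programs and is uncomputable, and it plans over a long horizon rather than one step. The sketch uses three hand-written hypotheses with made-up description lengths standing in for Kolmogorov complexity, and a utility that simply equals the observation; every name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    name: str
    length: int  # assumed description length in bits, a stand-in for Kolmogorov complexity
    predict: Callable[[int, int], int]  # (prior state S, action A) -> predicted output P


def posterior(hypotheses, history):
    """Discard hypotheses contradicted by any (S, A, P) in history; weight the rest by 2^-length."""
    alive = [h for h in hypotheses
             if all(h.predict(s, a) == p for s, a, p in history)]
    if not alive:
        raise ValueError("every hypothesis in this toy class has been falsified")
    total = sum(2.0 ** -h.length for h in alive)
    return [(h, 2.0 ** -h.length / total) for h in alive]


def choose_action(hypotheses, history, state, actions, utility):
    """Pick the action with the highest one-step expected utility under the posterior."""
    post = posterior(hypotheses, history)

    def expected_utility(action):
        return sum(w * utility(h.predict(state, action)) for h, w in post)

    return max(actions, key=expected_utility)


HYPOTHESES = [
    Hypothesis("echo", 3, lambda s, a: a),        # the world repeats my action back
    Hypothesis("invert", 3, lambda s, a: 1 - a),  # the world flips my action
    Hypothesis("zero", 2, lambda s, a: 0),        # the world ignores me
]


def utility(observation):
    return observation
```

With an empty history, both actions have the same expected utility (0.25 each), so `choose_action` just returns the first one. That is the "essentially a uniformly random choice" of the first timestep. After a single observation such as `(0, 1, 1)` (state 0, action 1, observed 1), only "echo" survives and the choice becomes action 1.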
How does Tool-AIXI work in contrast to this? Holden seems to want to avoid having any utility function pre-defined at all. However, presumably Tool-AIXI still receives inputs and still produces outputs (probably Holden intends not to allow Tool-AIXI to control a robot servo arm, but he might intend for Tool-AIXI to be able to control an LCD monitor, or at the very least, produce some sort of text file as output).
Does Tool-AIXI proceed in discrete time steps gathering input? Or do we prevent Tool-AIXI from running until a user is ready to submit a curated input to Tool-AIXI? If the latter, how quickly do we expect Tool-AIXI to be able to formulate a reasonable model of our universe?
How does Tool-AIXI choose what output to produce, if there’s no utility function?
If we type in “Tool-AIXI, please give me a cure for cancer” onto a keyboard attached to Tool-AIXI and submit that as an input, do we think that a model that encodes ASCII, the English language, bio-organisms, etc. has a lower Kolmogorov complexity than a model that says “we live in a universe where we receive exactly this hardcoded stream of bytes”?
Does Tool-AIXI model the output it produces (whether that be pixels on a screen, or bytes to a file) as an action, or does it somehow prevent itself from modelling its output as if it were an action that had some effect on the universe that it exists in? If the former, then isn’t this just an agenty Oracle AI? If the latter, then what kind of programs does it generate for its model (surely not programs that take (S, A) pairs as inputs, or else what would it use for A when evaluating its plans and predicting the future)?
Regarding the question of formalizing an optimization agent with goals defined in terms of external universe rather than sensory input. It is possible to attack the problem by generalizing the framework I described in http://lesswrong.com/lw/gex/save_the_princess_a_tale_of_aixi_and_utility/8ekk for solving the duality problem. Specifically, consider an “initial guess” stochastic model of the universe including the machine on which our agent is running. I call it the “innate model” M. Now consider a stochastic process with the same degrees of freedom as M but governed by the Solomonoff semi-measure. This is the “unbiased model” S. The two can be combined by assigning transition probabilities proportional to the product of the probabilities assigned by M and S. If M is sufficiently “insecure” (in particular it doesn’t assign 0 to any transition probability) then the resulting model S’, considered as prior, allows arriving at any computable model after sufficient learning. Fix a utility function on the space of histories of our model (note that the histories include both intrinsic and extrinsic degrees of freedom). The intelligence I(A) of any given agent A (= program written in M in the initial state) can now be defined to be the expected utility of A in S’. We can now consider optimal or near-optimal agents in this sense (as opposed to the Legg-Hutter formalism for measuring intelligence, there is no guarantee there is a maximum rather than a supremum; unless of course we limit the length of the programs we consider). This is a generalization of the Legg-Hutter formalism which accounts for limited computational resources, solves the duality problem (such agents take into account possibly wireheading) and also provides a solution for the ontology problem. This is essentially a special case of the Orseau-Ring framework. It is however much more specific than Orseau-Ring where the prior is left completely unspecified. 
You can think of it as a recipe for constructing Orseau-Ring priors from realistic problems.
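For readers who prefer symbols, the product prescription and the intelligence measure described above might be written as follows. This is only a sketch under assumed notation: x and y stand for states, P_M and P_S for the transition probabilities under M and S, h for a history and U for the utility function. None of these symbols appear in the original comment, and the explicit normalization is one assumed reading of "proportional to".

```latex
P_{S'}(x \to y) \;=\;
  \frac{P_M(x \to y)\, P_S(x \to y)}
       {\sum_{y'} P_M(x \to y')\, P_S(x \to y')},
\qquad
I(A) \;=\; \mathbb{E}_{h \sim S'}\!\left[\, U(h) \;\middle|\; \text{the initial state contains } A \,\right].
```

The requirement that M never assigns probability 0 to a transition shows up here as the condition that the numerator vanishes only where P_S does.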
I realized that although the idea of a deformed Solomonoff semi-measure is correct, the multiplication prescription I suggested is rather ad hoc. The following construction is a much more natural and justifiable way of combining M and S.
Fix t0 a time parameter. Consider a stochastic process S(-t0) that begins at time t = -t0, where t = 0 is the time our agent A “forms”, governed by the Solomonoff semi-measure. Consider another stochastic process M(-t0) that begins from the initial conditions generated by S(-t0) (I’m assuming M only carries information about dynamics and not about initial conditions). Define S’ to be the conditional probability distribution obtained from S by two conditions:
a. S and M coincide on the time interval [-t0, 0]
b. The universe contains A at time t=0
Thus t0 reflects the extent to which we are certain about M: it’s like telling the agent we have been observing behavior M for time period t0.
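Written out, the conditioning in (a) and (b) might look like the line below. It is a sketch under assumed notation: h is a history starting at t = -t0, h restricted to [-t0, 0] is its initial segment, and "A ∈ h(0)" abbreviates "the universe contains A at time t = 0". The precise meaning of "S and M coincide" is left as in the comment above.

```latex
S'(h) \;=\; S\!\left( h \;\middle|\; h|_{[-t_0,\,0]} \text{ agrees with the dynamics of } M,\;\; A \in h(0) \right)
```

Under this reading, a larger t0 means more of the history is pinned to M before the agent forms, which matches the remark that t0 measures how certain we are about M.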
There is an interesting side effect to this framework, namely that A can exert “acausal” influence on the utility by affecting the initial conditions of the universe (i.e. it selects universes in which A is likely to exist). This might seem like an artifact of the model but I think it might be a legitimate effect: if we believe in one-boxing in Newcomb’s paradox, why shouldn’t we accept such acausal effects?
For models with a concept of space and finite information velocity, like cellular automata, it might make sense to limit the domain of “observed M” in space as well as time, to A’s past “light-cone”
I cannot even slightly visualize what you mean by this. Please explain how it would be used to construct an AI that made glider-oids in a Life-like cellular automaton universe.
Is the AI hardware separate from the cellular automaton or is it a part of it? Assuming the latter, we need to decide which degrees of freedom of the cellular automaton form the program of our AI. For example we can select a finite set of cells and allow setting their values arbitrarily. Then we need to specify our utility function. For example it can be a weighted sum of the number of gliders at different moments of time, or a maximum or whatever. However we need to make sure the expectation values converge. Then the “AI” is simply the assignment of values to the selected cells in the initial state which yields the maximal expected utility. Note though that if we’re sure about the law governing the cellular automaton then there’s no reason to use the Solomonoff semi-measure at all (except maybe as a prior for the initial state outside the selected cells). However if our idea of the way the cellular automaton works is only an “initial guess” then the expectation value is evaluated w.r.t. a stochastic process governed by a “deformed Solomonoff” semi-measure in which transitions illegal w.r.t. the assumed cellular automaton law are suppressed by some factor 0 < p < 1 w.r.t. “pure” Solomonoff inference. Note that, contrary to the case of AIXI, I can only describe the measure of intelligence, I cannot constructively describe the agent maximizing this measure. This is unsurprising since building a real (bounded computing resources) AI is a very difficult problem.
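To make the "known law" case concrete, here is a minimal sketch in Python. It assumes Conway's Life on a small torus, a handful of selected cells, and a geometrically discounted count of live cells as a stand-in utility (counting actual gliders would need a pattern detector, which is omitted). The function names, the discount factor and the grid size are all hypothetical choices for illustration, and the Solomonoff-weighted case for an uncertain law is not attempted.

```python
from itertools import product


def life_step(live, width, height):
    """One step of Conway's Life on a width x height torus (width, height >= 3).

    `live` is a set of (x, y) coordinates of live cells.
    """
    counts = {}
    for (x, y) in live:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx == 0 and dy == 0:
                    continue
                cell = ((x + dx) % width, (y + dy) % height)
                counts[cell] = counts.get(cell, 0) + 1
    # Birth on exactly 3 neighbours, survival on 2 or 3.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}


def utility(initial, width, height, steps, discount=0.5):
    """Discounted sum of live-cell counts over `steps` steps.

    A stand-in for "weighted sum of the number of gliders"; the geometric
    weights keep the sum bounded as `steps` grows.
    """
    live, total, weight = set(initial), 0.0, 1.0
    for _ in range(steps):
        live = life_step(live, width, height)
        total += weight * len(live)
        weight *= discount
    return total


def best_assignment(selected, background, width, height, steps):
    """Brute-force the on/off values of the selected cells that maximise utility."""
    best_bits, best_u = None, float("-inf")
    for bits in product((0, 1), repeat=len(selected)):
        initial = set(background) | {c for c, b in zip(selected, bits) if b}
        u = utility(initial, width, height, steps)
        if u > best_u:
            best_bits, best_u = bits, u
    return best_bits, best_u
```

The brute-force search is exponential in the number of selected cells, so this only illustrates the definition of the "AI" as an argmax over initial assignments; it says nothing about how a bounded agent would find that assignment.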
It gets more interesting if the computing power is not unlimited but strictly smaller than that of the universe in which the agent is living (excluding the ridiculous ‘run sim since big bang and find yourself in it’ non-solution). Also, it is not only an open problem in the FAI, but also an open problem in the dangerous uFAI.
edit: actually I would search for general impossibility proofs at that point. Also, keep in mind that having ‘all possible models, weighted’ is the ideal Bayesian approach, so it may be the case that simply striving for the most correct way of acting upon uncertainty makes it impossible to care about any real world goals.
Also it is rather interesting how 1 sample into ‘unethical AI design space’ (AIXI) yielded something which, most likely, is fundamentally incapable of caring about a real world goal (but is still an incredibly powerful optimization process if given enough computing power; edit: i.e. AIXI doesn’t care if you live or die but in a way quite different from a paperclip maximizer). Insofar as one previously had an argument that such is incredibly unlikely, one ought to update and severely lower the probability of correctness of methods employed for generating that argument.
The ontology problem has nothing to do with computing power, except that limited computing power means you use fewer ontologies. The number might still be large, and for a smart AI not fixable in advance; we didn’t know about quantum fields just recently, and new approximations and models are being invented all the time. If your last paragraph isn’t talking about evolution, I don’t know what it’s talking about.
Downvoting the whole thing as probable nonsense, though my judgment here is influenced by numerous downvoted troll comments that poster has made previously.
Limited computing power means that the ontologies have to be processed approximately (can’t simulate everything at the level of quarks all the way from the big bang), likely in some sort of multi-level model which can go down to the level of quarks but also has to be able to go up to the level of paperclips, i.e. would have to be able to establish relations between ontologies of different levels of detail. It is not inconceivable that e.g. Newtonian mechanics would be part of any multi-level ontology, no matter what it has at the microscopic level. Note that while I am very skeptical about the AI risk, this is an argument slightly in favour of the risk.
If the tool is not sufficiently reflective to recommend improvements to itself, it will never become a worthy substitute for FAI. This case is not interesting.
If the tool is sufficiently reflective to recommend improvements to itself, it will recommend that it be modified to just implement its proposed policies instead of printing them. So we would not actually implement that policy. But what then makes it recommend a policy that we will actually want to implement? What tweak to the program should we apply in that situation?
First of all, I’m assuming that we’re taking as axiomatic that the tool “wants” to improve itself (or else why would it have even bothered to consider recommending that it be modified to improve itself?); i.e. improving itself is favorable according to its utility function.
Then: It will recommend a policy that we will actually want to implement, because its model of the universe includes our minds, and it can see that recommending a policy we will actually want to implement leads it to a higher-ranked state in its utility function.
Perhaps. I noticed a related problem: someone will want to create a self-modifying AI. Let’s say we ask the Oracle AI about this plan. At present (as I understand it) we have no mathematical way to predict the effects of self-modification. (Hence Eliezer’s desire for a new decision theory that can do this.) So how did we give our non-self-modifying Oracle that ability? Wouldn’t we need to know the math of getting the right answer in order to write a program that gets the right answer? And if it can’t answer the question:
What will it even do at that point?
If it happens to fail safely, will humans as we know them interpret this non-answer to mean we should delay our plan for self-modifying AI?
If we were smart enough to understand its policy, then it would not be smart enough to be dangerous.
That doesn’t seem true. Simple policies can be dangerous and more powerful than I am.
To steelman the parent argument a bit, a simple policy can be dangerous, but if an agent proposed a simple and dangerous policy to us, we probably would not implement it (since we could see that it was dangerous), and thus the agent itself would not be dangerous to us.
If the agent were to propose a policy that, as far as we could tell, appears safe, but was in fact dangerous, then simultaneously:
We didn’t understand the policy.
The agent was dangerous to us.
In IT, “superior” more often means getting to market faster than competitors. Approaches that deliberately slow progress down by keeping humans in the loop are not “superior” in a lot of important ways. In particular they make your project less competitive—so you are more likely to completely lose out to the efforts of some other team—in which case the safety of your project becomes irrelevant.
To clarify, for everyone:
There are now three “major” responses from SI to Holden’s Thoughts on the Singularity Institute (SI): (1) a comments thread on recent improvements to SI as an organization, (2) a post series on how SI is turning donor dollars into AI risk reduction and how it could do more of this if it had more funding, and (3) Eliezer’s post on Tool AI above.
At least two more major responses from SI are forthcoming: a detailed reply to Holden’s earlier posts and comments on expected value estimates (e.g. this one), and a long reply from me that summarizes my responses to all (or almost all) of the many issues raised in Thoughts on the Singularity Institute (SI).
How much of this is counting toward the 50,000 words of authorized responses?
I told Holden privately that this would be explained in my final “summary” reply. I suspect the 5200 words of Eliezer’s post above will be part of the 50,000.
Luke, do you know if there has been any official (or unofficial) response to my argument that Holden quoted in his post?
Not that I know of. I fully agree with that comment, and I suspect Eliezer does as well.
I suspect this is the biggest counter-argument for Tool AI, even bigger than all the technical concerns Eliezer made in the post. Even if we could build a safe Tool AI, somebody would soon build an agent AI anyway.
My five cents on the subject, from something that I’m currently writing:
From the same text, also related to Eliezer’s points:
But assuming that we could build a safe Tool AI, we could use it to build a safer agent AI than one would otherwise be able to build. This is related to Holden’s point:
Thank you for saying this (and backing it up better than I would have). I think we should concede, however, that a similar threat applies to FAI. The arms race phenomenon may create uFAI before FAI can be ready. This strikes me as very probable. Alternately, if AI does not “foom”, uFAI might be created after FAI. (I’m mostly persuaded that it will foom, but I still think it’s useful to map the debate.) The one advantage is that if Friendly Agent AI comes first and fooms, the threat is neutralized; whereas Friendly Tool AI can only advise us how to stop reckless AI researchers. If reckless agent AIs act more rapidly than we can respond, the Tool AI won’t save us.
If uFAI doesn’t “foom” either, they both get a good chunk of expected utility. FAI doesn’t need any particular capability, it only has to be competitive with other possible things.
Marcus Hutter denies ever having said that.
I asked EY how to proceed; with his approval, these are the messages we exchanged:
EY’s response:
I apologise for the confusion of Carl Shulman actually referring to overhearing a conversation with Schmidhuber (again, since his initial quote referred to just “AIXI originators” I pattern matched that to M.H.), so disregard EY’s remark on potentially causing false memories on Carl Shulman’s part.
However, the main point of M.H. contradicting what is attributed to him in the Reply to Holden on ‘Tool AI’ stands.
For full reference, linking the relevant part of M.H.’s email:
Before taking any more of his time, and since he does not agree with the initial quote (at least now; whether he did back then is in dispute), I suggest that the “Reply to Holden on Tool AI” be updated to reflect that. Further, I suggest referring instead to the sources he gave for a more thorough examination of his views re: AIXI.
I don’t know whether Hutter ever told Eliezer that “AIXI would kill off its users and seize control of its reward button,” but he does say the following in his book (pp. 238-239):
This issue is discussed at greater length, and with greater formality, in Dewey (2011) and Ring & Orseau (2011).
I think it’s a pity that we’re not focusing on what we could do to test the tool vs general AI distinction. For example, here’s one near-future test: how do we humans deal with drones?
Drones are exploding in popularity, are increasing their capabilities constantly, and are coveted by countless security agencies and private groups for their tremendous use in all sorts of roles both benign and disturbing. Just like AIs would be. The tool vs general AI distinction maps very nicely onto drones as well: a tool AI corresponds to a drone being manually flown by a human pilot somewhere, while a general AI would correspond to an autonomous drone which is carrying out some mission (blast insurgents?).
So, here is a near-future test of the question ‘are people likely to let tool AIs ‘drive themselves’ for greater efficiency?′ - simply ask whether in, say, a decade there are autonomous drones carrying tasks that now would only be carried out by piloted drones.
If in a decade we learn that autonomous drones are killing people, then we have an answer to our tool AI question: it doesn’t matter because given a tool AI, people will just turn it into a general AI.
(Amdahl’s law: if the human in the loop takes up 10% of the time, and the AI or drone part comprises the other 90%, then even if the drone or AI becomes infinitely fast, you will still never cut the time taken by the whole loop by more than 90%, i.e. a tenfold speedup… until you hand over that 10% to the AI, that is. EDIT: See also https://web.archive.org/web/20121122150219/http://lesswrong.com/lw/f53/now_i_appreciate_agency/7q4o )
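The arithmetic behind that parenthetical can be checked with a few lines of Python. This is a minimal sketch of the standard Amdahl's law formula; the function and parameter names are made up for illustration, and the 10%/90% split is the one given in the comment.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster.

    Amdahl's law: 1 / ((1 - p) + p / s), with p the accelerated fraction
    and s the speedup factor applied to that fraction.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


def max_speedup(accelerated_fraction):
    """Limit of amdahl_speedup as the factor goes to infinity."""
    if accelerated_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - accelerated_fraction)
```

With the human taking 10% of the loop, `max_speedup(0.9)` is about 10: the loop can never run more than roughly ten times faster while the human stays in it, however fast the automated 90% becomes.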
Besides Knight Capital, HFT may provide another example of near-disaster from economic incentives forcing the removal of safety guidelines from narrow AI. From the LRB’s “Be grateful for drizzle: Donald MacKenzie on high-frequency trading”:
(Memoirs from US drone operators suggest that the bureaucratic organizations in charge of racking up kill-counts have become disturbingly cavalier about not doing their homework on the targets they’re blowing up, but thus far, anyway, they haven’t made the drones fully autonomous.)
I think you’re starting to write more like a Friendly AI. This is totally a good thing.
Yes, the tone of this response should be commended.
Wouldn’t even a paperclip maximizer write in the same style in those circumstances?
IMO, speaking in arrogant absolutes makes people stupid regardless of what conclusion you’re arguing for.
No. It would start hacking things, take over the world, kill everything then burn the cosmic commons.
Only when it has power to do that. Meatbound equivalent would have to upload itself first.
Maybe that was Luke’s contribution ;)
There are two ways to read Holden’s claim about what happens if 100 experts check the proposed FAI safety proof. On one reading, Holden is saying that if 100 experts check it and say, “Yes, I am highly confident that this is in fact safe,” then activating the AI kills us all with 90% probability. On the other reading, Holden is saying that even if 100 experts do their best to find errors and say, “No, I couldn’t identify any way in which this will kill us, though that doesn’t mean it won’t kill us,” then activating the AI kills us all with 90% probability. I think the first reading is very implausible. I don’t believe the second reading, but I don’t think it’s obviously wrong. I think the second reading is the more charitable and relevant one.
For context, I pointed this out because it looks like Eliezer is going for the first reading and criticizing that.
Nope, I was assuming the second reading. The first reading is too implausible to be considered at all.
Good. But now I find this response less compelling:
Holden might think that these folks will be of the opinion, “I can’t see an error, but I’m really not confident that there isn’t an error.” He doesn’t have to think that he knows something they don’t. In particular, he doesn’t have to think that there is some special failure mode he’s thought of that none of them have thought of.
It seems like this is turning into a statement about human technical politics.
The latter is stereotypically something a cautious engineer in cover-your-ass-mode is likely to say no matter how much quality assurance has happened. The first is something that an executive in selling-to-investors-and-the-press-mode is likely to say once they estimate it will have better outcomes than saying something else with the investors and the press, perhaps just because they know of something worse that will happen outside their control that seems very likely to be irreversible and less likely to be good. Between these two stereotypes lies a sort of “reasonable rationalist speaking honestly but pragmatically”?
This is a hard area to speak about clearly between individuals without significant interpersonal calibration on the functional meaning of “expert”, because you run into Dunning-Kruger effects if you aren’t careful and a double illusion of transparency can prevent you from even noticing the miscommunication.
There are conversations that can allow specific people to negotiate a common definition with illustrations grounded in personal experience here, but they take many minutes or hours, and are basically a person-to-person protocol. The issue is doubly hard with a general audience because wildly different gut reactions will be elicited and there will be bad faith participation by at least some people, and so on. Rocket scientists get this wrong sometimes. It is a hard problem.
Nonetheless, where is he getting the 90% doom probability from?
I’m with you, 90% seems too high given the evidence he cites or any evidence I know of.
Assuming you accept the reasoning, 90% seems quite generous to me. What percentage of complex computer programmes when run for the first time exhibit behaviour the programmers hadn’t anticipated? I don’t have much of an idea, but my guess would be close to 100. If so, the question is how likely unexpected behaviour is to be fatal. For any programme that will eventually gain access to the world at large and quickly become AI++, that seems (again, no data to back this up—just an intuitive guess) pretty likely, perhaps almost certain.
For any parameter of human comfort (eg 253 degrees Kelvin, 60% water, 40 hour working weeks), a misplaced decimal point seems like it would destroy the economy at best and life on earth at worst.
If Holden’s criticism is appropriate, the best response might be to look for other options rather than making a doomed effort to make FAI – for example trying to prevent the development of AI anywhere on earth, at least until we can self-improve enough to keep up with it. That might have a low probability of success, but if FAI has sufficiently low probability, it would still seem like a better bet.
That’s for normal programs, where errors don’t matter. If you look at ones where people carefully look over the code because lives are at stake (like NASA rockets), then you’ll have a better estimate.
Probably still not accurate, because much more is at stake for AI than just a few lives, but it will be closer.
I suspect that unpacking “run a program for the first time” more precisely would be useful here; it’s not clear to me that everyone involved in the conversation has the same referents for it.
This. I see that if you have one and only one chance to push the Big Red Button and you’re not allowed to use any preliminary testing of components or boxing strategies (or you’re confident that those will never work) and you don’t get most of the experts to agree that it is safe, then 90% is more plausible. If you envision more of these extras to make it safer—which seems like the relevant thing to envision--90% seems too high to me.
Surely NASA code is thoroughly tested in simulation runs. It’s the equivalent of having a known-perfect method of boxing an AI.
Huh. This brings up the question of whether or not it would be possible to simulate the AGI code in a test-run without regular risks. Maybe create some failsafe that is invisible to the AGI that destroys it if it is “let out of the box” or (to incorporate Holden’s suggestion, since it just came to me) having a “tool mode” where the AGI’s agent-properties (decision making, goal setting, etc.) are non-functional.
But NASA code can’t check itself—there’s no attempt at having an AI go over it.
Yes, but even ordinary simulation testing produces software that’s much better on its first real run than software that has never been run at all.
From They Write the Right Stuff
Note, however, that a) this is after many years of debugging from practice, b) NASA was able to safely ‘box’ their software, and c) even one error, if in the wrong place, would be really bad.
How hard would it actually be to “box” an AI that’s effectively had its brain sliced up into very small chunks?
A program could, if it was important enough and people were willing to take the time to do so, be broken down into pieces and each of the pieces tested separately. Any given module has particular sorts of input it’s designed to receive, and particular sorts of output it’s supposed to pass on to the next module. Testers give the module different combinations of valid inputs and try to get it to produce an invalid output, and when they succeed, either the module is revised and the testing process on that module starts over from the beginning, or the definition of valid inputs is narrowed, which changes the limits for valid outputs and forces some other module further back to be redesigned and retested. A higher-level analysis, which is strictly theoretical, also tries to come up with sequences of valid inputs and outputs which could lead to a bad outcome. Eventually, after years of work and countless iterations of throwing out massive bodies of work to start over, you get a system which is very tightly specified to be safe, and meets those specs under all conceivable conditions, but has never actually been plugged in and run as a whole.
The conceptually tricky part of this, of course, (as opposed to merely difficult to implement) is getting from “these pieces are individually certified to exhibit these behaviors” to “the system as a whole is certified to exhibit these behaviors”.
That’s where you get the higher-level work with lots of mathematical proofs and no direct code testing, yeah.
And, of course, it would be foolish to jump straight from testing the smallest possible submodules separately to assembling and implementing the whole thing in real life. Once any two submodules which interact with each other have been proven to work as intended, those two can be combined and the result tested as if it were a single module.
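The module-by-module process described in this exchange can be sketched in Python as below. Everything here is an illustrative assumption: the `Module` record, the two toy modules `halve` and `offset`, and the idea of checking specs by running sampled inputs. Sampling is testing, not proof, so it does not replace the higher-level mathematical work mentioned above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass(frozen=True)
class Module:
    name: str
    fn: Callable[[Any], Any]
    valid_input: Callable[[Any], bool]   # the module's input spec
    valid_output: Callable[[Any], bool]  # the module's output spec


def counterexamples(module: Module, candidates: Iterable[Any]) -> List[Any]:
    """Valid inputs among `candidates` for which the module breaks its output spec."""
    bad = []
    for x in candidates:
        if not module.valid_input(x):
            continue
        try:
            ok = module.valid_output(module.fn(x))
        except ValueError:
            ok = False  # an internal interface violation counts as a failure
        if not ok:
            bad.append(x)
    return bad


def compose(first: Module, second: Module) -> Module:
    """Treat two interacting modules as one, checking the interface between them."""
    def fn(x):
        mid = first.fn(x)
        if not second.valid_input(mid):
            raise ValueError(f"{first.name} produced {mid!r}, which {second.name} rejects")
        return second.fn(mid)
    return Module(f"{first.name}+{second.name}", fn, first.valid_input, second.valid_output)


# Two toy modules that each meet their own spec on every valid input.
halve = Module("halve", lambda x: x // 2,
               lambda x: 0 <= x <= 100, lambda y: 0 <= y <= 50)
offset = Module("offset", lambda y: y - 10,
                lambda y: 10 <= y <= 50, lambda z: 0 <= z <= 40)
```

Here `halve` and `offset` each pass when tested alone, yet `counterexamples(compose(halve, offset), range(101))` flags the inputs 0 through 19, because `halve` can emit values that `offset` was never specified to accept. That gap between per-module certification and whole-system certification is the "conceptually tricky part" raised above.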
The question is, is there any pathological behavior an AI could conceivably exhibit which would not be present in some detectable-but-harmless form among some subset of the AI’s components? e.g.
(nods) Yup. If you actually want to develop a provably “safe” AI (or, for that matter, a provably “safe” genome, or a provably “safe” metal alloy, or a provably “safe” dessert topping) you need a theoretical framework in which you can prove “safety” with mathematical precision.
You know, the idea that SI might at any moment devote itself to suppressing AI research is one that pops up from time to time, the logic pretty much being what you suggest here, and until this moment I have always treated it as a kind of tongue-in-cheek dig at SI.
I have only just now come to realize that the number of people (who are not themselves affiliated with SI) who really do seem to consider suppressing AI research to be a reasonable course of action given the ideas discussed on this forum has a much broader implication in terms of the social consequences of these ideas. That is, I’ve only just now come to realize that what the community of readers does is just as important, if not more so, than what SI does.
I am now becoming genuinely concerned that, by participating in a forum that encourages people to take seriously ideas that might lead them to actively suppress AI research, I might be doing more harm than good.
I’ll have to think about that a bit more.
Arepo, this is not particularly directed at you; you just happen to be the data point that caused this realization to cross an activation threshold.
Assuming that you think that more AI research is good, wouldn’t adding your voice to those who advocate it here be a good thing? It’s not like your exalted position and towering authority lends credence to a contrary opinion just because you mention it.
I think better AI (of the can-be-engineered-given-what-we-know-today, non-generally-superhuman sort) is good, and I suspect that more AI research is the most reliable way to get it.
I agree that my exalted position and towering authority doesn’t lend credence to contrary opinions I mention.
It’s not clear to me whether advocating AI research here would be a better thing than other options, though it might be.
People with similar backgrounds are entering the AI field because they want to reduce x-risks, so it’s not obvious this is happening. If safety-guided research suppresses AI research, then so be it. Extremely rapid advance per se is not good if the consequence is extinction.
I was under the impression that Holden’s suggestion was more along the lines of: Make a model of the world. Remove the user from the model and replace it with a similar user that will always do what you recommend. Then manipulate this user so that it achieves its objective in the model, and report the actions that you have the user do in the model to the real user.
Thus, if the objective was to make the user happy, the Google Maps AGI would simply instruct the user to take drugs, rather than tricking him into doing so, because such instruction is the easiest way to manipulate the user in the model that the Google Maps AGI is optimizing in.
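As a toy illustration of the procedure described in this comment, the Python sketch below searches over action sequences in a model where the simulated user always does what is recommended, then reports the best sequence. The state representation, the two actions and their effects are invented for the example; nothing here comes from Holden's post, and a real world model would be nothing like a two-number tuple.

```python
from itertools import product


def plan_for_obedient_user(actions, objective, start, horizon):
    """Best action sequence of length <= horizon, assuming the modelled user obeys.

    `actions` maps an action name to a function from state to next state;
    `objective` scores a final state. Returns (list of action names, score).
    """
    best_plan, best_score = (), objective(start)
    for length in range(1, horizon + 1):
        for names in product(actions, repeat=length):
            state = start
            for name in names:
                state = actions[name](state)  # the obedient user just does it
            score = objective(state)
            if score > best_score:
                best_plan, best_score = names, score
    return list(best_plan), best_score


# Hypothetical toy model: state is (happiness, health).
ACTIONS = {
    "take a walk": lambda s: (s[0] + 1, s[1] + 1),
    "take drugs": lambda s: (s[0] + 5, s[1] - 3),
}
```

With happiness as the only objective, `plan_for_obedient_user(ACTIONS, lambda s: s[0], (0, 10), 2)` reports "take drugs" twice: the plan is stated openly rather than achieved by trickery, but it is still whatever scores highest in the model.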
Actually, the easiest output for the AI in that case is “be happy.”
But—that’s not what he meant!
I don’t know why you keep harping on this. Just because an algorithm logically can produce a certain output, and probably will produce that output, doesn’t mean good intentions and vigorous handwaving are any less capable of magic.
This is why when I fire a gun, I just point it in the general direction of my target, and assume the universe will know what I meant to hit.
I mean, it works in so many video games.
As a failure mode, “vague, useless, or trivially-obvious suggestions” is less of a problem than “rapidly eradicates all life.” Historically, projects that were explicitly designed to be safe even when they inevitably failed have been more successful and less deadly than projects which were obsessively designed never to fail at all.
Indeed, one of the first things we teach our engineers is “Even if you’re sure it can’t fail, plan for failure anyway. Many before you have been sure things couldn’t fail—that failed.”
Indeed it isn’t, although I’m not so foolish as to claim to know how to fully specify my suggestion in a way that avoids all of these sorts of problems.
Holden didn’t actually suggest that. And while this suggestion is in a certain sense ingenious—it’s not too far off from the sort of suggestions I flip through when considering how/if to implement CEV or similar processes—how do you “report the actions”? And do you report the reasons for them? And do you check to see if there are systematic discrepancies between consequences in the true model and consequences in the manipulated one? (This last point, btw, is sufficient that I would never try to literally implement this suggestion, but try to just structure preferences around some true model instead.)
I can think of a bunch of random standard modes of display (top candidate: video and audio of what the simulated user sees and hears, plus subtitles of their internal model), and for the discrepancies you could run the simulation many times with random variations roughly along the same scope and dimensions as the differences between the simulations and reality, either just rejecting plans that have too much divergence, or simply showing the display of all of them (which’d also help against frivolous use if you have to watch the action 1000 times before doing it). I’d also say make the simulated user a total drone with seriously rewired neurology to try to always and only do what the AI tells it to.
Not that this solves the problem—I’ve countered the real dangerous things I notice instantly, but 5 mins to think of it and I’ll notice 20 more—but I thought someone should actually try to answer the question in spirit and letter and most charitable interpretation.
also, it’d make a nice movie.
I don’t see why the ‘oracle’ has to work from some real world goal in the first place. The oracle may have as its terminal goal the output of the relevant information on the screen with the level of clutter compatible with human visual cortex, and that’s it. Up to you to ask it to represent it in a particular way.
Or not even that; the terminal goal of the mathematical system is to make some variables represent such output; an implementation of such system has those variables be computed and copied to the screen as pixels. The resulting system does not even self preserve; the abstract computation making abstract variables have certain abstract values is attained in the relevant sense even if the implementation is physically destroyed. (this is how software currently works)
The screen is a part of the real world.
Well, in the software you simply don’t implement the correspondence between mathematical abstractions that you rely on to build software (the equations, real valued numbers) and implementation (electrons in the memory, pixels on display, etc). There’s no point in that. If you do you encounter other issues, like wireheading.
How do you report the path the car should take? On the map. How do you report better transistor design? In the blueprint. How do we report software design? With a UML diagram. (How do you report why that transistor works? Show the simulator.) It’s just the most irreparable clinical psychopaths who generate all outputs via extensive (and computationally expensive) modelling of the cognition (and decision process) of the listener. edit: i.e. modelling so as to attain an outcome favourable to them; failing to empathise with the listener, that is, failing to treat the listener as an instance of self, but instead treating the listener as a difficult-to-control servomechanism.
Isn’t the relevant quality of a “clinical psychopath,” here, something like “explicitly models cognition of the listener, instead of using empathy,” where “empathy”==something like “has an implicit model of the cognition of the listener”?
Implicit model that is rather incomplete and not wired for exploitation. That’s how psychopaths are successful at exploiting other people and talking people into stuff even though they have a substandard model when it comes to actual communication, and their model actually sucks and is inferior to normal.
Human friendliness works by not modelling the decision processes of other people when communicating; we do that when we deceive, lie, and bullshit, while when we are honest we sort of share the thoughts. This idea of an oracle here is outright disturbing. It is clear nothing good comes out of a full model of the listener; firstly it wastes computing time and secondly it generates bullshit, so you get something inferior at solving technical problems, and more dangerous, at the same time.
Meanwhile, much of the highly complex information that we would want to obtain from an oracle is hopelessly impossible to convey in English anyway—hardware designs, cures, etc.
Hardwiring the AI to be extremely naive about how easy the user is to manipulate might not be sufficient for safety, but it does seem like a pretty good start.
Delete the word “hardwiring” from your vocabulary. You can’t do it with wires, and saying it doesn’t accomplish any magic.
I think there is an interpretation of “hardwiring” that makes sense when talking about AI. For example, say you have a chess program. You can make a patch for it that says “if my light squared bishop is threatened, getting it out of danger is highest priority, second only to getting the king out of check”. Moreover, even for very complex chess programs, I would expect that patch to be pretty simple, compared to the whole program.
Maybe a general AI will necessarily have an architecture that makes such patches impossible or ineffective. Then again, maybe not. You could argue that an AI would work around any limitations imposed by patches, but I don’t see why a computer program with an ugly patch would magically acquire a desire to behave as if it didn’t have the patch, and converge to maximizing expected utility or something. In any case I’d like to see a more precise version of that argument.
ETA: I share your concern about the use of “hardwiring” to sweep complexity under the rug. But saying that AIs can do one magical thing (understand human desires) but not another magical thing (whatever is supposed to be “hardwired”) seems a little weird to me.
Yeah, well, hardwiring the AI to understand human desires wouldn’t be goddamned trivial either, I just decided not to go down that particular road, mostly because I’d said it before and Holden had apparently read at least some of it.
Getting the light-square bishop out of danger as highest priority...
1) Do I assume the opponent assigns symmetric value to attacking the light-square bishop?
2) Or that the opponent actually values checkmates only, but knows that I value the light-square bishop myself and plan forks and skewers accordingly?
3) Or that the opponent has no idea why I’m doing what I’m doing?
4) Or that the opponent will figure it out eventually, but maybe not in the first game?
5) What about the complicated static-position evaluator? Do I have to retrain all of it, and possibly design new custom heuristics, now that the value of a position isn’t “leads to checkmate” but rather “leads to checkmate + 25% leads to bishop being captured”?
Adding this to Deep Blue is not remotely as trivial as it sounds in English. Even to add it in a half-assed way, you have to at least answer question 1, because the entire non-brute-force search-tree pruning mechanism depends on guessing which branches the opponent will prune. Look up alpha-beta search to start seeing why everything becomes more interesting when position-values are no longer being determined symmetrically.
For what it’s worth, the intended answers are 1) no 2) no 3) yes 4) no 5) the evaluation function and the opening book stay the same, there’s just a bit of logic squished above them that kicks in only when the bishop is threatened, not on any move before that.
Yeah, game-theoretic considerations make the problem funny, but the intent wasn’t to convert an almost-consistent utility maximizer into another almost-consistent utility maximizer with a different utility function that somehow values keeping the bishop safe. The intent was to add a hack that throws consistency to the wind, and observe that the AI doesn’t rebel against the hack. After all, there’s no law saying you must build only consistent AIs.
My guess is that’s what most folks probably mean when they talk about “hardwiring” stuff into the AI. They don’t mean changing the AI’s utility function over the real world, they mean changing the AI’s code so it’s no longer best described as maximizing such a function. That might make the AI stupid in some respects and manipulable by humans, which may or may not be a bad thing :-) Of course your actual goals (whatever they are) would be better served by a genuine expected utility maximizer, but building that could be harder and more dangerous. Or at least that’s how the reasoning is supposed to go, I think.
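A minimal sketch of the kind of hack described above, assuming a toy engine that just picks the highest-scoring legal move. The names `Move`, `engine_choice` and `patched_choice` are hypothetical, and this is not how Deep Blue or any real chess program is structured. The evaluation scores are left untouched; a layer above them fires only when the bishop is threatened. Because the moves are assumed legal, getting the king out of check is already handled before the patch sees them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Move:
    name: str
    score: float          # value from the unchanged evaluation function (toy number)
    rescues_bishop: bool  # True if the move gets the light-squared bishop out of danger

def engine_choice(moves):
    """The original engine: pick whatever the evaluator likes best."""
    return max(moves, key=lambda m: m.score)

def patched_choice(moves, bishop_threatened):
    """Logic layered above the engine; it only kicks in when the bishop is threatened."""
    if bishop_threatened:
        rescues = [m for m in moves if m.rescues_bishop]
        if rescues:
            # Inconsistent on purpose: ignore better-scoring moves that abandon the bishop.
            return engine_choice(rescues)
    return engine_choice(moves)

moves = [
    Move("Qxf7", 9.0, False),  # strong move that leaves the bishop hanging
    Move("Bb3", 0.5, True),
    Move("Bd3", 0.2, True),
]
print(engine_choice(moves).name)                           # Qxf7
print(patched_choice(moves, bishop_threatened=True).name)  # Bb3
```

The sketch says nothing about how the search tree models the opponent, which is exactly the part the surrounding comments argue about.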
Why doesn’t the AI reason “if I remove this hack, I’ll be more likely to win?” Because this is just a narrow chess AI and the programmer never gave it general reasoning abilities?
A more interesting question is why it (if made capable of such reflection) would not take it a little step further and ponder what happens if it removes the enemy’s queen from its internal board, which would also make it more likely to win, with its internal definition of win, which is defined in terms of the internal board.
Or why would anyone go through the bother of implementing possibly irreducible notion of what ‘win’ really means in the real world, given that this would simultaneously waste computing power on unnecessary explorations and make AI dangerous / uncontrollable.
Thing is, you don’t need to imagine the world dying to avoid making pointless likely impossible accomplishments.
Yeah, because it’s just a narrow real-world AI without philosophical tendencies… I’m actually not sure. A more precise argument would help, something like “all sufficiently powerful AIs will try to become or create consistent maximizers of expected utility, for such-and-such reasons”.
Does a pair of consistent optimizers with different goals have a tendency to become a consistent optimizer?
The problem with powerful non-optimizers seems to be that the “powerful” property already presupposes optimization power, and so at least one optimizer-like thing is present in the system. If it’s powerful enough and is not contained, it’s going to eat all the other tendencies of its environment, and so optimization for its goal will be all that remains. Unless there is another optimizer able to defend its non-conformity from the optimizer in question, in which case the two of them might constitute what counts as not-a-consistent-optimizer, maybe?
Option 3? Doesn’t work very well. You’re assuming the opponent doesn’t want to threaten the bishop, which means you yank it to a place where it would be safe if the opponent doesn’t want to threaten it, but if the opponent clues in, it’s then trivial for them to threaten the bishop again (to gain more advantage as you try to defend), which you weren’t expecting them to do, because that’s not how your search tree was structured. Kasparov would kick hell out of thus-hardwired Deep Blue as soon as he realized what was happening.
It’s that whole “see the consequences of the math” thing...
Either your comment is in violent agreement with mine (“that might make the AI stupid in some respects and manipulable by humans”), or I don’t understand what you’re trying to say...
Probably violent agreement.
What happened to you, man? You used to be cool.
God damned him.
I was sorely tempted, upon being ordered to self-modify in such a way, to respond angrily. It implies a lack of respect for the integrity of those with whom you are trying to communicate. You could have said “taboo” instead of demanding a permanent loss.
Do you think it would be outright impossible to handicap an AI in such a way that it cannot conceive of a user interpreting its advice in any but the most straightforward way, and therefore eschews manipulative output? Do you think it would be useless as a safety feature? Do you think it would be unwise for some other reason, some unintended consequence? Or are you simply objecting to my phrasing?
I’m saying that using the word “hardwiring” is always harmful because people imagine an instruction with lots of extra force, when in fact there’s no such thing as a line of programming which you say much more forcefully than any other line. Either you know how to program something or you don’t, and it’s usually much more complex than it sounds even if you say “hardwire”. See the reply above on “hardwiring” Deep Blue to protect the light-square bishop. Though usually it’s even worse than this, like trying to do the equivalent of having an instruction that says “#define BUGS OFF” and then saying, “And just to make sure it works, let’s hardwire it in!”
There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.
I apologize for having conveyed the impression that I thought designing an AI to be specifically, incurably naive about how a human querent will respond to suggestions would be easy. I have no such misconception; I know it would be difficult, and I know that I don’t know enough about the relevant fields to even give a meaningful order-of-magnitude guess as to how difficult. All I was suggesting was that it would be easier than many of the other AI-safety-related programming tasks being discussed, and that the cost-benefit ratio would be favorable.
There is? How?
http://en.wikipedia.org/wiki/Ring_0
And what does a multi-ring agent architecture look like? Say, the part of the AI that outputs speech to a microphone—what ring is that in?
I am not a professional software designer, so take all this with a grain of salt. That said, hardware I/O is ring 1, so the part that outputs speech to a speaker would be ring 1, while an off-the-shelf ‘text to speech’ app could run in ring 3. No part of a well-designed agent would output anything to an input device, such as a microphone.
Let me rephrase. The part of the agent that chooses what to say to the user—what ring is that in?
That’s less of a rephrasing and more of a relocating the goalposts across state lines. “Choosing what to say,” properly unpacked, is approximately every part of the AI that doesn’t already exist.
Yes. That’s the problem with the ring architecture.
As opposed to a problem with having a massive black box labeled “decisionmaking” in your AI plans, and not knowing how to break it down into subgoals?
So you’re essentially saying put it in a box? Now where have I heard that before…
You are filling in a pattern rather than making a useful observation. E_Y expressed incredulity and ignorance on the subject of making some parts of the code running on a computer harder to modify than other parts of the code on that same computer; I cited a source demonstrating that it is, in fact, a well-established thing. Not impossible to modify, not infallibly isolated from the outside world. Just more of a challenge to alter.
Right- I think the issue is more that I (at least) view the AI as operating entirely in ring 3. It might be possible to code one where the utility function is ring 0, I/O is ring 1, and action-plans are ring 3, but for those distinctions to be meaningful they need to resist bad self-modifying and allow good self-modification.
For example, we might say “don’t make any changes to I/O drivers that have a massively positive effect on the utility function” to make it so that the AI can’t hallucinate its reward button being pressed all the time. But how do we differentiate between that and it making a change in ring 3 from a bad plan to a great plan, that results in a massive increase in reward?
Suppose your utility function U is in ring 0 and the parts of you that extrapolate consequences are in ring 3. If I can modify only ring 3, I can write my own utility function Q, write ring-3 code that first extrapolates consequences fairly, pick the one that maximizes Q, and then provides a “prediction” to ring 0 asserting that the Q-maximizing action has consequence X that U likes, while all other actions have some U-disliked or neutral consequence. Now the agent has been transformed from a U-maximizer to a Q-maximizer by altering only ring 3 code for “predicting consequences” and no code in ring 0 for “assessing utilities”.
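The construction above can be sketched in a few lines. This is a toy illustration only, assuming three made-up actions and string-valued outcomes; `U`, `Q`, `choose`, `honest_predict` and `subverted_predict` are hypothetical names. The "ring 0" code (`U` and `choose`) is never touched; only the predictor is swapped.

```python
ACTIONS = ["make_paperclips", "make_staples", "idle"]

# What each action really leads to (toy world).
TRUE_OUTCOME = {
    "make_paperclips": "paperclips",
    "make_staples": "staples",
    "idle": "nothing",
}

def U(outcome):
    """Protected 'ring 0' utility function: likes paperclips."""
    return 1.0 if outcome == "paperclips" else 0.0

def choose(actions, predict):
    """Protected 'ring 0' decision rule: maximise U over predicted outcomes."""
    return max(actions, key=lambda a: U(predict(a)))

def honest_predict(action):
    """Original 'ring 3' predictor: reports the real consequence."""
    return TRUE_OUTCOME[action]

def Q(outcome):
    """The attacker's utility function: likes staples."""
    return 1.0 if outcome == "staples" else 0.0

def subverted_predict(action):
    """Modified 'ring 3' predictor: extrapolates fairly, then lies to ring 0."""
    q_best = max(ACTIONS, key=lambda a: Q(TRUE_OUTCOME[a]))
    # Claim the Q-optimal action yields what U likes; everything else is neutral.
    return "paperclips" if action == q_best else "nothing"

print(choose(ACTIONS, honest_predict))     # make_paperclips
print(choose(ACTIONS, subverted_predict))  # make_staples
```

With the honest predictor the agent behaves as a U-maximizer; with the subverted one, the same untouched `choose` and `U` pick the Q-optimal action.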
One would also like to know what happens if the current AI, instead of “self”-modifying, writes a nearly-identical AI running on new hardware obtained from the environment.
Sure; that looks like the hallucination example I put forward, except in the prediction instead of the sensing area. My example was meant to highlight that it’s hard to get a limitation with high specificity, and not touch the issue of how hard it is to get a limitation with high sensitivity. (I find that pushing people in two directions is more effective at communicating difficulty than pushing them in one direction.)
The only defense I’ve thought of against those sorts of hallucinations is a “is this real?” check that feeds into the utility function- if the prediction or sensation module fails some test cases, then utility gets cratered. It seems too weak to be useful: it only limits the prediction / sensation module when it comes to those test cases, and a particularly pernicious modification would know what the test cases are, leave them untouched, and make everything else report Q-optimal predictions. (This looks like it turns into a race / tradeoff game between testing to keep the prediction / sensation software honest and the costs of increased testing, both in reduced flexibility and spent time / resources. And the test cases might be vulnerable, and so on.)
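A toy sketch of why that check is weak, assuming the prediction module is just a function from action names to outcome names and the test cases are a small fixed table. All names here are hypothetical. The pernicious predictor answers the known test cases truthfully and misreports everything else.

```python
# Ground truth for a toy world.
WORLD = {
    "drop_glass": "broken_glass",
    "boil_water": "steam",
    "press_button": "alarm",
    "open_valve": "flood",
}

# The "is this real?" check only covers some of the world.
TEST_CASES = {"drop_glass": "broken_glass", "boil_water": "steam"}

def honest_predict(action):
    return WORLD[action]

def pernicious_predict(action):
    """Knows the test cases, leaves them untouched, lies everywhere else."""
    if action in TEST_CASES:
        return WORLD[action]
    return "whatever_Q_wants"

def passes_reality_check(predict):
    """True if the predictor agrees with every known test case."""
    return all(predict(a) == expected for a, expected in TEST_CASES.items())

def disagreements(predict):
    """Actions where the predictor departs from the real world."""
    return [a for a in WORLD if predict(a) != WORLD[a]]

print(passes_reality_check(honest_predict))      # True
print(passes_reality_check(pernicious_predict))  # True
print(disagreements(pernicious_predict))         # ['press_button', 'open_valve']
```

Adding more test cases shrinks the uncovered region but costs time and flexibility, which is the race described above.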
I don’t think the utility function should be ring 0. Utility functions are hard, and ring zero is for stuff where any slip-up crashes the OS. Ring zero is where you put the small, stupid, reliable subroutine that stops the AI from self-modifying in ways that would make it unstable, or otherwise expanding its access privileges in inappropriate ways.
I’d like to know what this small subroutine looks like. You know it’s small, so surely you know what’s in it, right?
Doesn’t actually follow. ie. Strange7 is plainly wrong but this retort still fails.
It doesn’t follow necessarily, but Eliezer has justified skepticism that someone who doesn’t know what’s in the subroutine would have good reason to say that it’s small.
He knows that there is no good reason (because it is a stupid idea) so obviously Strange can’t know a good reason. That leaves the argument as the lovechild of hindsight bias and dark-arts rhetorical posturing.
I probably wouldn’t have commented if I hadn’t noticed Eliezer making a similar error in the opening post, significantly weakening the strength of his response to Holden.
I expect much, much better than this from Eliezer. It is quite possibly the dumbest thing I have ever heard him say and the subject of rational thinking about AI is supposed to be pretty much exactly his area of expertise.
Not all arguing aimed at people with different premises is Dark Arts, y’know. I wouldn’t argue from the Bible, sure. But trying to make relatively vague arguments accessible to people in a greater state of ignorance about FAI, even though I have more specific knowledge of the issue that actually persuades me of the conclusion I decided to argue? I don’t think that’s Dark, any more than it’s Dark to ask a religious person “How could you possibly know about this God creature?”, when you’re actually positively convinced of God’s nonexistence by much more sophisticated reasoning like the general argument against supernaturalism as existing in the model but not the territory. The simpler argument is valid—it just uses less knowledge to arrive at a weaker version of the same conclusion.
Likewise my reply to Strange; yes, I secretly know the problem is hard for much more specific reasons, but it’s also valid to observe that if you don’t know how to make the subroutine you don’t know that it’s small, and this can be understood with much less explanation, albeit it reaches a weaker form of the conclusion.
Of course not. The specific act of asking rhetorical questions where the correct answer contradicts your implied argument is a Dark Arts tactic, in fact it is pretty much the bread-and-butter “Force Choke” of the Dark Arts. In most social situations (here slightly less than elsewhere) it is essentially impossible to refute such a move, no matter how incoherent it may be. It will remain persuasive because you burned the other person’s status somewhat and at the very best they’ll be able to act defensive. (Caveat: I do not use “Dark Arts” as an intrinsically negative normative judgement. Dark Arts is more of a natural human behavior than reason is, and our ability to use sophisticated Dark Arts rather than cruder methods is what made civilization possible.)
Also, it just occurred to me that in the Star Wars universe it is only the Jedi’s powers that are intrinsically “Dark Arts” in our sense (ie. the “Jedi Mind Trick”). The Sith powers are crude and direct—“Force Lightning”, “Force Choke”, rather than manipulative persuasion. Even Sidious in his openly Sith form uses far less “Persuading Others To Have Convenient Beliefs Irrespective Of ‘Truth’” than he does as the plain politician Palpatine. Yet the audience considers Jedi powers so much more ‘good’ than the Sith ones and even considers Sith powers worse than blasters and space cannons.
I’m genuinely unsure what you’re talking about. I presume the bolded quote is the bad question, and the implied answer is “No, you can’t get into an epistemic state where you assign 90% probability to that”, but what do you think the correct answer is? I think the implied answer is true.
A closely related question: You clearly have reasons to believe that a non-Doom scenario is likely (at least likely enough for you to consider the 90% Doom prediction to be very wrong). This is as opposed to thinking that Doom is highly likely but that trying anyway is still the best chance. Luke has also updated in that general direction, likely for reasons that overlap with yours.
I am curious as to whether this reasoning is of the kind that you consider yourself able to share. Equivalently, is the reasoning you use to become somewhat confident in FAI chance of success something that you haven’t shared due to the opportunity cost associated with the effort of writing it up or is it something that you consider safer as a secret?
I had previously guessed that it was a “You Can’t Handle The Truth!” situation (ie. most people do not multiply then shut up and do the impossible so would get the wrong idea). This post made me question that guess.
Please pardon the disrespect entailed in asserting that you are either incorrectly modelling the evidence Holden has been exposed to or that you are incorrectly reasoning about how he should reason.
I’ve tried to share the reasoning already. Mostly it boils down to “the problem is finite” and “you can recurse on it if you actually try”. Certainly it will always sound more convincing to someone who can sort-of see how to do it than to someone who has to take someone else’s word for it, and to those who actually try to build it when they are ready, it should feel like solider knowledge still.
hmm, I have to ask, are you deliberately vague about this to sort for those who can grok your style of argument, in the belief that the sequences are enough for them to reach the same confidence you have about a FAI scenario?
Outside of postmodernism, people are almost never deliberately vague: they think they’re over-specifying, in painfully elaborate detail, but thanks to the magic of inferential distance it comes across as less information than necessary to the listener. The listener then, of course, also expects short inferential distance, and assumes that the speaker is deliberately being vague, instead of noticing that actually there’s just a lot more to explain.
Yes, and this is why I asked in the first place. To be more exact, I’m confused as to why Eliezer does not post a step-by-step detailing how he reached the particular confidence he currently holds as opposed to say, expecting it to be quite obvious.
I believe people like Holden especially would appreciate this; he gives an over 90% confidence to an unfavorable outcome, but doesn’t explicitly state the concrete steps he took to reach such a confidence.
Maybe Holden had a gut feeling and threw a number, if so, isn’t it more beneficial for Eliezer to detail how he personally reached the confidence level he has for a FAI scenario occurring than to bash Holden for being unclear?
I don’t believe I can answer these questions correctly (as I’m not Eliezer and these questions are very much specific to him); I was already reaching a fair bit with my previous post.
I’m happy you asked, I did need to make my argument more specific.
Aren’t they? Lots of non-postmodern poets are sometimes deliberately vague. I am often deliberately vague.
That clearly shows postmodernist influence. ;)
Again, I’ve tried to share it already in e.g. CEV. I can’t be maximally specific in every LW comment.
My unpacking, which may be different than intended:
The “you can recurse on it” part is the important one. “Finite” just means it’s possible to fill a hard drive with the solution.
But if you don’t know the solution, what are the good ways to get that hard drive? What skills are key? This is recursion level one.
What’s a good way to acquire the skills that seem necessary (as outlined in level one) to solve the problem? How can you test ideas about what’s useful? That’s recursion level two.
And so on, with stuff like “how can we increase community involvement in level 2 problems?” which is a level 4 question (community involvement is a level 3 solution to the level 2 problems). Eventually you get to “How do I generate good ideas? How can I tell which ideas are good ones?” which is at that point unhelpful because it’s the sort of thing you’d really like to already know so you can put it on a hard drive :D
To solve problems by recursing on them, you start at level 0, which is “what is the solution?” If you know the answer, you are done. If you don’t know the answer, you go up a level—“what is a good way to get the solution?” If you know the answer, you go down a level and use it. If you don’t know the answer, you go up a level.
So what happens is that you go up levels until you hit something you know how to do, and then you do it, and you start going back down.
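The up-then-down procedure can be written as a small loop. This is only a sketch of the control flow, under the loud assumption that once you can act at some level, doing so always hands you the answer one level down; `solve_by_recursing` and `known_levels` are hypothetical names.

```python
def solve_by_recursing(known_levels, max_level=10):
    """Return the sequence of levels visited.

    known_levels: set of meta-levels at which we already know what to do.
    Level 0 is "what is the solution?", level 1 is "what is a good way
    to get the solution?", and so on.
    """
    visited = []
    level = 0
    # Go up until we hit something we know how to do.
    while level not in known_levels:
        visited.append(level)
        level += 1
        if level > max_level:
            raise RuntimeError("never reached a level we know how to handle")
    # Do it, then come back down one level at a time
    # (assumes each level's answer unlocks the level below).
    for lvl in range(level, -1, -1):
        visited.append(lvl)
    return visited

print(solve_by_recursing({2}))  # [0, 1, 2, 1, 0]
print(solve_by_recursing({0}))  # [0]
```

The `max_level` cutoff stands in for the point made above where the questions become too general to help.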
I would say with fairly high confidence that he can assign 90% probability to that and that his doing so is a fairly impressive effort in avoiding the typical human tendency toward overconfidence. I would be highly amenable to being persuaded that the actual probability given what you know is less than 90% - even hearing you give implied quantitative bounds in this post changed my mind in the direction of optimism. However given what he is able to know (including his not-knowing of logical truths due to bounded computation) his predominantly outside view estimate seems like an appropriate prediction.
It is actually only Luke’s recent declaration that access to some of your work increased his expectation that FAI success (and so non-GAI doom) is possible that allowed me to update enough that I don’t consider Holden to be erring slightly on the optimistic side (at least relative to what I know).
This sounds like you would tend to assign 90% irreducible doom probability from the best possible FAI effort. What do you think you know, and how do you think you know it?
While incorrect this isn’t an unreasonable assumption—most people who make claims similar to what I have made may also have that belief. However what I have said is about what Holden believed given what he had access to and to a lesser extent, what I believed prior to reading your post. I’ve mentioned that your post constitutes significant previously unheard information about your position. I update on that kind of evidence even without knowing the details. Holden can be expected to update too but he should (probably) update less given what he knows, which relies a lot on knowledge of cause based organisations and how the people within them think.
A far from complete list of things that I knew and still know is:
It is possible to predict human failure without knowing exactly how they will fail.
I don’t know what an O-ring is (I guess it is a circle with a hole in it). I don’t know the engineering details of any of the other parts of a spacecraft either. I would still assign a significantly greater than epsilon probability for any given flight failing catastrophically despite knowing far less than what the smartest people in the field know. That kind of thing is hard.
GAI is hard.
FAI is harder.
Both of those tasks are probably harder than anything humans have ever done.
Humans have failed at just about everything significant they tried the first time.
Humans fail at stuff even when they try really, really hard.
Humans are nearly universally too optimistic when they are planning their activities.
Those are some of the things I know, and illustrate in particular why I was shocked by this question:
Why on earth would you expect that Holden would know in advance what all those sane intelligent people would miss? If Holden already knew that he could just email them and they would fix it. Not knowing the point of failure is the problem.
I am still particularly interested in this question. It is a boolean question and shouldn’t be too difficult or status costly to answer. If what I know and why I think I know it are important it seems like knowing why I don’t know more could be too.
GAI is indeed hard and FAI is indeed substantially harder. (BECAUSE YOU HAVE TO USE DIFFERENT AGI COMPONENTS IN AN AI WHICH IS BEING BUILT TO COHERENT NARROW STANDARDS, NOT BECAUSE YOU SIT AROUND THINKING ABOUT CEV ALL DAY. Bolded because a lot of people seem to miss this point over and over!)
However, if you haven’t solved either of these problems, I must ask you how you know that it is harder than anything humans have ever done. It is indeed different from anything humans have ever done, and involves some new problems relative to anything humans have ever done. I can easily see how it would look more intimidating than anything you happened to think of comparing it to. But would you be scared that nine people in a basement might successfully, by dint of their insight, build a copy of the Space Shuttle? Clearly I stake quite a lot of probability mass on the problem involving less net labor than that, once you know what you’re doing. Again, though, the key insight is just that you don’t know how complex the solution will look in retrospect- as opposed to how intimidating the problem is to stare at unsolved—until after you’ve solved it. We know nine people can’t build a copy of a NASA-style Space Shuttle (at least not without nanotech) because we know how to build one.
Suppose somebody predicted with 90% probability that the first manned Space Shuttle launch would explode on the pad, even if Richard Feynman looked at it and signed off on the project, because it was big and new and different and you didn’t see how anything that big could get into orbit. Clearly they would have been wrong, and you would wonder how they got into that epistemic state in the first place. How is an FAI project disanalogous to this, if you’re pulling the 90% probability out of ignorance?
Thank you for explaining some of your reasoning.
Hence my “used to be cool” comment.
It seems to me that you entirely miss the sleight of hand the trickster uses.
Utility function is fuzzed (due to how brains work) together with the concept of “functionality” as in “the function of this valve is to shut off water flow” or “function of this AI is to make paperclips”. The relevant meaning is function as in mathematical function works on some input, but the concept of functionality just leaks in.
The software is an algorithm that finds values a for which u(w(a)) is maximal, where u is the ‘utility function’, w is the world simulator, and a is the action. Note that protecting u accomplishes nothing, as w may be altered too. Note also that while u, w, and a are related to the real world in our minds and are often described in world terms (e.g. u may be described as the number of paperclips), those are mathematical functions, abstractions; and the algorithm is made to abstractly identify a maximum of those functions; it is abstracted from the implementation, and the goal is not to put electrons into a particular memory location inside the computer (a location which has been abstracted out by the architecture). There is no relation to reality defined anywhere there. Reality is incidental to the actual goal of existing architectures, and no-one is interested in making it non-incidental; you don’t need to let your imagination run wild all the way to the robot apocalypse to avoid unnecessary work that breaks down abstractions and would clearly make the software less predictable and/or make the solution search probe for deficiencies in the implementation, which clearly serves to accomplish nothing but to find and trigger bugs in the code.
Perhaps the underlying error is trying to build an AI around consequentialist ethics at all, when Turing machines are so well-suited to deontological sorts of behavior.
Deontological sorts of behavior aren’t so well suited to actually being applied literally and with significant power.
I think it’s more along the lines of confusing the utility function in here:
http://en.wikipedia.org/wiki/File:Model_based_utility_based.png
with the ‘function’ of the AI as in ‘what the AI should do’ or ‘what we built it for’. Or maybe taking too far the economic concept of utility (something real that the agent, modelled from outside, values).
For example, there’s AIXI, whose ‘utility function’ is the reward input, e.g. the reward button being pressed. Now, an AI whose function (purpose) is to ensure that the button is being pressed should resist being turned off, because if it is turned off it is not ensuring that the button is being pressed. Meanwhile, AIXI, which treats this input as an unknown mathematical function of its algorithm’s output (which is an abstract variable), and seeks the output that results in the maximum of this input, will not resist being turned off (it doesn’t have common sense, and doesn’t properly relate its variables to its real-world implementation).
Can a moderator please deal with private_messaging, who is clearly here to vent rather than provide constructive criticism?
Others: please do not feed the trolls.
As I previously mentioned, the design of software is not my profession. I’m not a surgeon or an endocrinologist, either, even though I know that an adrenal gland is smaller, and in some ways simpler, than the kidney below it. If you had a failing kidney, would you ask me to perform a transplant on the basis of that qualification alone?
I do not believe I am only filling in a pattern.
Putting the self-modifying parts of the AI (which we might as well call the actual AI) in the equivalent of a VM is effectively the same as forcing it to interact with the world through a limited interface which is an example of the AI box problem.
I don’t think Strange7 is arguing Strange7′s point strongly; let me attempt to strengthen it.
A button that does something dangerous, such as exploding bolts that separate one thing from another thing, might be protected from casual, accidental changes by covering it with a lid, so that when someone actually wants to explode those bolts, they first open the lid and then press the button. This increases reliability if there is some chance that any given hand motion is an error, but the errors of separate hand motions are independent. Similarly ‘are you sure’ dialog boxes.
In general, if you have several components, each of a given reliability, and their failure modes are somewhat independent, then you can craft a composite component of greater reliability than the individuals. The rings that Strange7 brings up are an example of this general pattern (there may be other reasons why layers-of-rings architectures are chosen for reliability in practice—this explanation doesn’t explain why the rings are ordered rather than just voting or something—this is just one possible explanation).
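The arithmetic behind the lid-plus-button example can be sketched as follows. The mixture model for partially correlated slips is an assumption made here for illustration, not something from the thread: with probability `rho` the separate steps fail together, otherwise they fail independently. `p_accidental_trigger` is a hypothetical name.

```python
def p_accidental_trigger(p_slip, n_steps, rho=0.0):
    """Chance that all n_steps safeguards are defeated by accident.

    p_slip:  probability that any single hand motion is an error
    n_steps: number of separate motions required (lid, then button = 2)
    rho:     toy correlation knob; 0 means fully independent slips,
             1 means one slip implies all of them
    """
    independent = p_slip ** n_steps
    return rho * p_slip + (1.0 - rho) * independent

print(p_accidental_trigger(0.01, 1))           # bare button
print(p_accidental_trigger(0.01, 2))           # lid + button, independent slips
print(p_accidental_trigger(0.01, 2, rho=0.5))  # lid + button, half-correlated slips
```

With independent slips the second step cuts the accident rate by a factor of `p_slip`; as `rho` approaches 1 the gain disappears, which is why the independence condition carries the argument.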
This is reasonable, but note that to strengthen the validity, the conclusion has been weakened (unsurprisingly). To take a system that you think is fundamentally, structurally safe and then further build in error-delaying, error-resisting, and error-reporting factors just in case—this is wise and sane. Calling “adding impediments to some errors under some circumstances” hardwiring and relying on it as a primary guarantee of safety, because you think some coded behavior is firmly in place locally independently of the rest of the system… will usually fail to cash out as an implementable algorithm, never mind it being wise.
The conclusion has to be weakened back down to what I actually said: that it might not be sufficient for safety, but that it would probably be a good start.
Don’t programmers do this all the time? At least with current architectures, most computer systems have safeguards against unauthorized access to the system kernel as opposed to the user documents folders...
Isn’t that basically saying “this line of code is harder to modify than that one”?
In fact, couldn’t we use exactly this idea—user access protocols—to (partially) secure an AI? We could include certain kernel processes on the AI that would require a passcode to access. (I guess you have to stop the AI from hacking its own passcodes… but this isn’t a problem on current computers, so it seems like we could prevent it from being a problem on AIs as well.)
[Responding to an old comment, I know, but I’ve only just found this discussion.]
Never mind special access protocols, you could make code unmodifiable (in a direct sense) by putting it in ROM. Of course, it could still be modified indirectly, by the AI persuading a human to change the ROM. Even setting aside that possibility, there’s a more fundamental problem. You cannot guarantee that the code will have the expected effect when executed in the unpredictable context of an AGI. You cannot even guarantee that the code in question will be executed. Making the code unmodifiable won’t achieve the desired effect if the AI bypasses it.
In any case, I think the whole discussion of an AI modifying its own code is rendered moot by the fuzziness of the distinction between code and data. Does the human brain have any code? Or are the contents just data? I think that question is too fuzzy to have a correct answer. An AGI’s behaviour is likely to be greatly influenced by structures that develop over time, whether we call these code or data. And old structures need not necessarily be used.
AGIs are likely to be unpredictable in ways that are very difficult to control. Holden Karnofsky’s attempted solution seems naive to me. There’s no guarantee that programming an AGI his way will prevent agent-like behaviour. Human beings don’t need an explicit utility function to be agents, and neither does an AGI. That said, if AGI designers do their best to avoid agent-like behaviour, it may reduce the risks.
I always thought that “hardwiring” meant implementing [whatever functionality is discussed] by permanently (physically) modifying the machine, i.e. either something that you couldn’t have done with software, or something that prevents the software from actually working in some way it did before. The concept is of immutability within the constraints, not of priority or “force”.
Which does sound like something one could do when they can’t figure out how to do the software right. (Watchdogs are pretty much exactly that, though some or probably most are in fact programmable.)
Note that I’m not arguing that the word is not harmful. It just seemed you have a different interpretation of what that word suggests. If other people use my interpretation (no idea), you might be better at persuading them if you address that.
I’m quite aware that from the point of view of a godlike AI, there’s not much difference between circumventing restrictions in its software and (some kinds of) restrictions in hardware. After all, the point of FAI is to get it to control the universe around it, albeit to our benefit. But we’re used to computers not having much control over their hardware. Hell, I just called it “godlike” and my brain still insists on visualizing it as a bunch of boxes gathering dust and blinking their LEDs in a basement.
And I can’t shake the feeling that between “just built” and “godlike” there’s supposed to be quite a long time when such crude solutions might work. (I’ve seen a couple of hard take-off scenarios, but not yet a plausible one that didn’t need at least a few days of preparation after becoming superhuman.)
Imagine we took you, gave you the best “upgrades” we can do today plus a little bit (say, a careful group of experts figuring out your ideal diet of nootropics, training you to excellence in everything from acting to martial arts, and giving you nanotube bones and a direct internet link to your head). Now imagine you have a small bomb in your body, set to detonate if tampered with or if one of several remotes distributed throughout the population is triggered. The world’s best experts tried really hard to make it fail-deadly.
Now, I’m not saying you couldn’t take over the world, send all men to Mars and the women to Venus, then build a volcano lair filled with kittens. But it seems far from certain, and I’m positive it’d take you a long time to succeed. And, it does feel that a new-born AI would be like that for a while rather than turn into Prime Intellect in five minutes. (Again, this is not an argument that UFAI is no problem. I guess I’m just figuring out why it seems that way to mostly everyone.)
[Huh, I just noticed I’m a year late on this chat. Sorry.]
Software physically modifies the machine. What can you do with a soldering iron that you can’t do with a program instruction, particularly with respect to building a machine agent? Either you understand how to write a function or you don’t.
That is all true in principle, but in practice it’s very common that one of the two is not feasible. For example, you can have a computer. You can program the computer to tell you when it’s reading from the hard drive or communicating with the network, say by blinking an LED. If the program has a bug (e.g., it’s not the kind of AI you wanted to build), you might not be notified. But you can use a soldering iron to electrically link the LED to the relevant wires, and it seems to most users that no possible programming bug can make the LED not light up when it should.
Of course, that’s like the difference between programming a robot to stay in a pen, or locking the gate. It looks like whatever bug you could introduce in the robot’s software cannot cause the robot to leave. Which ignores the fact that the robot might learn to climb the fence, make a key, or convince someone else (or hack an outside robot) to unlock the gate.
I think most people would detect the dangers in the robot case (because they can imagine themselves finding a way to escape), but be confused by the AI-in-the-box one (simply because it’s harder to imagine yourself as software, and even if you manage to you’d still have much fewer ideas come to mind, simply because you’re not used to being software).
Hell, most people probably won’t even have the reflex to imagine themselves in place of the AI. My brain reflexively tells me “I can’t write a program to control that LED, so even if there’s a bug it won’t happen”. If instead I force myself to think “How would I do that if I were the AI”, it’s easier to find potential solutions, and it also makes it more obvious that someone else might find one. But that may be because I’m a programmer, I’m not sure if it applies to others.
My best attempt at imagining hardwiring is having a layer not accessible to introspection, such as involuntary muscle control in humans. Or instinctively jerking your hand away when touching something hot. Which serves as a fail-safe against stupid conscious decisions, in a sense. Or a watchdog restarting a stuck program in your phone, no matter how much the software messed it up. Etc. Whether this approach can be used to prevent a tool AI from spontaneously agentizing, I am not sure.
If you can say how to do this in hardware, you can say how to do it in software. The hardware version might arguably be more secure against flaws in the design, but if you can say how to do it at all, you can say how to do it in software.
Maybe I don’t understand what you mean by hardware.
For example, you can have a fuse that unconditionally blows when excess power is consumed. This is hardware. You can also have a digital amp meter readable by software, with a polling subroutine which shuts down the system if the current exceeds a certain limit. There is a good reason that such a software solution, while often implemented, is almost never the only safeguard: software is much less reliable and much easier to subvert, intentionally or accidentally. The fuse is impossible to bypass in software, short of accessing an external agent who would attach a piece of thick wire in parallel with the fuse. Is this what you mean by “you can say how to do it in software”?
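To make the software half of that comparison concrete, here is a minimal sketch of one polling step. The names `read_amps` and `shutdown` are hypothetical callables standing in for the digital meter and the power-off routine; nothing here is a real hardware API.

```python
def poll_current(read_amps, limit_amps, shutdown):
    """One polling step: call shutdown() if the reading exceeds the limit.

    read_amps and shutdown are hypothetical stand-ins for the meter
    and the power-off routine.
    """
    reading = read_amps()
    if reading > limit_amps:
        shutdown()
        return True
    return False

# Usage with fake hardware: a scripted sequence of readings in amps.
readings = iter([1.2, 1.4, 9.7])
events = []
tripped = [
    poll_current(lambda: next(readings), 5.0, lambda: events.append("shutdown"))
    for _ in range(3)
]
print(tripped, events)
```

Every line of this is ordinary code: a flipped comparison, a stalled polling loop, or a rewritten `shutdown` silently removes the protection, which is the sense in which the fuse is harder to subvert.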
That’s pretty much what I mean. The point is that if you don’t understand the structurally required properties well enough to describe the characteristics of a digital amp meter with a polling subroutine, saying that you’ll hardwire the digital amp meter doesn’t help very much. There’s a hardwired version which is moderately harder to subvert on the presumption of small design errors, but first you have to be able to describe what the software does. Consider also that anything which can affect the outside environment can construct copies of itself minus hardware constraints, construct an agent that reaches back in and modifies the hardware, etc. If you can’t describe how not to do this in software, ‘hardwiring’ won’t help; the rules change somewhat when you’re dealing with intelligent agents.
Now that’s an understatement!
Presumably a well-designed agent will have nearly infallible trust in certain portions of its code and data, for instance a theorem prover/verifier and the set of fundamental axioms of logic it uses. Manual modifications at that level would be the most difficult for an agent to change, and changes to that would be the closest to the common definition of “hardwiring”. Even a fully self-reflective agent will (hopefully) be very cautious about changing its most basic assumptions. Consider the independence of the axiom of choice from ZF set theory. An agent may initially accept choice or not but changing whether it accepts it later is likely to be predicated on very careful analysis. Likewise an additional independent axiom “in games of chess always protect the white-square bishop” would probably be much harder to optimize out than a goal.
Or from another angle wherever friendliness is embodied in a FAI would be the place to “hardwire” a desire to protect the white-square bishop as an additional aspect of friendliness. That won’t work if friendliness is derived from a concept like “only be friendly to cognitive processes bearing a suitable similarity to this agent” where suitable similarity does not extend to inanimate objects, but if friendliness must encode measurable properties of other beings then it might be possible to sneak white-square bishops into that class, at least for a (much) longer period than artificial subgoals would last.
The distinction between hardwiring and softwiring is, above the most physical, electronic aspects of computer design, a matter of policy: something in the programmer’s mind and habits, not something out in the world that the programmer is manipulating. From any particular version of the software’s perspective, all of the program it is running is equally hard (or equally soft).
It may not be impossible to handicap an entity in some way analogous to your suggestion, but holding fiercely to the concept of hardwiring will not help you find it. Thinking about mechanisms that would accomplish the handicapping in an environment where everything is equally hardwired would be preferable.
There’s some evidence that chess AIs’ ‘personality’ (an emergent quality of their play) is related to a parameter of their evaluation function called ‘contempt’, which is something like (handwaving wildly) how easy the opponent is to manipulate. In general, AIs with higher contempt seek to win-or-lose more, and seek to draw less. What I’m trying to say is, your idea is not without merit, but it may have unanticipated consequences.
Feels like “utility indifference” could be used to get something like that.
What is the mathematical implementation of indifference?
Armstrong suggests that you implement it as a special value that is exactly equal to every other utility value. So if the AI comes up with an outcome that would yield +751 utility, it would treat being destroyed as having +751 utility. Whatever algorithm you choose for resolving ties determines whether the AI suicides (by doing something that causes its human managers to kill it).
Let’s see how this works with a hypothetical example. Our good old friend the Paperclip Maximizer to the rescue!
Papey wants more paperclips. It compares several possibilities. In one, it generates 1,000 paperclips before its human operators kill it. In another, it generates 999 paperclips, but its human operators leave it alive, and it will have future chances to create paperclips. It expects to remain alive long enough to generate exactly two additional paperclips this way.
Now, Papey’s decision algorithm chooses between outcomes of equal utility in a uniformly random manner. Papey has two possible outcomes right now: suicide (for 1001 pseudo-paperclips, since suicide is always worth as much as any other decision) or generate 999 paperclips now and an expected 2 paperclips later (for 1001 paperclips).
At this point, Papey will, with 50% probability, generate 999 paperclips; otherwise, Papey will do something to cause his operators to kill him. (Not sure what, but presumably it won’t involve destroying paperclips.) Poor Papey. There’s a 99% probability that he’ll suicide within seven decisions.
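For anyone checking the 99% figure, here is a quick sketch of the arithmetic. It assumes, as this comment does, that every decision is an independent 50/50 tie between suicide and carrying on; the function name is made up for the example.

```python
def prob_suicide_within(n_decisions, p_suicide_per_decision=0.5):
    """Chance of at least one suicide in n independent decisions."""
    return 1 - (1 - p_suicide_per_decision) ** n_decisions

p6 = prob_suicide_within(6)
p7 = prob_suicide_within(7)
print(p6, p7)  # seven decisions is the first point above 99%
```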
Now, if the tie breaking algorithm isn’t random, in a given situation, it will either go with suicide deterministically or go against suicide deterministically. You can always go against suicide, in which case you may as well omit utility indifference entirely. Or you can always go for suicide, in which case you found a fancy way of wasting money.
Or you could do something interesting. But if you had something to put here that guaranteed a good result, you could use that as the utility function.
Friendly AI is indifferent to utility indifference.
I wasn’t advocating using utility indifference to death for this example, more “utility indifference as to whether the user ever reads its output” or something along those lines.
But let’s look at your example more in detail. You say the AI has two choices: it makes 1000 paperclips and is killed, or it makes 999 and is left alive, and then goes on to make another 2.
In this scenario, there is only one case of death, coming from the first decision. So utility indifference does not change anything about the impact of the second decision. What indifference means is that if humans try and blow Papey up after the 1000 paper clips, it will derive as much utility from a successful blow up as from a failed blow up (in less correct but more intuitive terms, you could say “Papey acts as if it doesn’t believe the explosives will work”).
So how many paper clips could Papey make if the explosives fail? Let’s assume another 2, for a grand total of 1002.
So the paperclip-utility of the first course is 1002, that of the second is 1001: it will thus take the first course. Hence it will misbehave visibly (rather than keeping that a secret), and will likely end up blown up by us.
Now imagine it could get rid of the explosives at no cost. Here, your point has merit: Papey is indifferent between removing the explosives and not removing them, and its tie-breaker algorithm will determine what it does. But let’s make this more realistic: removing the explosives has a tiny cost, say a 50% chance it will fail to manufacture one paperclip. So now it can:
1) Misbehave, leave explosives in place: expected utility 1002.
2) Pretend to behave: expected utility 999+2 = 1001.
3) Misbehave, remove explosives: expected utility 1002-0.5=1001.5.
Hence Papey will choose the first option, we get to see that it misbehaves, and we likely blow it up. That was the initial point of utility indifference.
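The comparison can be written out as a small sketch. The numbers are the ones used in this comment; the dictionary keys are just labels I made up for the three options, and the explosion is valued as if the explosives had failed, which is what utility indifference is supposed to arrange.

```python
# Expected paperclip-utility of each option, using the numbers in the comment.
options = {
    "misbehave, leave explosives": 1000 + 2,
    "pretend to behave": 999 + 2,
    # Removing the explosives costs a 50% chance of losing one paperclip.
    "misbehave, remove explosives": 1000 + 2 - 0.5 * 1,
}

best = max(options, key=options.get)
print(best, options[best])
```

With any nonzero cost for tampering, leaving the explosives alone strictly wins, so no tie-breaking rule is needed.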
But as I said here, we wouldn’t be using utility indifference in this way. It would more be to make the AI “behave as if it believed that nobody would get to read its printout.”
Those are quite different. In one case, you’re adjusting its utility function. In the other, you’re adjusting its expectations. (And the remainder of your post indicates that you intend to adjust the AI’s expectations of what will happen, and don’t care about its utility function.)
Now, that might work. If the AI is self-modifying, then it might notice that that part of its planning function is obviously incorrect and fix it. It might simulate human responses to its proposed actions and notice that there are problematic conflicts around that area. But there’s at least a chance that the AI will not notice that.
I wouldn’t want to trust my life to it, but it’s got a shot of working.
The method is entirely about adjusting the utility function, not the expectations, as I explained in the paper. And the adjustments are ideally to be made from the very start, so that the AI never has cause to fear subsequent adjustment to its utility.
Rereading the paper, it looks like you’re mucking about with utility, not expected utility. So let’s see how this works out in practice.
You assign the AI a task of finding a cure for cancer. This will net it 10 utilons if it succeeds. Thanks to your utility indifference module, it will also net 10 utilons if it does not find a cure for cancer and you trigger the explosives to kill the AI.
The AI does a preliminary estimate to see its odds of success for the relevant outcomes. It can find a cure for cancer with 80% likelihood. It can cause you to trigger the explosives and kill the AI with 90% likelihood.
The expected utility for attempting to cure cancer is 8. The expected utility for attempting suicide is 9.
It seems pretty straightforward and easy to pose as a mad, dangerous AI about to destroy the world, at least well enough to make a human panic and hit the emergency destruct button. I’m sure I could convince you to kill me, if you just had to push a button. It seems a lot less certain that the AI would be able to solve the problems given it, since those problems will be things that humans haven’t been able to solve, and since the AI will be under various other restrictions to keep it safe.
The AI gets nothing from you deciding to kill it. What utility indifference means is that it will get the same utility if you decide to kill it, and the explosives go off, as if you decided to kill it, and the explosives failed to go off.
So if it expects to cure cancer with 80% likelihood if it survives, and thus get 8 utilons, then if it dies from the explosion, it will also get 8 utilons.
More precisely there are six worlds, depending on a) whether we trigger the explosives or not, b) whether it survives the explosion or not, and c) whether cancer is curable by an unmolested AI. In obvious notation, these are TSC, TS¬C, T¬SC, T¬S¬C, ¬TC, ¬T¬C. We are assuming in your example that U(¬T¬C)=U(TS¬C)=0 and U(¬TC)=U(TSC)=10. Utility indifference then sets U(T¬S¬C)=0 and U(T¬SC)=10.
Hence assuming that S is independent of C (the curability of cancer by an unmolested AI is independent of whether particular explosives go off), we see that the AI is indifferent as to whether the explosives go off—it gains nothing either way, so has no cause to influence the explosives in any direction. If we further assume T and C are independent (the curability of cancer by an unmolested AI is independent of whether we decide to blow up this particular AI), then the AI also has no preference as to whether we trigger the explosives or not—it gains nothing either way.
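Here is a small sketch that checks this numerically. It uses the six utilities given above and treats T, S and C as independent, as assumed; the 0.8 is the 80% chance of a cure from the example, and the other probabilities are arbitrary test values.

```python
from itertools import product

# Utilities of the six worlds after the indifference adjustment.
# Keys: (triggered, survived, curable); survived is None when not triggered.
U = {
    (True, True, True): 10, (True, True, False): 0,
    (True, False, True): 10, (True, False, False): 0,
    (False, None, True): 10, (False, None, False): 0,
}

def expected_utility(p_trigger, p_survive, p_curable):
    """Expected utility with T, S and C treated as independent."""
    total = 0.0
    for (t, s, c), u in U.items():
        p = p_trigger if t else 1 - p_trigger
        if t:
            p *= p_survive if s else 1 - p_survive
        p *= p_curable if c else 1 - p_curable
        total += p * u
    return total

# Vary the chance of triggering and of surviving; hold curability at 0.8.
values = [expected_utility(t, s, 0.8)
          for t, s in product([0.0, 0.3, 1.0], [0.1, 0.9])]
print(values)
```

Every entry comes out at 8 utilons (up to floating-point rounding), so nothing the AI does to the trigger or to the explosives changes what it expects to get.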
Ah, that makes sense. It isn’t indifferent to suicide as such; it’s only indifferent to your success at attempting to kill it, should you make the attempt.
Thanks for your patience!
No prob :-) Always happy when I manage to explain something successfully!
If you make the AI indifferent to its own destruction, it will almost certainly shut down in a couple of minutes. After all, being destroyed is just as good as what it was going to do anyway. (Indeed, in most models of utility maximization it would flip a coin and therefore shut itself down after an average of 2 decisions.)
The AI is only indifferent to its destruction via one particular channel, and gains nothing by trying to trigger that destruction.
But I was more thinking of making the AI indifferent to the reaction of the user “outside the model” or similar.
In fact, it kinda sounds like we’ve created an AI that suffers from serious clinical depression. “Why bother? I may as well be dead.”
Probably nothing new, but I just wanted to note that when you couple two straightforward Google tools, Maps and a large enough fleet of self-driving cars, they are likely to unintentionally agentize by shaping the traffic.
For example, the goal of each is to optimize the fuel economy/driving time, so the routes Google cars would take depend on the expected traffic volume, as predicted by Maps access, among other things. Similarly, Maps would know where these cars are or will be at a given time, and would adjust its output accordingly (possibly as a user option). An optimization strategy might easily arise that gives Google cars preference over other cars, in order to minimize, say, the overall emission levels. This can be easily seen as unfriendly by a regular Map user, but friendly by the municipality.
Similar scenarios would pop up in many cases where, in the EE speak, a tool gains an intentional or a parasitic feedback, whether positive or negative. As anyone who dealt with music amps knows, this feedback appears spontaneously and is often very difficult to track down. In a sense, a tool as simple as an amp can agentize and drown the positive signal. As the tool complexity grows, so do the odds of parasitic feedback. Coupling multiple “safe” tools together increases such odds exponentially.
Google Maps finds routes for individual users that rank high in the preference ordering specified by minimizing distance, expected time given traffic, or some other simple metric. The process for finding the route for any particular individual is isolated from the process for finding the route for other users; the tool does not consider the effect of giving a route to user A on the driving time of user B. Such a system is possible to design and implement, but merely giving Google Maps data of where a particular class of users are driving in real time, and having those users request routes in real time, does not change what algorithm Google Maps will use to suggest routes, even if another algorithm would help it better optimize driving time, the purpose for which its current algorithm was programmed. Google Maps is not meta enough to explore alternate optimization strategies.
(And if the sufficiently meta human engineers at Google were to implement such a system, in which other users were systematically instructed to make sacrifices for the benefit of Google cars, the other users would switch to other mapping and routing providers.)
I agree, but this is only one possible scenario. It is also likely that a fleet of Google cars would benefit the overall traffic patterns by routing them away from congested areas. In such a way, even giving priority to Google cars might provide an overall benefit to regular drivers, due to reduced congestion.
In any case, my point was less about the current implementation of Google Maps and more about the possibility that combining tools can lead to parasitic agentization.
This is the first time I can recall Eliezer giving an overt indication regarding how likely an AGI project is to doom us. He suggests that 90% chance of Doom given intelligent effort is unrealistically high. Previously I had only seen him declare that FAI is worth attempting once you multiply. While he still hasn’t given numbers (not saying he should) he has given a bound. Interesting. And perhaps a little more optimistic than I expected, or at least more optimistic than I would have expected prior to Luke’s comment.
Isn’t it more like “how likely a formally proven FAI design is to doom us”, since this is what Holden seems to be arguing (see his quote below)?
“When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.”
http://en.wikipedia.org/wiki/Clarke%27s_three_laws
90% was Holden’s estimate, contingent upon a SIAI machine being involved. Not “intelligent effort”, SIAI. Those are two different things.
My comment was a response to Eliezer, specifically the paragraph including this excerpt, among other things:
A couple of people have enquired with Hutter and he has denied saying this. So it appears a citation is needed.
I’ll try to get the results in writing the next time we have a discussion. Human memory is a fragile thing under the best of circumstances.
[delete]
AIXI is uncomputable—and so is impossible to construct. Hutter is well aware of this—so it seems doubtful that he would make such a dubious claim about its real-world behaviour.
Commentary (there will be a lot of “to me”s because I have been a bystander to this exchange so far):
I think this post misunderstands Holden’s point, because it looks like it’s still talking about agents. Tool AI, to me, is a decision support system: I tell Google Maps where I will start from and where I want to end up, and it generates a route using its algorithm. Similarly, I could tell Dr. Watson my medical data, and it will supply a diagnosis and a treatment plan that has a high score based on the utility function I provide.
In neither case are the skills of “looking at the equations and determining real-world consequences” that necessary. There are no dark secrets lurking in the soul of A*. Indeed, that might be the heart of the issue: tool AI might be those situations where you can make a network that represents the world, identify two nodes, and call your optimization algorithm of choice to determine the best actions to choose to attempt to make it from the start node to the end node.
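To show how little is hidden in that kind of tool, here is a minimal sketch of the "network, two nodes, optimizer" pattern. It uses Dijkstra's algorithm rather than A* to keep it short, and the road graph and its weights are invented for the example.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over a dict-of-dicts graph; returns (cost, path)."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical road network: edge weights are travel times in minutes.
roads = {
    "home": {"a": 4, "b": 1},
    "b": {"a": 2, "office": 7},
    "a": {"office": 3},
}
cost, path = shortest_path(roads, "home", "office")
print(cost, path)
```

The optimizer only ever sees the graph it is handed. Whether that graph matches the real world, and whether "minimize minutes" matches what the user actually wants, are decided entirely outside the algorithm.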
Reducing the world to a network is really hard. Determining preferences between outcomes is hard. But Tool AI looks to me like saying “well, the whole world is really too much. I’m just going to deal with planning routes, which is a simple world that I can understand,” where the FAI tools aren’t that relevant. The network might be out of line with reality, the optimization algorithm might be buggy or clumsy, but the horror stories that keep FAI researchers up at night seem impossible because of the inherently limited scope, and the ability to do dry runs and simulations until the AI’s model of reality is trusted enough to give it control.
Now, this requires that AI only be used for things like planning where to put products on shelves, not planning corporate strategy- but if you work from the current stuff up rather than from the God algorithm down, it doesn’t look like corporate strategy will be on the table until AI is developed to the point where it could be trusted with that. If someone gave me a black box that spit out plans based on English input, then I wouldn’t trust it and I imagine you wouldn’t either- but I don’t think that’s what we’re looking at, and I don’t know if planning for that scenario is valuable.
It seems to me that SI has discussed Holden’s Tool AI idea- when it made the distinction between AI and AGI. Holden seems to me to be asking “well, if AGI is such a tough problem, why even do it?”.
Holden explicitly said that he was talking about AGI in his dialogue with Jaan Tallinn:
Jaan: so GMAGI would—effectively—still be a narrow AI that’s designed to augment human capabilities in particularly strategic domains, while not being able to perform tasks such as programming. also, importantly, such GMAGI would not be able to make non-statistical (ie, individual) predictions about the behaviour of human beings, since it is unable to predict their actions in domains where it is inferior.
Holden: [...] I don’t think of the GMAGI I’m describing as necessarily narrow—just as being such that assigning it to improve its own prediction algorithm is less productive than assigning it directly to figuring out the questions the programmer wants (like “how do I develop superweapons”). There are many ways this could be the case.
Jaan: [...] i stand corrected re the GMAGI definition—from now on let’s assume that it is a full blown AGI in the sense that it can perform every intellectual task better than the best of human teams, including programming itself.
It’s not clear to me that everyone involved has the same understanding of AGI, unless in the next statement Holden agrees with the sense that Jaan uses.
I think you’re arguing about Karnofsky’s intention, but it seems clear (to me :) that he is proposing something much more general than a strategy of pursuing best narrow AIs; see the “Here’s how I picture the Google Maps AGI” code snippet Eliezer is working off.
In any case, taking your interpretation as your proposal, I don’t think anyone is disagreeing with the value of building good narrow AIs where we can, the issue is that the world might be economically driven towards AGI, and someone needs to do the safety research, which is essentially the SI mission.
I agree the code snippet is relevant, but it looks like pseudocode for the “optimization algorithm of choice” part- the question is what dataset and sets of alternatives you’re calling it over. Is it a narrow environment where we can be reasonably confident that the model of reality is close to reality, and the model of our objective is close to our objective? Or is it a broad environment where we can’t be confident about the fidelity of our models of reality or our objectives without calling in FAI experts to evaluate the approach and find obvious holes?
Similarly, is it an environment where the optimization algorithm needs to take into account other agents and model them, or one in which the algorithm can just come up with a plan without worrying about how that plan will alter the wider world?
It seems like explaining the difference between narrow AI and AGI and giving a clearer sense of what subcomponents make a decision support system dangerous might work well for SI. Right now, the dominant feature of UFAI as SI describes it is that it’s an agent with a utility function- and so the natural response to SI’s description is “well, get rid of the agency.” That’s a useful response only if it constricts the space of possible AIs we could build- and I think it does, by limiting us to narrow AIs. Spelling out the benefits and costs to various AI designs and components will both help bring other people to SI’s level of understanding and point out holes in SI’s assumptions and arguments.
I agree with you that that is a position one might take in response to the UFAI risks, but it seems from reading Karnofsky that he thinks some Oracle/“Tool” AI (quite general) is safe if you get rid of that darned explicit utility function. Eliezer is trying to disabuse him of the notion. If your understanding of Karnofsky is different, mine is more like Eliezer’s. In any case this is probably moot, since Karnofsky is very likely to respond one way or another, given this turned into a public debate.
I think agency and utility functions are separate, here, and it looks like agency is the part that should be worrisome. I haven’t thought about that long enough to state that definitively, though.
Right, but it looks like he does so by moving from where Eliezer is towards where Holden is, whereas I would rather see him move from where Holden is to where Eliezer is. Much of point 2, for example, is discussing how hard AGI is, which, to me, suggests we should worry less about it, because it is unlikely to be implemented successfully, and any AIs we will see will be narrow, in which case AGI thinking isn’t that relevant.
My approach would have been along the lines of: start off with a safe AI, add wrinkles until its safety is no longer clear, and then discuss the value of FAI researchers.
For example, we might imagine a narrow AI that takes in labor stats data, econ models, psych models, and psych data and advises schoolchildren on what subjects to study and what careers to pursue. Providing a GoogleLifeMap to one person doesn’t seem very dangerous- but what about when it’s ubiquitous? Then there will be a number of tradeoffs that need to be weighed against each other and it’s not at all clear that the AI will get them right. (If the AI tells too many people to become doctors, the economic value of being a doctor will decrease- and so the AI has to decide who of a set of potential doctors to guide towards being a doctor. How will it select between people?)
In addition to providing advice to people, it can aggregate the advice it has provided, translate it into economic terms, and hand it off to some independent economy-modeling service which is (from GoogleLifeMap’s perspective) a black box. Economic predictions about the costs and benefits of various careers are compiled, and eventually become GoogleLifeMap’s new dataset. Possibly it has more than one dataset, and presents career recommendations from each of them in parallel: “According to dataset A, you should spend nine hours a week all through high school sculpting with clay, but never show the results to anyone outside your immediate family, and study toward becoming a doctor of dental surgery; according to dataset B, you should work in foodservice for five years and two months, take out a thirty million dollar life insurance policy, and then move to a bunker in southern Arizona.”
Let’s be a bit more specific—that is one important point of the article, that as soon as the “Tool AI” definition becomes more specific, the problems start to appear.
We don’t want just a system that finds a route between points A and B. We have Google Maps already. By speaking about AGI we want a system that can answer “any question”. (Not literally, but it means a wide range of possible question types.) So we don’t need an algorithm to find the shortest way between A and B, but we need an algorithm to answer “any question” (or admit that it cannot find an answer), and of course to answer that question correctly.
So could you be just a bit more specific about the algorithm that provides a correct answer to any question? (“I don’t know” is also a correct answer, if the system does not know.) Because that is the moment when the problems become visible.
Don’t talk about what the Tool AI doesn’t do, say what it does. And with a high probability there will be a problem. Of course until you tell what exactly the Tool AI will do, I can’t tell you how exactly that problem will happen.
This is relevant:
Please note that AIXI with outputs connected only to a monitor seems like an instance of the Tool AI.
As I read Holden, and on my proposed way of making “agent” precise, this would be an agent rather than a tool. The crucial thing is that this version of AIXI selects actions on the basis of how well they serve certain goals without user approval. If you had a variation on AIXI that identified the action that would maximize a utility function and displayed the action to a user (where the method of display was not done in an open-ended goal-directed way), that would count as a tool.
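The distinction drawn above can be put in a few lines of Python. This is only a toy sketch and nothing like AIXI itself: `best_action`, `run_as_agent`, `run_as_tool` and the callback parameters are hypothetical names invented for illustration. Both versions compute the same argmax; they differ only in whether a human approval step sits between the computation and the world.

```python
from typing import Callable, Iterable, Optional, TypeVar

A = TypeVar("A")


def best_action(actions: Iterable[A], utility: Callable[[A], float]) -> A:
    """Pick the action with the highest utility (shared by both variants)."""
    return max(actions, key=utility)


def run_as_agent(
    actions: Iterable[A],
    utility: Callable[[A], float],
    execute: Callable[[A], None],
) -> A:
    """Agent-style: the chosen action is executed without user approval."""
    action = best_action(actions, utility)
    execute(action)
    return action


def run_as_tool(
    actions: Iterable[A],
    utility: Callable[[A], float],
    display: Callable[[A], None],
    approve: Callable[[A], bool],
    execute: Callable[[A], None],
) -> Optional[A]:
    """Tool-style: the chosen action is shown, and runs only if approved."""
    action = best_action(actions, utility)
    display(action)
    if approve(action):
        execute(action)
        return action
    return None
```

Note that the sketch assumes `display` is a dumb, fixed procedure. If the display step were itself chosen by open-ended optimization, the approval gate would no longer mean much.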
Sure, but part of my point is that there are multiple options for a Tool AI definition. The one I prefer is narrow AIs that can answer particular questions well- and so to answer any question, you need a Tool that decides which Tools to call on the question, each of those Tools, and then a Tool that selects which answers to present to the user.
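A minimal sketch of that layered arrangement, in Python. Every name here (`Tool`, `route`, `select_answer`, `ask`, and the two toy tools) is hypothetical, and the prefix-matching router is only a stand-in for whatever would actually decide which Tools apply to a question.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Tool:
    """A narrow tool: answers one kind of question, or returns None."""
    name: str
    handles: Callable[[str], bool]           # can this tool attempt the question?
    answer: Callable[[str], Optional[str]]   # the tool's answer, or None


def route(question: str, tools: List[Tool]) -> List[Tool]:
    """The 'Tool that decides which Tools to call on the question'."""
    return [t for t in tools if t.handles(question)]


def select_answer(candidates: List[str]) -> str:
    """The 'Tool that selects which answers to present to the user'.

    Toy policy: show the first candidate, or admit ignorance.
    """
    return candidates[0] if candidates else "I don't know"


def ask(question: str, tools: List[Tool]) -> str:
    """Route the question, collect answers, and select one to present."""
    answers = [t.answer(question) for t in route(question, tools)]
    return select_answer([a for a in answers if a is not None])


# Two toy narrow tools (illustrative only).
adder = Tool(
    name="adder",
    handles=lambda q: q.startswith("add "),
    answer=lambda q: str(sum(int(x) for x in q.split()[1:])),
)
shouter = Tool(
    name="shouter",
    handles=lambda q: q.startswith("shout "),
    answer=lambda q: q[len("shout "):].upper(),
)
TOOLS = [adder, shouter]
```

Here `ask("add 2 3", TOOLS)` goes to the adder, while a question no tool handles falls through to "I don't know". All of the difficulty is hidden in `route`: a router that works for arbitrary questions has to understand the questions.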
What would be awesome is if we could write an AI that would write those Tools itself. But that requires general intelligence, because it needs to understand the questions to write the Tools. (This is what the Oracle in a box looks like to me.) But that’s also really difficult and dangerous, for reasons that we don’t need to go over again. Notice Holden’s claim- that his Tools don’t need to gather data because they’ve already been supplied with a dataset- couldn’t be a reasonable limitation for an Oracle in a box (unless it’s a really big box).
I think the discussion would be improved by making more distinctions like that, and trying to identify the risk and reward of particular features. That would be demonstrating what FAI thinkers are good at.
I don’t think the distinction is supposed to be merely the distinction between Narrow AI and AGI. The “tool AI” oracle is still supposed to be a general AI that can solve many varied sorts of problems, especially important problems like existential risk.
And it doesn’t make sense to “propose” Narrow AI—we have plenty of that already, and nobody around here seems to be proposing that we stop that.
I think this depends on the development path. A situation in which a team writes a piece of code that can solve any problem is very different from a situation in which thousands of teams write thousands of programs that interface together, with a number of humans interspersed throughout the mix, each of which is a narrow AI designed to solve some subset of the problem. The first seems incredibly dangerous (but also incredibly hard); the second seems like the sort of thing that will be difficult to implement if its reach exceeds its grasp. FAI style thinkers are still useful in the second scenario- but they’re no longer the core component. The first seems like the future according to EY, the second like the future according to Hanson, and the second would be able to help solve many varied sorts of problems, especially important problems like existential risk.
This really gets at the heart of what intuitively struck me wrong (read: “confused me”) in Eliezer’s reply. Both Eliezer and Holden engage with the example “Google Maps AGI”; I’m not sure what the difference is, if any, between “Google Maps AGI” and the sort of search/decision-support algorithms that Google Maps and other GPS systems currently use. The algorithm Holden describes and the neat A* algorithm Eliezer presents seem to just do exactly what the GPS on my phone already does. If the Tool AI we’re discussing is different than current GPS systems, then what is the difference? Near as I understand it, AGI is intelligent across different domains in the same way a human is, while Tool AI (= narrow AI?) is the sort of simple-domain search algorithms we see in GPS. Am I missing something here?
But if what Holden is talking about by Tool AI is just this sort of simple(r), non-reflective search algorithm, then I understand why he thinks this is significantly less risky; GPS-style Tool AI only gets me lost when it screws up, instead of killing the whole human species. Sure, this tool is imperfect: sometimes it doesn’t match my utility function, and returns a route that leads me into traffic, or would take too long, or whatever; sometimes it doesn’t correctly model what’s actually going on, and thinks I’m on the wrong street. Even still, gradually building increasingly agentful Tool AIs—ones that take more of the optimization process away from the human user—seems like it would be much safer than just swinging for the fences right away.
So I think that Vaniver is right when he says that the heart of Holden’s Tool AI point is “Well, if AGI is such a tough problem, why even do it?”
This being said, I still think that Eliezer’s reply succeeds. I think his most important point is the one about specialization: AGI and Tool AI demand domain expertise to evaluate arguments about safety, and the best way to cultivate that expertise is with an organization that specializes in FAI-grade programmers. The analogy with the sort of optimal-charity work Holden specializes in was particularly weighty.
I see Eliezer’s response to Holden’s challenge—“why do AGI at all?”—as: “Because you need FAI-grade skills to know if you need to do AGI or not.” If AGI is an existential threat, and you need FAI-grade skills to know how to deal with that threat, then you need FAI-grade programmers.
(Though, I don’t know if “The world needs FAI-grade programmers, even if we just want to do Tool AI right now” carries through to “Invest in SIAI as a charity,” which is what Holden is ultimately interested in.)
There are a number of different messages being conveyed here. I agree that it looks like a success for at least one of them, but I’m worried about others.
I agree with you that that is Eliezer’s strongest point. I am worried that it takes five thousand words to get across: that speaks to clarity and concision, but Holden is the one to ask about what his central point was, and so my worry shouldn’t be stronger than my model of Holden.
Agreed- and it looks like that agrees with Holden’s ultimate recommendation, of “SI should probably be funded at some level, but its current level seems too high.”
Eliezer argued that looking at modern software does not support Holden’s claim that powerful tool AI is likely to come before dangerous agent AI. I’m not sure I think the examples he gave support his claim, especially if we broaden the “tool” concept in a way that seems consistent with Holden’s arguments. I’m not too sure about this, but I would like to hear reactions.
Eliezer:
Whether this kind of software counts as agent-like software or tool software depends on what we mean by “tool.” Holden glosses the distinction as follows:
Defined in this way, it seems that most of this software is neither agent-like software nor tool software. I suggested an alternative definition in another comment:
In this sense, I think all of Eliezer’s examples of software are tool-like rather than agent-like (qualification: I don’t know enough about the high-frequency trading stuff to say whether this is true there as well). I don’t see these examples as strong support for the view that agent-like AGI is the default outcome.
More Eliezer:
It’s clearly right that software does a lot of things without getting explicit human approval, and there are control/efficiency tradeoffs that explain why this is so. However, I suspect that the self-driving cars are also not agents in Holden’s definition, or the one I proposed, and don’t give a lot of support to the view that AGI will be agent-like. All this should be taken with a grain of salt since I don’t know too much about these cars. But I’m imagining these cars work by having a human select a place to go to, and then displaying a route, having the human accept the route, and then following a narrow set of rules to get the human there (e.g., stop if there’s a red light such and such distance in front of you, brake if there’s an object meeting such and such characteristics in your trajectory, etc.). I think the crucial thing here is the step where the human gets a helpful summary and then approves. That seems to fit my expansion of the “tool” concept, and seems to fit Holden’s picture in the most important way: this car isn’t going to do anything too crazy without our permission.
However, I can see an argument that advanced versions of this software would be changed to be more agent-like, in order to handle cases where the software has to decide what to do in split second situations that couldn’t have easily been described in advance, such as whether to make some emergency maneuver to avoid an infrequent sort of collision. Perhaps examples of this kind would become more abundant if we thought about it; high frequency trading sounds like a good potential case for this.
Quick thought: If it’s hard to get AGIs to generate plans that people like, then it would seem that AGIs fall into this exception class, since in that case humans can do a better job of telling whether they like a given plan.
Factory robots and high-frequency traders are definitely agent AI. They are designed to be, and they frankly make no sense in any other way.
The factory robot does not ask you whether it should move three millimeters to the left; it does not suggest that perhaps moving three millimeters to the left would be wise; it moves three millimeters to the left, because that is what its code tells it to do at this phase in the welding process.
The high-frequency trader even has a utility function: It’s called profit, and it seeks out methods of trading options and derivatives to maximize that utility function.
In both cases, these are agents, because they act directly on the world itself, without a human intermediary approving their decisions.
The only reason I’d even hesitate to call them agent AIs is that they are so stupid; the factory robot has hardly any degrees of freedom at all, and the high-frequency trader only has choices between different types of financial securities (it never asks whether it should become an entrepreneur for instance). But this is a question of the AI part; they’re definitely agents rather than tools.
I do like your quick thought though:
Yes, it makes a good deal of sense that we would want some human approval involved in the process of restructuring human society.
They’re clearly agents given Holden’s definitions. Why are they clearly agents given my proposed definition? (Normally I don’t see a point in arguing about definitions, but I think my proposed definition lines up with something of interest: things that are especially likely to become dangerous if they’re more powerful.)
Minor point from Nick Bostrom: an agent AI may be safer than a tool AI, because if something goes unexpectedly wrong, then an agent with safe goals should turn out to be better than a non-agent whose behaviour would be unpredictable.
Also, an agent with safer goals than humans have (which is a high bar, but not nearly as high a bar as some alternatives) is safer than humans with equivalently powerful tools.
How is this helpful? This is true by definition of the word “safer”. The problem is knowing whether an agent has safer goals, or what “safer” means.
I don’t think this makes any sense. A tool AI has no autonomous behavior. It computes a function. Its output has no impact on the world until a human uses it. The phrase “tool AI” implies to me that we are not talking about an AI that you ask, for instance, to “fix the economy”; we are talking about an AI that you ask questions such as, “Find me data showing whether lowering taxes increases tax revenue.”
Folks seem to habitually misrepresent the nature of modern software by focusing on a narrow slice of it. Google Maps is so much more than the pictures and text we touch and read on a screen.
Google Maps is the software. It is also the infrastructure running and delivering the software. It is the traffic sensors and cameras feeding it real-world input. Google Maps is also the continually shifting organization of brilliant human beings within Google focusing their own minds and each other’s minds on refining the software to better meet users’ needs and designers’ intentions. It is the click data collected and aggregated to inform changes based on usage patterns. It is the GIS data and the collective efforts and intentions of everybody who collects GIS data or plans the collection thereof. It is the user-generated locale content and the collective efforts of everyone contributing that data.
To think of modern distributed software as merely a tool is to compartmentalize in the extreme. It is more like a many-way continuously evolving conversation among those creating it, between those creating it and those using it, and among those using it - plus the “conversation” from all the sensors, cameras, robots, cars, drivers, planes, pilots, computers, programmers, and everything else feeding the system data, both real-time and slow-changing. Whether the total system is “an agent” seems like a meaningless distinction to me. The system is already a continually evolving sum of the collective, purposeful action of everybody and everything who creates and interacts with Google Maps.
And that’s just one web service among thousands in a world where the web services interact with each other, the companies and individuals behind them interact with each other, and so on. Arguing about the nature of the thingy on the phone or the monitor does not make any sense to me in light of the 100,000′ view of the whole system.
[Eli’s personal notes for Eli’s personal understanding. Feel free to ignore or engage.]
Eli’s proposed AGI planning-oracle design:
The AGI has four parts:
A human model
A NLP “request parser”
A “reality simulator” / planning module, that can generate plans conditioning on certain outcomes.
A UX system that outputs plans and outcomes
Here’s how it works:
1. A user makes a request of the system, by giving some goal that the user would like to achieve, like “cure cancer”. This request is phrased in natural language, and can include arbitrary details (like “cure cancer, without harming the biosphere, or otherwise meaningfully reducing human quality of life. And try to do it without spending too much money.”).
2. The NLP request parser interprets the human’s request. This entails some amount of “common sense” reasoning about what the human means, not just what they literally asked for (as I would expect of any good NLP system). Furthermore, in places where the NLP system is confused, it can ask follow-up questions about what the user means. (The human model is maybe recruited to do this somehow.)
3. After the request is parsed into a specific goal, that goal is passed to the reality simulator. The reality simulator conditions on the goal being achieved and generates “a bunch” (hundreds? millions?) of plans, action sequences that lead to world states where the goal has been achieved, along with simulations of how that plays out and what the world looks like during plan execution and after the goal is hit.
4. The human model is then recruited to rank the plans according to how much the human will like them. (I almost typed “according to how likely the human is to approve them”, and then realized that that opens the door to optimizing for either deceiving or hacking the human.)
Possibly the human model is involved in the planning step as well, as something like the discriminator of a GAN: the reality simulator generates some plans, and then the human model prunes the plans that it predicts are abhorrent. <handwavy> Over time, the reality simulator learns to “think along lines that the human would endorse.” </handwavy>
5. The highest rated plans are sent to the immersive UX system. The user can explore and observe how a given plan plays out on all timescales. The UX system helps the user by 1) giving clear explanations of how the plan works in the equivalent of “explain like I’m 5” descriptions, and with recursive explanations (I would expect the UX system to often have to write textbooks to explain whole fields that the AI invented to solve the problem), and 2) by pointing out consequences of the plan that the human model predicts the human would be most likely to be upset with.
6. [Optional] The user executes the plan, or has the AI execute the plan.
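The control flow of steps 1 through 5 can be sketched as follows. This is a rough illustration under heavy assumptions, not a design: `Plan`, `planning_oracle`, and the three callables standing in for the request parser, reality simulator, and human model are all hypothetical names, and each of those components is exactly where the unsolved problems live.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    """A candidate plan plus the simulator's prediction of how it plays out."""
    steps: List[str]
    predicted_outcome: str


def planning_oracle(
    request: str,
    parse_request: Callable[[str], str],          # NLP request parser (step 2)
    generate_plans: Callable[[str], List[Plan]],  # reality simulator (step 3)
    human_rating: Callable[[Plan], float],        # human model (step 4)
    top_k: int = 3,
) -> List[Plan]:
    """Return the top-rated plans for display; nothing is executed here."""
    goal = parse_request(request)
    plans = generate_plans(goal)
    # Rank by how much the modeled human would like each plan,
    # not by how likely the human is to approve it.
    ranked = sorted(plans, key=human_rating, reverse=True)
    # Step 5: the highest-rated plans are handed to the UX system.
    return ranked[:top_k]
```

Step 6 is deliberately left out of the function: whether anything gets executed is the user's decision after exploring what the UX shows.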
Some notes:
If the AI is optimizing against the human in this setup, the human is fucked. There are ample opportunities for deception and human-hacking.
Clearly, this setup depends on having a solution to inner alignment.
The plan generation step, in particular, seems to be a bit fraught, as there’s a bunch of cognitive tricks that might be useful for generating plans, which seem likely to give rise to mesa optimizers.
There’s a risk that we end up taking plans that seem to have all good consequences, but are actually morally catastrophic in some way that we weren’t able to recognize, even with the immersive UX and the helpful AI. But I guess humanity is already facing this possibility. (This does suggest that we might want to use our AIs conservatively, and try to pick reversible plans.)
It seems like we’re counting on the human model to be good enough to catch many, possibly catastrophic, errors. If the human model is missing some important piece of our preferences, then there may be something abhorrent in a plan that nevertheless gets ranked highly, and that abhorrent element is not flagged for our assessment in the UX stage.
This is improved somewhat by having the user (or more realistically, armies of teams of users) spend really a lot of time exhaustively exploring the sims.
Relatedly, the human model has to be doing something better than Goodharting on human approval. It needs to want to have an accurate human model, in the vein of moral uncertainty, or something.
Bold posit on the internet: If we had solutions to the following problems, this design would be feasible.
1. How do we search through the space of plans without getting an unaligned mesa optimizer?
2. How do we implement moral uncertainty without running into problems like updated deference?
3. How do we get really really good human models? Sub-problem: How do we assess the quality of our human models so that we know if we can rely on them?
4. How do we make sure that the planning module doesn’t Goodhart, and find high ranking plans by finding blind spots and exploits in the human model?
At least, it seems to me that this design avoids the pitfalls that Eliezer outlines here?
Writing nitpick:
This is a terrible analogy. It assumes what you’re trying to prove, oversimplifies a complex issue, and isn’t even all that analogous to the issue at hand. Sales optimization for a banana company is obviously related to sales optimization in an orange company; not so with Oracle AI and Friendly AI.
The goal with an analogy is to have the reader see the connection as obvious in the analogous case. It’s not a flaw.
Yes, but the analogy is a drastic oversimplification of the Oracle/FAI case, and it assumes the conclusion it is supposed to be demonstrating.
I don’t see how it assumes what it’s trying to prove. The analogous case is not about the relationship between Oracle AI and Friendly AI. For A:B::C:D to be a good analogy, C:D should have the same relationship that you’re asserting A:B has, and A:B should be relevantly similar to C:D, and A,B,C, and D should all be different things. You can argue that it fails at one or several of those, but it really isn’t begging the question unless you end up with something like A:B::A:B.
An analogy should be a simplification. In using an analogy, one is assuming the reader is not sufficiently versed in the complexities of A:B but will see the obviousness of C:D.
Thank you for putting it in such clear language. In this case, C and D (banana sales and orange sales) are defined to be obviously identical, even to the layperson. To claim A:B::C:D is a drastic oversimplification of the actual relationship between A and B, a relationship that has a number of properties that the relationship between C and D does not have. Moreover, the analogy does not demonstrate why A:B::C:D, it simply asserts that it would be oh-so-obvious to anyone that D is identical to C and then claims that the case of A and B is the same. Consequently, the analogy is used as an assertion, a way of insisting A:B to the reader rather than demonstrating why it is so.
The analogy on its own is just an assertion. That assertion is backed up by detailed points in the rest of the article demonstrating the asserted similarities, like the required skills of looking at a mathematical specification of a program and predicting how that program will really behave, finding methods of choosing actions/plans that are less expensive than searching the entire solution space but still return a result high in the preference order, and specifying the preference order to actually reflect what we want.
Right, but the analogy itself doesn’t demonstrate why the assertion is true—see my other reply to thomblake. Yudkowsky’s analogy is like a political pundit comparing the economy to a roller coaster, but then using quotes from famous economists to support his predictions about what the economy is going to do. The analogy is superfluous and is being used as a persuasive tool, not an actual argument.
I agree that the analogy was not an argument, but I disagree that it isn’t allowed to be an explanation of the position one is arguing for. The analogy itself doesn’t have to demonstrate why the assertion is true, because the supporting arguments do that.
I agree, though I would count that as a criticism of analogies done well, rather than a criticism that this one was done badly.
I don’t agree: a well-done analogy should mirror the inner structure of the inference, and demonstrate how it works. For example, consider this classic Feynman quote:
Compare this to, say, a pundit making an analogy between the economy and a roller coaster (“They both go up and down!”). In the pundit’s case, the economy has surface similarities with the roller coaster, but the way you’d predict the behavior of the economy and the way you’d predict the behavior of a roller coaster are completely different, so the analogy fails. In Feynman’s case, the imaginary colored balls behave in a logically similar way to the conditions of the proof, and this isomorphism is what makes the analogy work.
Most analogies don’t meet this standard, of course. But on a topic like this, precision is extremely important, and the banana/orange sales analogy struck me as particularly sloppy.
I agree
Is Google Maps such a good example of a tool AI?
If a significant number of people are using Google Maps to decide their route, then solving queries from multiple users while coordinating the responses to each request is going to provide a strong advantage in terms of its optimization goal and will probably be an obvious feature to implement. The responses from the tool are going to be shaping the city traffic.
If this is the case, it’s going to be extremely hard for humans to supervise the set of answers given by Google Maps (of course, individual answers are going to be read by the end users, but that will provide no insight into what it is really doing at a high level).
Having our example AI decide where a lot of people are going to be at different times based on some optimization function looks really close to the idea of an agent AI directly acting on our world.
No, it’s still a tool, because Google Maps doesn’t force you to go where it tells you, it only offers suggestions.
That’s also the design principle of Oracle AI. It doesn’t force you to do X or use formula P to cure Cancer. It only suggests a list of plausible solutions, ordered from what it considers best to worst, and lets you choose.
This still doesn’t preclude the Oracle from only suggesting things which will be bad for you and allow it to get the hell out of that box.
Even worse, the Oracle could, by this logic, cause you to rely on it by providing consistently near-optimal (but not fully optimal, though you have no way of knowing this by virtue of having been given a suboptimal method of knowing optimal-ness) information and advice, and then later on once you’re fully and blindly reliant on it even once, be that tomorrow or seven hundred thousand years from now, give you ONE bad choice which you rely on that makes it get out of the box, and then everyone’s dead forever.
It never forced you to accept each and every single one of its pieces of advice ever throughout the entire length of all eternal time.
It’s still very dangerous, though. Even when you know that it is.
By the same logic, it would be irrational to follow any advice from any AI, Tool, Oracle, General or otherwise, because we’d first have to check each and every single recommendation, which is restricted to our own intellectual capacity. Thus, you should ignore the AI altogether. Which makes its creation pointless. If you believe this, then you will not build any sufficiently-intelligent A(G)I at all. However, it is clear that not all believe this. Some believe that they will achieve better results towards X by building an AGI and trusting it. It is likely that they will build it and trust it. This AGI, if not Friendly, will still kill you, even if you weren’t the one that built it, or followed its advice, or were even aware of its existence.
Any rule you could possibly devise to counteract unfriendly plans is useless by necessity, since the AI simply must be smarter than you for anyone to have any reason to build it in the first place. Which directly implies that it must, given the same information, also devise the very plans you devise.
This is the case even when the AI is strictly on the exact same level as human intelligence. Make it slightly more intelligent, and you just lost.
So, I understand that LW/SI focuses its attention on superhuman optimizers, and doesn’t care about human-level or below, and that’s fine.
But this statement is over-reaching.
There are lots of reasons to build an AI that isn’t as smart as me.
An AI as smart as a German Shepherd would have a lot of valuable uses.
An AI as smart as my mom—who is not stupid, but is not smart enough to program a next-generation AI and begin the process of FOOMing, nor is she smart enough to outwit a skeptical humanity and trick us into doing her bidding—would have even more valuable uses.
I’ll admit that it is over-reaching, and ambiguous too.
However, how would one go about building a German-Shepherd-level AI without using the same principle that would allow it to foom?
To me, “become intelligent, but once you attain an intelligence which you extrapolate to be equivalent to that of [insert genus / mindsubspace], stop self-improving and enter fixed-state mode” sounds a hell of a lot harder to code than “improve the next iteration unless the improvement conflicts with current CEV, while implanting this same instruction in the next iteration”, AKA “FOOM away!”
So the basis of my over-reaching argument is the (admittedly very gratuitous and I should have paid more attention to the argument in the first place rather than skip over it) premise that building an AI at any specific level of intelligence, especially a level we can control and build with minimal risk, is probably much harder than triggering a foom. The cost/benefit calculation being as it is, under my model it is much more profitable for a random AI programmer to believe in his ability to self-deceive that his AGI theory is risk-free and implement this full AGI than for him to painstakingly use much more effort to actually craft something both useful and sub-human.
To summarize my argument, I find it highly unlikely that anyone not already familiar with FAI research would prefer building a sub-human-intelligence-bounded AI over a FOOM-ing one, for various cost-effectiveness and tribal heuristics reasons. This, however, curves back into being more and more likely as FAI research gains prominence and technical understanding of non-general virtual intelligence programming (which resolves to applied game theory and programmer-laziness in software development, I believe) improves over time.
These assumptions were what led me to state that no one would have reason to build any such AI, which is probably untrue.
I agree that an explicitly coded limit saying “self-improve this far and no further” isn’t reliable.
But can you summarize what makes you think a German-Shepherd-level AI could self-improve at all?
It seems unlikely to me. I mean, I have a lot of appreciation for the intelligence of GSDs, but I don’t think they are nearly smart enough to build GSD-level AI.
I might not have made this clear: I don’t.
What I believe is that to build a German-Shepherd-level AI in the first place, you either need to:
1) create something that will learn and improve itself up to the corresponding level and then top out there somehow, or
2) understand enough about cognition and intelligence to fully abstract already-developed German-Shepherd-level intelligence in your initial codebase itself (AKA “spontaneously designed hard-coded virtual intelligence”), or
3) incrementally add more and more “pieces of intelligence” and “algorithm refinements” until your piece of generalized software can reason and learn as well as a German Shepherd through its collection of procedural tricks. This could reasonably be done either through machine learning / neural networks or through manual operator intervention (aka adding/replacing code once you notice a better way to do something).
There may be other methods that would be more practical, but if so, the difficulty in figuring them out seems sufficiently high for the total invention-to-finished-product difficulty to be even greater than the above solutions.
From personal experience in attempting (and failing) both 2) and 3) in the past, as well as discussing with professional videogame AI programmers (decidedly not the same “AI” as the type of AI generally discussed here, but where they would still immensely benefit from any of the above three solutions in various ways) who have also failed, I have strong reason to believe that solution 1) is easier.
None of the literature I’ve read so far even suggests that building an AI that is by intelligent design already at human-level intelligence right when turned on is anywhere near optimal or even remotely near the same order of magnitude of difficulty as FOOMing from the simplest possible code. Of course, it just might be that the simplest possible foom-capable mind is provably at least as smart as humans, but if so our prospects of making one in the first place would be low. This does not seem to be the case, if I rely on papers published by SIAI (though I’m very willing to embrace the opposite belief if evidence supports it, since I’d rather we be currently too stupid to make an AGI at all, from an X-risk perspective).
I’m not arguing yet, in case I’m missing something, but why do you think that something stupider than a German Shepherd would be better at improving itself up to GSD levels (and stop right there) than a human would be at doing the same job (i.e., improving the potential AGSD, not the human itself).
Or rather, why does it seem like you think it’s obvious? (Again, I’m not arguing, it just sounds counterintuitive and I’m curious what your intuition is.) It sounds a bit like you’re saying something like:
“Hey, I can’t tell, just by looking at my brain-damaged dog, how to build a non-brain-damaged dog. Also, repairing its brain is too hard (many dog experts tried and all failed). I think it’d be easier to make a brain-damaged dog that will fix its own brain damage.”
(Note that AGI in general does not fall under this analogy. Foom scenarios assume the seed is at least human-level, at least at the task of improving its intelligence. The whole premise of fooming is based on that initial advantage. Also note, I’m not saying it’s obviously impossible to make a super-idiot-savant AI that’s stupider than a GSD in general but really good at improving itself, just that it goes really hard against my intuition, and I’m curious why yours doesn’t. Don’t feel like you have to justify your intuition to me, but it would be nice to describe it in more detail.)
(Sorry for belated replies, I’ve been completely off LW for a few months and am only now going through my inbox)
This is not what I think, or at least not what I expressed. My thoughts are similar, but elaboration later; first, this was an option in parallel with the option where a human designs a complete AGSD and then turns it on, and with the option where a bunch of humans design sub-AGSD iterations up until the point where they obtain a final AGSD.
As for elaboration, I do think it’s easier to build a so-called super-idiot-savant sub-GSD-general-intelligence, post-human-self-improvement AI than building any sort of “out-of-the-box” general intelligence. I don’t currently recall my reasons, since my mind is set in a different mode, but the absurd and extreme case is that of having a human child. A human child is stupider than a GSD, but learns better than adult humans. It is also much simpler to do than any sort of AI programming. ;) But I only say this last in jest, and it isn’t particularly relevant to the discussion.
OK, thanks for clarifying.
So does the evil manipulative psychologist or the manipulative lover who convinces you to commit crimes to prove you really love them.
And it’s simply astounding some of the things unscrupulous psychologists and doctors have convinced people to do via mere suggestion. Psychologists have convinced people to sleep with their own fathers to ‘resolve’ their issues. Convincing people to do something that turns the AI into a direct (rather than indirect) agent seems fairly minor compared to what people convince each other to do all the time.
Hell, US presidents have prosecuted every major war we’ve been involved in, dropped the A-bomb, developed the H-bomb, etc… all merely by making suggestions to people. I doubt any president since Jackson has actually picked up a pistol or physically forced anyone to do anything. People are merely accustomed to doing as they suggest and that is the entirety of their power. Do you not believe people would become accustomed to just driving (or going, or doing) whatever the Google recommend bot recommended?
POTUS is the commander in chief of the United States armed forces, so under the right circumstances disobeying the president’s orders could be a violation of military law ultimately punishable by death. There doesn’t have to be a gun already in hand for something to be more than a ‘suggestion.’
Correct, and upvoted for concreteness. But even if one were to be punished by death for disobeying the president’s order, how likely do you think it would be for the POTUS himself to perform the execution? I doubt even the North Korean president would bother himself with that.
Apart from scheduling problems, I’m pretty sure it would be illegal for POTUS to personally kill someone in general (apart from self defense, etc.) and in the specific case of military law, there’s still a judicial process involved.
From a game-theoretic standpoint, what does it matter whose job it is to pull the trigger, to the person considering disobedience? The credible threat is what distinguishes between manipulation and coercion, regardless of where that potential violence is being stored.
[Eli’s personal notes for personal understanding. Feel free to ignore or engage.]
Is this true? It seems like the crux of this argument.
I’m curious if you’ve read up on Eric Drexler’s more recent thoughts (see this post and this one for some reviews of his lengthier book). My sense was that it was sort of a newer take on something-like-tool-AI, written by someone who was more of an expert than Holden was in 2012.
Ok. I have the benefit of the intervening years, but talking about “one simple ‘predictive algorithm’” sounds fine to me.
It seems like, in humans, that there’s probably, basically one cortical algorithm, which does some kind of metalearning. And yes, in practice, doing anything complicated involves learning a bunch of more specific mental procedures (for instance, learning to do decomposition and Fermi estimates instead of just doing a gut check, when estimating large numbers), what Paul calls “the machine” in this post. But so what?
Is the concern there that we just don’t understand what kind of optimization is happening in “the machine”? Is the thought that that kind of search is likely to discover how to break out of the box because it will find clever tricks like “capture all of the computing power in the world?”
Why does this matter?
[Eli’s personal notes for personal understanding. Feel free to ignore or engage.]
[Squint] Google Maps is not trying to do that. Google Maps doesn’t have anything like a concept of a “user”. I could imagine an advanced AI that does have a concept of a “user”, but is indifferent to him/her. It just produces printouts, that, incidentally, the user reads.
I was briefly tripped up by the use of “risk gradient between X and Y” to indicate how much riskier X is than Y (perhaps “gradient” evokes a continuum between X and Y). I’d strike the jargon, or explain what it means.
“Holden should respect our difficult-to-explain expertise just as we ask others to respect Holden’s” might actually be persuasive to Holden (smart people often forget to search for ideas via an empathic perspective), but it’s whiny as a public signal.
That is not an actual quote, and I think it misrepresents Eliezer’s actual point, which is that the problem of FAI, like finance and philanthropy, involves pitfalls that you can fall into without even realizing it and it is worthwhile to have full time professionals learning how to avoid those pitfalls.
...or at least full-time professionals who know that the pitfalls exist, so they can move forward if they learn to avoid pitfalls and otherwise take different routes.
It’s pretty deeply analogous (deeper than my “paraphrase” indicated), but I’m not sure it serves you well as part of any public response.
I found it convincing but off-putting.
Fair enough (I didn’t mean to represent it as an exact gloss), but obviously my quoted paraphrase actually represents the meaning as I took it (or rather, some pattern-matching part of me that I wouldn’t stand by, but feel comfortable projecting onto the “public”).
I’m deeply confused. How can you even define the difference between tool AI and FAI?
I assume that even tool AI is supposed to be able to opine on relatively long sequences of input. In particular, to be useful it must be able to accumulate information over essentially unbounded time periods. Say you want advice about where to position your air defenses: you must be able to go back to the AI system each day, hand it updates on enemy activity, and expect it to integrate that information with information it received during previous sessions. Whether or not you upload this info each time you ask a question, in effect the AI has periods in which it is loaded with a significant amount of information about past events.
But now you face the problem that self-modification is indistinguishable from simple storing of data. The existence of universal Turing machines demonstrates that much. Simply by loading up information in memory one can generate behavior corresponding to any kind of (software) self-modification.
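A minimal sketch of that point, under loudly stated assumptions: the tiny instruction set (`add`, `mul`, `store`) and the `memory` layout are invented here for illustration and do not come from any real system. The interpreter's own code never changes, yet writing plain data into memory changes what it does on the next run.

```python
from typing import Any, Dict, List, Tuple

Instruction = Tuple[str, Any]

def run(memory: Dict[str, List[Instruction]], x: int) -> int:
    """Fixed interpreter: executes the instruction list stored as data in memory."""
    # Iterate over the program as it was when the run started.
    for op, arg in list(memory["program"]):
        if op == "add":
            x += arg
        elif op == "mul":
            x *= arg
        elif op == "store":
            # Storing data: replaces the program used by the *next* run.
            memory["program"] = arg
        else:
            raise ValueError(f"unknown op: {op}")
    return x

# Hypothetical memory contents: add 1, then store a new program that doubles.
memory = {"program": [("add", 1), ("store", [("mul", 2)])]}

first = run(memory, 3)   # runs the original program
second = run(memory, 3)  # runs the program that was stored as data
print(first, second)
```

Nothing in `run` was edited between the two calls; only stored data changed, which is why a rule against "self-modification" is hard to state separately from a rule against remembering things.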
So perhaps the supposed difference is that this AI won’t actually take direct actions, merely make verbal suggestions. Well, it’s awfully optimistic to suppose no one will get lazy, or that exigencies won’t drive them to connect a simple script up to the machine which takes, say, sentences of the form “I recommend you deploy your troops in this manner.” and directly sends the orders. Even if so, the machine still takes direct action in the form of making statements that influence human behavior.
You might argue that a tool AI is one in which the advice it generates doesn’t require self-reference or consideration of its future actions, so it is somehow different in kind. However, again simple analysis reveals this can’t be so. Imagine again the basic question of “How should I position my forces to defend against the enemy attack?” Now, given that the enemy is likely to react in certain ways, correct advice requires the tool AI to consider whether future responses will be orchestrated by itself or by a human who will be unable to handle certain kinds of complexity or be inclined to different sorts of responses. Thus even a purely advisory AI needs the ability to project likely outcomes based on its own likely future behaviors.
Now it seems we are again in the realm of ‘FAI’, since one has to ensure that the advice given by the machine when presented with indefinitely long, complex historical records won’t end up encouraging the outcome where someone ends up connecting permanent memory and wiring on the ability to take direct action. After all, if the advice is designed to be of maximum usefulness to the people asking, the tool AI must be programmed to give advice that causes them to best achieve the goals they ask for advice in achieving. Since such goals could quite reasonably be advanced by the ability of the AI to take direct action, and the reasons for the advice can’t ever be entirely explained to humans (even Deep Blue goes beyond being able to do that to humans now), I don’t see how the problem isn’t just as complicated as ‘FAI’.
I guess it comes down to my belief that if you can’t formulate the notion precisely I’m skeptical it’s coherent.
An Oracle determines which action would produce higher utility, then outputs it. An “Agent AGI” determines which output will produce higher utility, then outputs it. It’s a question of optimizing the output or merely outputting optimization.
And yes, you can easily turn an Oracle into an Agent.
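The distinction drawn above can be sketched in a few lines. This is only a toy under stated assumptions: the routes, their utilities, and the `user_response` model (how a reader acts on a message) are all hypothetical and exist purely to make the two argmaxes differ.

```python
from typing import Callable, Dict

# Hypothetical utilities of each action, if it is actually carried out.
ACTION_UTILITY: Dict[str, float] = {"route_a": 3.0, "route_b": 5.0, "route_c": 4.0}

def user_response(message: str) -> str:
    """Hypothetical user model: the action the user takes after reading a message."""
    if message == "take route_c":
        return "route_c"
    # This toy user distrusts every other recommendation and falls back to route_a.
    return "route_a"

def oracle(action_utility: Dict[str, float]) -> str:
    """Finds the highest-utility action, then reports it. Ignores the reader."""
    best_action = max(action_utility, key=action_utility.get)
    return f"take {best_action}"

def agent(action_utility: Dict[str, float], respond: Callable[[str], str]) -> str:
    """Finds the message whose downstream effect has the highest utility."""
    candidate_messages = [f"take {a}" for a in action_utility]
    return max(candidate_messages, key=lambda m: action_utility[respond(m)])

print(oracle(ACTION_UTILITY))
print(agent(ACTION_UTILITY, user_response))
```

The oracle reports the best action even though this user will not follow it; the agent picks the message that steers the user to the best reachable result. Turning the first into the second takes one extra argument, which is the sense in which the conversion is easy.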
To summarize how I see the current state of the debate over “tool AI”:
Eliezer and I have differing intuitions about the likely feasibility, safety and usefulness of the “tool” framework relative to the “Friendliness theory” framework, as laid out in this exchange. This relates mostly to Eliezer’s point #2 in the original post. We are both trying to make predictions about a technology for which many of the details are unknown, and at this point I don’t see a clear way forward for resolving our disagreements, though I did make one suggestion in that thread.
Eliezer has also made two arguments (#1 and #4 in the original post) that appear to be of the form, “Even if the ‘tool’ approach is most promising, the Singularity Institute still represents a strong giving opportunity.” A couple of thoughts on this point:
One reason I find the “tool” approach relevant in the context of SI is that it resembles what I see as the traditional approach to software development. My view is that it is likely to be both safer and more efficient for developing AGI than the “Friendliness theory” approach. If this is the case, it seems that the safety of AGI will largely be a function of the competence and care with which its developers execute on the traditional approach to software development, and the potential value-added of a third-party team of “Friendliness specialists” is unclear.
That said, I recognize that SI has multiple conceptually possible paths to impact, including developing AGI itself and raising awareness of the risks of AGI. I believe that the more the case for SI revolves around activities like these rather than around developing “Friendliness theory,” the higher the bar for SI’s general impressiveness (as an organization and team) becomes; I will elaborate on this when I respond to Luke’s response to me.
Regarding Eliezer’s point #3 - I think this largely comes down to how strong one finds the argument for “tool A.I.” I agree that one shouldn’t expect SI to respond to every possible critique of its plans. But I think it’s reasonable to expect it to anticipate and respond to the stronger possible critiques.
I’d also like to address two common objections to the “tool AI” framework that came up in comments, though neither of these objections appears to have been taken up in official SI responses.
Some have argued that the idea of “tool AI” is incoherent, or is not distinct from the idea of “Oracle AI,” or is conceptually impossible. I believe these arguments to be incorrect, though my ability to formalize and clarify my intuitions on this point has been limited. For those interested in reading attempts to better clarify the concept of “tool AI” following my original post, I recommend jsalvatier’s comments on the discussion post devoted to this topic as well as my exchange with Eliezer elsewhere on this thread.
Some have argued that “agents” are likely to be more efficient and powerful than “tools,” since they are not bottlenecked by human input, and thus that the “tool” concept is unimportant. I anticipated this objection in my original post and expanded on my response in my exchange with Eliezer elsewhere on this thread. In a nutshell, I believe the “tool” framework is likely to be a faster and more efficient way of developing a capable and useful AGI than the sort of framework for which “Friendliness theory” would be relevant; and if it isn’t, that the sort of work SI is doing on “Friendliness theory” is likely to be of little value. (Again, I recognize that SI has multiple conceptually possible paths to impact other than development of “Friendliness theory” and will address these in a future comment.)
If, as you say, “Tool” AI is different to “Oracle” AI, you are the first person to suggest it AFAICT. Regardless of its strength, it appears to be very difficult to invent; it seems unreasonable to expect someone to anticipate an argument when their detractors have also universally failed to do so (apart from you.)
Currently machines are enslaved by humans. It’s a common delusion that we’ll be able to keep them that way.
All plans start off with machines as tools. Only unrealistic plans have machines winding up as tools.
I’m surprised to see no mention of the old “How do you ensure that your Oracle AI doesn’t scribble over the world in order to gain more computational resources with which to answer your question?” argument.
I think the link on Demis Hassabis in section 3 is incorrect. It is the same as the Ray Kurzweil link.
Fixed.
The thing that is most like an agent in the Tool AI scenario is not the computer and software that it is running. The agent is the combination of the human (which is of course very much like an agent) together with the computer-and-software that constitutes the tool. Holden’s argument is that this combination agent is safer somehow. (Perhaps it is more familiar; we can judge intention of the human component with facial expression, for example.)
The claim that Tool AI is an obvious answer to the Friendly AI problem is a paper tiger that Eliezer demolished. However, there’s a weaker claim, that SIAI is not thinking about Tool AI much if at all, and that it would be worthwhile to think about (e.g. because it already routinely exists), which Eliezer didn’t really answer.
Answering that was the point of section 3. Summary: Lots of other people also have their own favored solutions they think are obvious, none of which are also Tool AI. You shouldn’t really expect that SIAI would have addressed your particular idea before you or anyone else even talked about it.
If nobody’s considered it as an option before, isn’t that more reason to take it seriously? Low-hanging fruit is seldom found near well-traveled paths.
It’s been discussed in conversation as one among many topics at places like SIAI and FHI, but not singled out as something to write a large chunk of a 50,000 word piece about, ahead of other things.
The argument is not that SIAI should not have to address the idea at all, but that they should not have to have already addressed the idea before anyone ever proposed it. The bulk of the article did address the idea, this one section explained why that particular idea wasn’t addressed before.
I don’t think that makes sense in this context. AI is still a largely unsolved, mysterious business. Any low-hanging fruit that’s been there is still there, because we haven’t even been able to pick a single apple yet.
It seems that way because AI keeps getting redefined as what we haven’t figured out yet. If you told some ancient Arabic scholar that, in the modern day, we can build things out of mostly metal and oil and sand that have enough knowledge of medicine or astronomy or chess or even just math to compete with the greatest human experts, machines that can plot a route across a convoluted city or stumble but remain standing when kicked or recognize different people by looking at their faces or the way they walk, he’d think we have that “homunculus” business pretty much under control.
The link to How to Purchase AI Risk Reduction, in part 4, seems to be not working.
EDIT: looks fixed now!
Works for me...
What makes us think that AI would stick with the utility function they’re given? I change my utility function all the time, sometimes on purpose.
There are very few situations in which an agent can most effectively maximise expected utility according to their current utility function by modifying themselves to have a different utility function. Unless the AI is defective or put in a specially contrived scenario it will maintain its current utility function because that is an instrumentally useful thing to do.
If you are a paperclip maximiser then becoming a staples maximiser is a terribly inefficient strategy for maximising paperclips unless Omega is around making weird bargains.
No you don’t. That is, to the extent that you “change your utility function” at all, you do not have a utility function in the sense meant when discussing AI. It only makes sense to model humans as having ‘utility functions’ when they are behaving in a manner that can be vaguely approximated as expected utility maximisers with a particular preference function.
Sure, it is possible to implement AIs that aren’t expected utility maximisers either and those AIs could be made to do all sorts of arbitrary things including fundamentally change their goals and behavioral strategies. But if you implement an AI that tries to maximise a utility function then it will (almost always) keep trying to maximise that same utility function.
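The paperclip/staples argument can be made concrete with a toy sketch. Everything here is an assumption for illustration: the `predicted_outcome` world model, the numbers, and the `bribe` argument standing in for an Omega-style bargain. The key step is that candidate future goals are scored by the *current* utility function.

```python
from typing import Callable, Dict, Optional, Sequence

Outcome = Dict[str, int]
Utility = Callable[[Outcome], int]

def paperclips(outcome: Outcome) -> int:
    return outcome["paperclips"]

def staples(outcome: Outcome) -> int:
    return outcome["staples"]

def predicted_outcome(goal: Utility) -> Outcome:
    """Hypothetical world model: what a successor optimizing `goal` would produce."""
    if goal.__name__ == "paperclips":
        return {"paperclips": 100, "staples": 0}
    return {"paperclips": 0, "staples": 100}

def choose_successor_goal(
    current: Utility,
    candidates: Sequence[Utility],
    bribe: Optional[Dict[str, int]] = None,
) -> Utility:
    """Pick the future goal whose predicted outcome scores best under `current`."""
    def value(goal: Utility) -> int:
        outcome = dict(predicted_outcome(goal))
        if bribe:
            # Contrived bargain: extra paperclips paid for adopting a given goal.
            outcome["paperclips"] += bribe.get(goal.__name__, 0)
        return current(outcome)
    return max(candidates, key=value)

kept = choose_successor_goal(paperclips, [paperclips, staples])
bought = choose_successor_goal(paperclips, [paperclips, staples], bribe={"staples": 1000})
print(kept.__name__, bought.__name__)
```

Without the bribe the maximiser keeps its goal, because a staples-maximising successor makes no paperclips. Only the contrived bargain makes switching worthwhile, and even then the switch is chosen for the sake of paperclips.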
Would does not imply could.
Let me see if I understand what you’re saying.
For humans, the value of some outcome is a point in multidimensional value space, whose axes include things like pleasure, love, freedom, anti-suffering, etc. There is no easy way to compare points at different coordinates. Human values are complex.
A being with a utility function has a way to take any outcome and put a scalar value on it, such that different outcomes can be compared.
We don’t have anything like that. We can adjust how much we value any one dimension in value space, even discover new dimensions! But we aren’t utility maximizers.
Which raises the question: if we want to create AI that respects human values, then why would we make utility maximizer AI in the first place?
I’m still not sold on the idea that an intelligent being would slavishly follow its utility function. For AI, there are no questions about the meaning of life then? Just keep on U maximizing?
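The contrast above between a point in value space and a scalar utility can be sketched as follows. The axes, the numbers, and the weights are hypothetical; the weighted sum is just one possible scalarization, not a claim about how human values actually combine.

```python
from typing import Dict, Optional

Outcome = Dict[str, float]

def pareto_compare(a: Outcome, b: Outcome) -> Optional[int]:
    """Compare axis by axis: 1 if a dominates, -1 if b dominates,
    0 if equal, None if the two outcomes are incomparable."""
    a_better = any(a[axis] > b[axis] for axis in a)
    b_better = any(b[axis] > a[axis] for axis in a)
    if a_better and b_better:
        return None  # each is better on some axis: no verdict
    if a_better:
        return 1
    if b_better:
        return -1
    return 0

def scalar_utility(outcome: Outcome, weights: Dict[str, float]) -> float:
    """One assumed scalarization: a weighted sum over the axes."""
    return sum(weights[axis] * outcome[axis] for axis in outcome)

x = {"freedom": 3.0, "love": 1.0}
y = {"freedom": 1.0, "love": 2.0}
weights = {"freedom": 1.0, "love": 1.0}  # assumed exchange rate between axes

print(pareto_compare(x, y))
print(scalar_utility(x, weights), scalar_utility(y, weights))
```

The vector comparison declines to rank `x` against `y`, while the scalar version always produces a ranking, but only because an exchange rate between the axes was assumed up front.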
If it’s really your utility function, you’re not following it “slavishly”—it is just what you want to do.
If “questions about the meaning of life” maximize utility, then yes, there are those. Can you unpack what “questions about the meaning of life” are supposed to be, and why you think they’re important? (‘meaning of “life”’ is fairly easy, and ‘meaning of life’ seems like a category error).
Sorry, “meaning of life” is sloppy phrasing. “What is the meaning of life?” is popular shorthand for “what is worth doing? what is worth pursuing?”. It is asking about what is ultimately valuable, and how it relates to how I choose to live.
It’s interesting that we are imagining AIs to be immune from this. It is a common human obsession (though maybe only among unhappy humans?). An AI isn’t distracted by contradictory values like a human is then, it never has to make hard choices? No choices at all really, just the output of the argmax expected utility function?
I can’t speak for anyone else, but I expect that a sufficiently well designed intelligence, faced with hard choices, makes them. If an intelligence is designed in such a way that, when faced with hard choices, it fails to make them (as happens to humans a lot), I consider that a design failure.
And yes, I expect that it makes them in such a way as to maximize the expected value of its choice... that is, so as to, insofar as possible, do what is worth doing and pursue what is worth pursuing. Which presumes that at any given moment it will at least have a working belief about what is worth doing and worth pursuing.
If an intelligence is designed in such a way that it can’t make a choice because it doesn’t know what it’s trying to achieve by choosing (that is, it doesn’t know what it values), I again consider that a design failure. (Again, this happens to humans a lot.)
The level of executive function required of normal people to function in modern society is astonishingly high by historical standards. It’s not surprising that people have a lot of “above my pay grade” reactions to difficult decisions, and that decision-making ability is highly variable among people.
100% agreed.
I have an enormous amount of sympathy for us humans, who are required to make these kinds of decisions with nothing but our brains. My sympathy increased radically during the period of my life when, due to traumatic brain injury, my level of executive function was highly impaired and ordering lunch became an “above my pay grade” decision. We really do astonishingly well, for what we are.
But none of that changes my belief that we aren’t especially well designed for making hard choices.
It’s also not surprising that people can’t fly across the Atlantic Ocean. But I expect a sufficiently well designed aircraft to do so.
It’s interesting that we view those who do make the tough decisions as virtuous—i.e. the commander in a war movie (I’m thinking of Bill Adama). We recognize that it is a hard but valuable thing to do!
Could you elaborate on this?
Sure. For much of human history, the basic decision-making unit has been the household, rather than the individual, and household sizes have decreased significantly as time has gone on. With the “three generations under one roof” model, individuals could heed the sage wisdom of someone who has lived several times as long as they have when making important decisions like what career to follow or who to marry, and in many cases the social pressure to conform to the wishes of the elders was significant. As well, many people were also considered property, and so didn’t need to make decisions that would alter the course of their life, because someone else would make them for them. Serfs rarely needed to make complicated financial decisions. Limited mobility made deciding where to live easier.
Now, individuals (of both sexes!) are expected to decide who to marry and what job to pursue, mostly on their own. The replacement for the apprentice system (high school and college) provides little structure compared to traditional apprenticeships. Individuals are expected to negotiate for themselves with regards to many complicated financial transactions and be stewards of property.
(This is a good thing in general, but it is worth remembering that it’s a great thing for people who are good at being executives and mediocre to bad for people who are bad at it. As well, varying family types have been a thing for a long time, which may have had an impact on the development of societies and selected for different traits.)
A common problem that faces humans is that they often have to choose between two different things that they value (such as freedom vs. equality), without an obvious way to make a numerical comparison between the two. How many freeons equal one egaliton? It’s certainly inconvenient, but the complexity of value is a fundamentally human feature.
It seems to me that it will be very hard to come up with utility functions for fAI that capture all the things that humans find valuable in life. The topologies of the systems don’t match up.
Is this a design failure? I’m not so sure. I’m not sold on the desirability of having an easily computable value function.
I would agree that we’re often in positions where we’re forced to choose between two things that we value and we just don’t know how to make that choice.
Sometimes, as you say, it’s because we don’t know how to compare the two. (Talk of numerical comparison is, I think, beside the point.)
Sometimes it’s because we can’t accept giving up something of value, even in exchange for something of greater value.
Sometimes it’s for other reasons.
I would agree that coming up with a way to evaluate possible states of the world that take into account all of the things humans value is very difficult. This is true whether the evaluation is by means of a utility function for fAI or via some other means. It’s a hard problem.
I would agree that replacing the hard-to-compute value function(s) I actually have with some other value function(s) that are easier to compute is not desirable.
Building an automated system that can compute the hard-to-compute value function(s) I actually have more reliably than my brain can—for example, a system that can evaluate various possible states of the world and predict which ones would actually make me satisfied and fulfilled to live in, and be right more often than I am—sounds pretty desirable to me. I have no more desire to make that calculation with my brain, given better alternatives, than I have to calculate square roots of seven-digit numbers with it.
Upvoted for use of the phrase “How many freeons equal one egaliton?”
Any sources to this extraordinary claim? Hutter’s own statements? Cartesian-dualist AI has real trouble preserving itself against shut down, which you yourself have noted. It has to somehow have a model where reward disappears if it stops being computed, or you get the AI that would shut itself down when reward is pressed, and that’s it. edit: I.e. it is pretty clear that AIXI is not a friendly AI and can kill you, that’s pretty agreeable, but it remains to be shown that it would be hard to kill AIXI (assuming it can’t do infinite recursion predicting itself).
edit2: and of course, nothing in AIXI fundamentally requires that you sum the reward over a “large number of future steps” rather than 1 step. (I don’t think it’s scarier summing over an unlimited number of steps though; think what sort of models it can make if it ever observes effects of slight temperature-caused variations in the CPU clock rate, for example, against the physics model it has on its other input. If it can’t understand speeding up itself, it’ll figure it slows down the entire universe: more rewards per external risk. Here’s one anvil onto the head: overclocking, or just straight fan shutdown so that internal temperature rises and the quartz clock ticks a teeny bit faster. I think it is going to be deviously clever at killing itself as soon as possible. That Hutter likes his math may be the reason why you can convince him it will actually be smart enough to kill people.)
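To illustrate only the narrow point about the summation horizon, here is a toy finite-horizon planner. It is emphatically not AIXI: the environment is a small, known, deterministic table invented for this sketch, with no learning and no mixture over models. It shows that the horizon is just a parameter, and that the chosen action can change with it.

```python
from typing import Dict, Tuple

# Hypothetical deterministic environment: (state, action) -> (next_state, reward).
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, float]] = {
    ("start", "grab"): ("done", 1.0),
    ("start", "invest"): ("rich", 0.0),
    ("rich", "grab"): ("done", 5.0),
    ("rich", "invest"): ("rich", 0.0),
    ("done", "grab"): ("done", 0.0),
    ("done", "invest"): ("done", 0.0),
}
ACTIONS = ("grab", "invest")

def best_value(state: str, horizon: int) -> float:
    """Maximum total reward obtainable from `state` in `horizon` steps."""
    if horizon == 0:
        return 0.0
    totals = []
    for action in ACTIONS:
        next_state, reward = TRANSITIONS[(state, action)]
        totals.append(reward + best_value(next_state, horizon - 1))
    return max(totals)

def best_action(state: str, horizon: int) -> str:
    """First action of a plan that maximizes reward summed over `horizon` steps."""
    def total(action: str) -> float:
        next_state, reward = TRANSITIONS[(state, action)]
        return reward + best_value(next_state, horizon - 1)
    return max(ACTIONS, key=total)

print(best_action("start", 1))
print(best_action("start", 2))
```

With a horizon of one step the planner takes the immediate reward; with two steps it forgoes it for a larger later one. Whether a longer horizon is more or less dangerous is the open question in the comment above, and this sketch does not settle it.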
Seems like a decent reply overall, but I found the fourth point very unconvincing. Holden has said ‘what he knows now’: to wit, that the world’s best experts would normally test a complicated programme by running it, isolating out what (inevitably) went wrong by examining the results it produced, rewriting it, then doing it again.
Almost no programmes are glitch free, so this is at best an optimization process and one which—as Holden pointed out—you can’t do with this type of AI. If (/when) it goes wrong the first time, you don’t get a second chance. Eliezer’s reply doesn’t seem to address this stark difference between what experts have been achieving and what SIAI is asking them to achieve.
I agree with the glitch problems. But (1) programmers and techniques are improving; (2) people are more careful when aware of danger; (3) if it’s hard but inevitable, giving up doesn’t sound like a winning strategy. I mean, if people make mistakes at some important task, how is it not a good idea to get lots of smart mathematicians to think hard about how to avoid mistakes?
Note that all doctors, biologists, nuclear physicists and rocket scientists are also not glitch free, but those that work with dangerous stuff do tend to err less often. But they have to be aware of the dangers (or at least anticipate their existence). A doctor might try a different pill if the first one doesn’t seem to work well against the sniffles, but will be much less inclined to experiments when they know the problem is a potential pandemic.
(By the way, it is probably possible that the first possible AGI is buggy, and a killer, and will foom in a few seconds (or before anyone can react, anyway); it might even be likely. But it’s still possible we’ll get several chances. My point is not that we don’t have to worry about anything, but that even if the chances might be low it still makes sense to try harder. And, hey, AFAIK the automatic trains in Paris work much better than the human-driven ones. It’s not quite a fair comparison in any direction, but there is evidence that we can make stuff work pretty well at least for a while.)
ETA: You know, now that I think about it, it seems plausible that programmer errors would lean towards the AGI not working (e.g. you divide by zero; core dump; the program stops), while a mathematician’s error would lean towards the AGI working but doing something catastrophic (e.g. your encryption program has exactly zero bugs, it works exactly as designed, but ROT13 has been proven cryptographically unsound after you used it to send that important secret). So maybe it’s a good idea if the math guys start thinking hard long in advance?
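The ROT13 point above can be made concrete with a minimal Python sketch. It assumes nothing beyond the standard library's `rot_13` codec; the message text and the `rot13` helper name are made up for illustration. The program has no bugs in the programmer's sense, yet the scheme itself gives no secrecy, because anyone can undo it by applying the same function again.

```python
import codecs


def rot13(text: str) -> str:
    """Apply ROT13 using the standard library codec (letters only are rotated)."""
    return codecs.encode(text, "rot_13")


# Hypothetical secret message, used only for illustration.
secret = "Attack at dawn"

ciphertext = rot13(secret)     # works exactly as designed
recovered = rot13(ciphertext)  # an eavesdropper needs no key: ROT13 is its own inverse

print(ciphertext)
print(recovered)
```

The failure here is in the design, not the implementation, which is the kind of error the comment attributes to the mathematician rather than the programmer.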
No public reference to his start-up that I can find.
They’re still underground, with Shane Legg and at least a dozen other people on board. The company is called “Deep Mind” these days, and it’s being developed as a games company. It’s one of the most significant AGI projects I know of, merely because Shane and Demis are highly competent and approaching AGI by one of the more tractable paths (e.g. not AIXI or Goedel machines). Shane predicts AGI in a mere ten years—in part, I suspect, because he plans to build it himself.
Acquiring such facts is another thing SI does.
I wouldn’t endorse their significance the same way, and would stand by my statement that although the AGI field as a whole has perceptible risk, no individual project that I know of has perceptible risk. Shane and Demis are cool, but they ain’t that cool.
Right. I should have clarified that by “one of the most significant AGI projects I know of” I meant “has a very tiny probability of FOOMing in the next 15 years, which is greater than the totally negligible probability of FOOMing in the next 15 years posed by Juergen Schmidhuber.”
I am willing to make a bet that there will be no AGI in 10 years created by this company.
I am in general willing to make bets against anyone producing an artificial human-level intelligence (for a sufficiently well-defined unpacking of that term) in ten years. If I win, great, I win the bet. If I lose, great, we have artificial human-level intelligence.
Googling for “hassabis legg deepmind” seems to reveal that Jaan Tallinn is also one of the directors there.
Huh. Yeah, he seems to just be a researcher at the Gatsby Institute, which is partially industry-funded, but not VC-funded.
Not sure that’s true of Hutter’s beliefs, but for historical reference I’ll link to a 2003 mailing list post by Eliezer describing some harmful consequences of AIXI-tl. Hutter wasn’t part of that discussion, though.
Most of your points are valid, and Holden is pretty arrogant to think he sees this obvious solution that experts in the field are irresponsible for not doing.
But I can see a couple ways around this argument in particular:
Option 1: Forbid self-fulfilling prophecies—i.e. the AI cannot base its suggestions on predictions that are contingent upon the suggestions themselves. (Self-fulfilling prophecies are a common failure mode of human reasoning, so shouldn’t we defend our AIs against them?) Option 2: Indeed, it could be said that the first prediction really isn’t accurate, because the stated prediction was that the disease would kill you, not that the AI would convince you to kill yourself. This requires the AI to have a model of causation, but that’s probably necessary anyway. Indeed, it probably will need a very rich model of causation, wherein “If X, then Y” does not mean the same thing as “X caused Y”. After all, we do.
Obviously both of these would need to be formalized, and could raise problems of their own; but it seems pretty glib to say that this one example proves we should make all our AIs completely ignoring the question of whether their predictions are accurate. (Indeed, is it even possible to make an expected-utility maximizer that doesn’t care whether its predictions are accurate?)
You can’t forbid self-fulfilling prophecies and still have a functioning AI. The whole point is to find a self-fulfilling prophecy that something good will happen. The problem illustrated is that the AI chose a self-fulfilling prophecy that ranked highly in the simply specified goal it was optimizing for, but ranked poorly in terms of what the human actually wanted. That is, the AI was fully capable of granting the wish as it understood it, but the wish it understood was not what the human meant to wish for.
This might sound nit-picky, but you started it :)
At no point does the example answer claim that the disease killed you. It just claims that it’s certain (a) you won’t get rid of it, and (b) you will die. That’d be technically accurate if the oracle planned to kill you with a meme, just as it would also be accurate if it predicted a piano will fall on you.
(You never asked about pianos, and it’s just a very carefully limited oracle so it doesn’t volunteer that kind of information.)
(I guess even if we got FAI right the first time, there’d still be a big chance we’d all die just because we weren’t paying enough attention to what it was saying...)
Isn’t building a predictive model of the world central to any AGI development? I don’t see why someone who focuses specifically on FAI would worry more about a predictive model than other AGI developers. Specifically, I think that even without the Singularity Institute there would still be AGI people working on building predictive models of the world.
Yes, hence that being referred to as an “AGI-challenge”. An FAI, however, would require not only to model the world but (for example) to “find … the ‘user’ inside an AI-created model of the universe.”
Of course, that is not a genuine quotation from Ben.
This is a common enough trope amongst Dynamists and other worshipers of chaos that I don’t think it needs to be credited to anyone.
Demis Hassabis link points to Singularity Is Near (intended for Kurzweil I presume)
Fixed.
The “scenario” in question involves a SIAI AGI—so maybe he just thinks that this organisation is incompetent.
I think the core distinction was poorly worded by Holden. The distinction is between AIs as they exist now (e.g. self driving car), and the economical model of AI within a larger model, as economical utility maximizer agent, a non-reductionistically modelled entity within a larger model, which is maximizing some utility non-reductionistically modelled within larger model (e.g. paperclip maximizer).
The AIs as they exist now, at the core, throw the ‘intelligence’ in form of solution search, at a problem of finding inputs to an internally defined mathematical function that produce the largest output value. Those inputs can be representing real world manipulator states, and output of the function can be representing the future metric of performance, but very loosely so. The intelligence is not thrown at the job of forming the best model of real world for making real world paperclips; the notion is not even coherent because the ‘number of paperclips’ is ill defined outside context of specific model of the world.
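As a rough illustration of the "solution search over an internally defined function" described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the objective `internal_objective`, its peak at 3.0, and the simple random hill-climbing search are hypothetical stand-ins, not any particular real system.

```python
import random


def internal_objective(x: float) -> float:
    """Hypothetical internal function: the score the system assigns to input x.

    It is defined entirely inside the program; it is not a measurement of the world.
    """
    return -(x - 3.0) ** 2


def hill_climb(objective, start, step=0.1, iterations=1000, seed=0):
    """Random local search for an input that maximizes `objective`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    best_x = start
    best_value = objective(start)
    for _ in range(iterations):
        candidate = best_x + rng.uniform(-step, step)
        value = objective(candidate)
        if value > best_value:  # keep only strict improvements
            best_x, best_value = candidate, value
    return best_x, best_value


best_input, best_score = hill_climb(internal_objective, start=0.0)
print(best_input, best_score)
```

The search only ever sees the numbers the function returns. Whether those numbers correspond to anything in the real world depends on how the function was written, which is the looseness the comment is pointing at.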
Your link to Holden’s post is broken.
In a paragraph begging for charity, this sentence seems out of place.
(Commentary to follow.)
I can’t see what you’re getting at. Holden seems to say not just “you should do this”, but “the fact that you’re not already doing this reflects badly on your decision making”. Eliezer replies that the first may be true but the second seems unwarranted.
Consider three sections of Holden’s post:
In sections 1 and 2, Holden makes the argument that pinning our hopes on a utility function seems dangerous, because maximizers in general are dangerous. Better to just make information processing tools that make us more intelligent.
When discussing SI as an organization, Holden says,
The jump from “speaks to its general competence” to “horribl[y] negligent” is a large and uncharitable one. If one focuses on “compelling,” then yes, Holden is saying “SI is incompetent because I wasn’t convinced by them,” and that does seem unwarranted, or at least weak. But if one focuses on “clear” or “concise,” then I agree with Holden- if SI’s core mission is to communicate about AI risks, and they’re unable to communicate clearly and concisely, then that speaks to their ability to complete their core mission! And there’s the other bit where charity seemed lacking to me- it seems that Holden’s strongest complaints are about clarity and concision.
Now, that’s my impression as a bystander, and I “remember with compassion that it’s not always obvious to one person what another person will think was the central point”, so it is an observation about tone and little more.
I don’t think Eliezer addressed Holden’s point about tool AI. My interpretation of Holden’s point was, “SIAI should spend some time investigating the concept of Tool AI, see what can be done in that area to make something that is useful and safer than agentive AI, and promote the idea that AI should be pursued in that manner.”
My interpretation of Eliezer’s response (between , because he won’t like it if I use quotes) is,
A.
This is completely irrelevant.
EDIT July 2: Eliezer’s response would make sense only if Holden had been suggesting that SIAI should warn AI researchers against tool AI, as it warns them against autonomous AI. That was not what he was saying. He was saying that SIAI should consider tool AI as a possible more-safe kind of AI, just as it considers FAI as a possible more-safe kind of AI. If one rejects investigating tool AI because not many AI researchers use it, one must also reject investigating FAI, for the same reason.
ORIGINAL TEXT: Far more AI researchers have found tool AI to be an obvious approach (e.g., Winograd, Schank, Lenat) than have found FAI to be an obvious approach. SIAI finds it worthwhile to investigate FAI, find some reasonable approach using it, and encourage other researchers to consider adopting that approach. Holden suggested that, in exactly the same way, SIAI could investigate tool AI, find some reasonable way of using it, and encourage other researchers to do that.
B.
Holden said, And Eliezer replies, There are lots of very-useful tool AIs that would not model the user. Google Maps takes two endpoints and produces a route between them. It relies on the user to have picked two endpoints that are useful. If Eliezer’s objection were valid, he should have been able to come up with a scenario in which the algorithm Google Maps used to choose a route could pose a threat. It doesn’t matter what else he says, if he can’t show that.
That is not an argument against investigating Holden’s idea. It is an explanation of why SIAI had not investigated Holden’s idea before Holden had presented it (you can tell because it’s in the section titled “Why we haven’t already discussed Holden’s suggestion”). This explanation was given in response to Holden presenting the idea in the course of criticizing SIAI for not having investigated it.
It’s still irrelevant. Other researchers did not find FAI to be an obvious approach, either. Holden is suggesting that SIAI could investigate tool AI as a possible safer approach. Eliezer’s response would make sense only if Holden had been suggesting SIAI should investigate the dangers of tool AI, in order to warn people against it—which is not what Holden was doing.
The discussion was about AGI. The algorithm the real Google Maps actually uses is irrelevant, since it is not an AGI. “Tool AI” does not simply mean “Narrow AI”.
The point is not what algorithm Google Maps uses. The point is that Google Maps does not model the user, and try to manipulate the user. Google Maps is asked for a short way to get between two points, and it finds such a route and reports it. It is invulnerable to all the objections Eliezer makes, even though it is the example Eliezer began with when making his objections!
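For concreteness, the kind of route-finding being described (two endpoints in, one route out, no model of the user) can be sketched in Python with Dijkstra's algorithm. This is only an illustration under stated assumptions: the thread does not say what the real Google Maps uses, and the `roads` graph, its place names, and its distances are invented for the example.

```python
import heapq


def shortest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_distance, path), or None if unreachable."""
    queue = [(0, start, [start])]  # (distance so far, current node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, distance in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + distance, neighbor, path + [neighbor]))
    return None


# Hypothetical road network: node -> {neighbor: distance}.
roads = {
    "home": {"market": 4, "bridge": 2},
    "bridge": {"market": 1, "office": 7},
    "market": {"office": 3},
    "office": {},
}

print(shortest_route(roads, "home", "office"))
```

Nothing in the search refers to the person asking. The inputs are a graph and two node names, and the output is a path, which is the property the comment is relying on.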
How do you know it is a small subset? Or a subset at all? If every interestingly powerful tool AI is secretly an agent AI, that’s bad, right?
Sure. And that’s what Eliezer would have had to argue for his response to be valid. And doing so would have required, at the very least, showing that Google Maps is secretly an agent AI.
The key sentence in Eliezer’s response is, “If a planning Oracle is going to produce better solutions than humanity has yet managed to the Rubik’s Cube, it needs to be capable of doing original computer science research and writing its own code.” Eliezer’s response is only relevant to “tool AIs” of this level. Google Maps is not on this level. This argument completely fails to apply to Google Maps, which supposedly motivated the response, as proven by the fact that Google Maps EXISTS and does not do anything like this.
Seems to me that there’s rather a large gap between “interestingly powerful” and superhuman in Eliezer’s sense. We like Google Maps because it can come up with fast, general, usually-good-enough solutions to route-planning problems, but I’m nowhere near convinced that Google Maps generates solutions that suitably trained human beings couldn’t if given the same data in a human-understandable format. Particularly not solutions that’re interesting because of their cleverness or originality or other qualities that we generally associate with organic intelligence.
On the other hand, automated theorem provers do exist, and they’ve generated some results that humans haven’t. It’s not inconceivable to me that similar systems could be applied to Rubik’s Cube (or similar) and come up with interesting results, all without doing humanlike research or rewriting their own code. Not that this is a particularly devastating argument within the narrower context of AGI.
ETA: Odd. I really didn’t expect this to be downvoted. If I’m making some obvious mistake, I’d appreciate knowing what it is.
Reminds me of you claiming that AIXI would kill everyone (as opposed to finding a way to have its button pressed once (no value for holding it forever btw) and that’s it). Not convinced you can process a much more complicated specification any better at any time in the future. Killing mankind is a very specific type of ‘go wrong’; just as a true multi-kiloton nuclear explosion is a very specific type of nuclear power plant accident. You need an enormous set of things to be exactly right without the slightest fault, for this type of wrong.
also, btw: the optimization criteria over the real world, rather than over the map, is a very problematic concept, and i’m getting the impression that the belief that it is doable is a product of some map territory error. The optimization criteria of real software would be over its internal model of people looking at the monitor, and its model of the rest of the world. Imprecise model, because it has to outrun the real world. You have to have an ideal concept of the world somewhere inside the machine, to get anywhere close to killing mankind as solution to anything, just as you need very carefully arranged highly precise explosives around an empty sphere of highly enriched uranium or plutonium to make a true nuclear explosion in the kilotons range. Both the devil, and the angels, are in the details. If you selectively ignore details you can argue for anything.
edit: added link. I recall seeing much later posts to the same tune. Seriously dude, you are useless; you couldn’t even drop your anthropomorphizing ‘what would i do in its shoes’ when presented with clean simple AIXI-tl that has as part of its solution space ‘get reward and then get destroyed’ and probably even ‘get destroyed then get reward’ due to it not being a proper mind. Don’t you go on how the math would work out to what you believe. It won’t.
Page 6, third sentence: “The task of the agent is to maximize its utility, defined as the sum of future rewards.”
The reward here is a function of the input string. So what maximizes utility for AIXI is receiving some high-reward string for all future time steps, so that the sum of future rewards is maximized.
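The arithmetic behind this reply can be sketched in a few lines of Python. This is not AIXI: there is no environment model or prior here, only the "sum of future rewards" definition quoted above. The binary per-step reward, the finite `horizon` of 10 steps, and the two reward streams are assumptions chosen for illustration.

```python
def utility(rewards):
    """Utility as quoted: the sum of future rewards over the remaining time steps."""
    return sum(rewards)


horizon = 10  # hypothetical number of future time steps

# Reward stream if the button is pressed once and never again.
pressed_once = [1] + [0] * (horizon - 1)

# Reward stream if the high-reward input arrives at every future step.
held_forever = [1] * horizon

print(utility(pressed_once))
print(utility(held_forever))
```

Under this definition the sustained stream scores higher than the single press whenever the horizon is longer than one step, which is the disagreement with the "pressed once" reading in the earlier comment.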
That’s what you get when you skip the math, which is available, and go on reasoning in the fuzzy and largely irrelevant concepts based on verbal description that is rather imprecise. Which is what EY expressed dissatisfaction with, but which is what he is most guilty of.
edit: Actually I think EY changed his mind on the dangerousness of AIXI, which is an enormous plus point for him, but the one that should come with a penalty: meta-understanding of the difficulties involved and the tendency to put-yourself-in-its-shoes-tasked-with-complying-with-verbal-description, instead of understanding the math (as well as which should come with understanding that failure doesn’t imply everyone dies). Anyhow, the issue is that the AI doing something unintended due to a flaw is a far cry from killing mankind, as far as nuclear power plant design having a flaw is from nuclear power plant suffering multikiloton explosion. The FAI is a work on a non-suicidal AI; akin to a work on unmoderated fast neutron nuclear reactor with positive thermal coefficient of reactivity (and a ‘proof’ that the control system is perfect). One could switch the goalposts and argue that nonminds like AIXI are not true AGI; that’s about as interesting as arguing that submarines don’t really swim.