Tiling Agents for Self-Modifying AI (OPFAI #2)
An early draft of publication #2 in the Open Problems in Friendly AI series is now available: Tiling Agents for Self-Modifying AI, and the Löbian Obstacle. ~20,000 words, aimed at mathematicians or the highly mathematically literate. The research reported on was conducted by Yudkowsky and Herreshoff, substantially refined at the November 2012 MIRI Workshop with Mihaly Barasz and Paul Christiano, and refined further at the April 2013 MIRI Workshop.
Abstract:
We model self-modification in AI by introducing ‘tiling’ agents whose decision systems will approve the construction of highly similar agents, creating a repeating pattern (including similarity of the offspring’s goals). Constructing a formalism in the most straightforward way produces a Gödelian difficulty, the Löbian obstacle. By technical methods we demonstrate the possibility of avoiding this obstacle, but the underlying puzzles of rational coherence are thus only partially addressed. We extend the formalism to partially unknown deterministic environments, and show a very crude extension to probabilistic environments and expected utility; but the problem of finding a fundamental decision criterion for self-modifying probabilistic agents remains open.
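For readers new to the topic, the Löbian obstacle turns on Löb's theorem; the following informal sketch (my notation, not the paper's) states the theorem and why it bites for tiling:

```latex
% L\"ob's theorem: for a theory $T$ extending Peano Arithmetic, with
% provability predicate $\Box_T$,
\[
  T \vdash (\Box_T \varphi \rightarrow \varphi)
  \quad\Longrightarrow\quad
  T \vdash \varphi .
\]
% Hence $T$ cannot prove the soundness schema $\Box_T \varphi \rightarrow \varphi$
% for any $\varphi$ it does not already prove outright. An agent reasoning
% in $T$ therefore cannot, in general, approve a successor merely because
% the successor proves (in $T$) that its actions are safe.
```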
Commenting here is the preferred venue for discussion of the paper. This is an early draft and has not been reviewed, so it may contain mathematical errors, and reporting of these will be much appreciated.
The overall agenda of the paper is to introduce the conceptual notion of a self-reproducing decision pattern, one which includes reproduction of the goal or utility function, by exposing a particular possible problem with a tiling logical decision pattern and presenting some partial technical solutions. This then makes it conceptually much clearer to point out the even deeper problems of “We can’t yet describe a probabilistic way to do this because of non-monotonicity” and “We don’t have a good bounded way to do this because maximization is impossible, satisficing is too weak, and Schmidhuber’s swapping criterion is underspecified.” The paper uses first-order logic (FOL) because FOL has a lot of useful standard machinery for reflection which we can then invoke; in real life, FOL is of course a poor representational fit to most real-world environments outside a human-constructed computer chip with thermodynamically expensive crisp variable states.
As further background, the idea that something-like-proof might be relevant to Friendly AI is not about achieving some chimera of absolute safety-feeling, but rather about the idea that the total probability of catastrophic failure should not have a significant conditionally independent component on each self-modification, and that self-modification will (at least in initial stages) take place within the highly deterministic environment of a computer chip. This means that statistical testing methods (e.g. an evolutionary algorithm’s evaluation of average fitness on a set of test problems) are not suitable for self-modifications which can potentially induce catastrophic failure (e.g. of parts of code that can affect the representation or interpretation of the goals). Mathematical proofs have the property that they are as strong as their axioms and have no significant conditionally independent per-step failure probability if their axioms are semantically true, which suggests that something like mathematical reasoning may be appropriate for certain particular types of self-modification during some developmental stages.
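To put a toy number on “no significant conditionally independent per-step failure probability”: if each self-modification independently risked catastrophe, even a tiny per-step probability would compound over many modifications. A minimal sketch, with figures purely illustrative and not from the paper:

```python
# Probability of at least one catastrophic failure over n steps, if each
# step carries an independent failure probability p: 1 - (1 - p)**n.
p = 1e-6          # per-modification failure probability (illustrative)
n = 10**6         # number of self-modifications (illustrative)
prob_any_failure = 1 - (1 - p) ** n
# For p * n = 1 this comes to about 1 - 1/e, i.e. roughly 63% -- which is
# why per-step failure probabilities must not be conditionally independent.
```

This is the sense in which statistical testing (bounding p empirically) cannot substitute for proof-like guarantees over a long chain of self-modifications.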
Thus the content of the paper is very far off from how a realistic AI would work, but conversely, if you can’t even answer the kinds of simple problems posed within the paper (both those we partially solve and those we only pose) then you must be very far off from being able to build a stable self-modifying AI. Being able to say how to build a theoretical device that would play perfect chess given infinite computing power is very far off from the ability to build Deep Blue. However, if you can’t even say how to play perfect chess given infinite computing power, you are confused about the rules of chess or the structure of chess-playing computation in a way that would make it entirely hopeless for you to figure out how to build a bounded chess-player. Thus “In real life we’re always bounded” is no excuse for not being able to solve the much simpler unbounded form of the problem, and being able to describe the infinite chess-player would be substantial and useful conceptual progress compared to not being able to do that. We can’t be absolutely certain that an analogous situation holds between solving the challenges posed in the paper, and realistic self-modifying AIs with stable goal systems, but every line of investigation has to start somewhere.
Parts of the paper will be easier to understand if you’ve read Highly Advanced Epistemology 101 For Beginners including the parts on correspondence theories of truth (relevant to section 6) and model-theoretic semantics of logic (relevant to 3, 4, and 6), and there are footnotes intended to make the paper somewhat more accessible than usual, but the paper is still essentially aimed at mathematically sophisticated readers.
The fact that MIRI is finally publishing technical research has impressed me. A year ago it seemed, to put it bluntly, that your organization was stalling, spending its funds on the full-time development of Harry Potter fanfiction and popular science books. Perhaps my intuition there was uncharitable, perhaps not. I don’t know how much of your lead researcher’s time was spent on said publications, but it certainly seemed, from the outside, that it was the majority. Regardless, I’m very glad MIRI is focusing on technical research. I don’t know how much farther you have to walk, but it’s clear you’re headed in the right direction.
By default, if you can build a Friendly AI you were not troubled by the Löb problem. That working on the Löb problem gets you closer to being able to build FAI is neither obvious nor certain (perhaps it is shallow to work on directly, and those who can build AI resolve it as a side effect of doing something else) but everything has to start somewhere. Being able to state crisp difficulties to work on is itself rare and valuable, and the more you engage with a problem like stable self-modification, the more you end up knowing about it. Engagement in a form where you can figure out whether or not your proof goes through is more valuable than engagement in the form of pure verbal arguments and intuition, although the latter is significantly more valuable than not thinking about something at all.
Reading through the whole Tiling paper might make this clearer; it spends the first 4 chapters on the Löb problem, then starts introducing further concepts once the notion of ‘tiling’ has been made sufficiently crisp, like the Vingean principle or the naturalistic principle, and then an even more important problem with tiling probabilistic agents (Ch. 7) and another problem with tiling bounded agents (Ch. 8), neither of which are even partially solved in the paper, but which would’ve made a lot less sense—would not have been reified objects in the reader’s mind—if the paper hadn’t spent all that time on the mathematical machinery needed to partially solve the Löb problem in logical tiling, which crispifies the notion of a ‘problem with tiling’.
I feel like in this comment you’re putting your finger on a general principle of instrumental rationality that goes beyond the specific issue at hand, and indeed beyond the realm of mathematical proof. It might be worth a post on “engagement” at some point.
Specifically, I note similar phenomena in software development where sometimes what I start working on ends up being not at all related to the final product, but nonetheless sets me off on a chain of consequences that lead me to the final, useful product. And I too experience the annoyance of managers insisting that I lay out a clear path from beginning to end, when I don’t yet know what the territory looks like or sometimes even what the destination is.
As Eisenhower said, “Plans are worthless, but planning is everything.”
If you can build a Friendly AI which can self-modify. FOOM-able algorithms are an important but not the only avenue to AGI, friendly or otherwise. Also, the “AGI”-class doesn’t necessarily imply superhuman cognition. Humans are intelligent agents for which the Löb problem has little bearing since we can’t (or don’t) self-modify to such a large degree quite yet.
Yes, but Friendly AI does. Nobody said you needed to solve the Lob problem to build an AGI. What we’re talking about here is something more specific than that.
Any agent that takes in information about the world is implicitly self-modifying all the time.
Here’s a distinction you could make: an AI is self-modifying if it is effectively capable of making any change to its source code at any time, and non-self-modifying if it is not. (The phrase “capable of” is vague, of course.)
I can imagine non-self-modifying AI having an advantage over self-modifying AI, because it might be possible for an NSM AI to be protected from its own stupidity, so to speak. If the AI were to believe that overwriting all of its beliefs with the digits of pi is a good idea, nothing bad would happen, because it would be unable to do that. Of course, these same restrictions that make the AI incapable of breaking itself might also make it incapable of being really smart.
I believe I’ve heard someone say that any AI capable of being really smart must be effectively self-modifying, because being really smart involves the ability to make arbitrary calculations, and if you can make arbitrary calculations, then you’re not restricted. My objection is that there’s a big difference between making arbitrary calculations and running arbitrary code; namely, the ability to run arbitrary code allows you to alter other calculations running on the same machine.
Lemme expand on my thoughts a little bit. I imagine a non-self-modifying AI to be made of three parts: a thinking algorithm, a decision algorithm, and a belief database. The thinking and decision algorithms are immutable, and the belief database is (obviously) mutable. The supergoal is coded into the decision algorithm, so it can’t be changed. (Problem: the supergoal only makes sense in the context of certain beliefs, and beliefs are mutable.) The contents of the belief database influence the thinking algorithm’s behavior, but they don’t determine its behavior.
The ideal possibility is that we can make the following happen:
- The belief database is flexible enough that it can accommodate all types of beliefs from the very beginning. (If the thinking algorithm is immutable, it can’t be updated to handle new types of beliefs.)
- The thinking algorithm is sufficiently flexible that the beliefs in the belief database can lead the algorithm in the right directions, producing super-duper intelligence.
- The thinking algorithm is sufficiently inflexible that the beliefs in the belief database cannot cause the algorithm to do something really bad, producing insanity.
- The supergoal remains meaningful in the context of the belief database regardless of how the thinking algorithm ends up behaving.
(My ideas haven’t been taken seriously in the past, and I have no special knowledge in this area, so it’s likely that my ideas are worthless. They feel valuable to me, however.)
This point seems like an argument in favor of the relevance of the problem laid out in this post. I have other complaints with this framing of the problem, which I expect you would share.
The key distinction between this and contemporary AI is not self-modification, but wanting to have the kind of agent which can look at itself and say, “I know that as new evidence comes in I will change my beliefs. Fortunately, it looks like I’m going to make better decisions as a result” or perhaps even more optimistically “But it looks like I’m not changing them in quite the right way, and I should make this slight change.”
The usual route is to build agents which don’t reason about their own evolution over time. But for sufficiently sophisticated agents, I would expect them to have some understanding of how they will behave in the future, and to e.g. pursue more information based on the explicit belief that by acquiring that information they will enable themselves to make better decisions. This seems like it is a more robust approach to getting the “right” behavior than having an agent which e.g. takes “Information is good” as a brute fact or has a rule for action that bakes in an ad hoc approach to estimating VOI. I think we can all agree that it would not be good to build an AI which calculated the right thing to do, and then did that with probability 99% and took a random action with probability 1%.
That said, even if you are a very sophisticated reasoner, having in hand some heuristics about VOI is likely to be helpful, and if you think that those heuristics are effective you may continue to use them. I just hope that you are using them because you believe they work (e.g. because of empirical observations of them working, the belief that you were intelligently designed to make good decisions, or whatever), not because they are built into your nature.
For a somewhat contrived and practically less relevant notion of self modifying. You could regard a calculator as being self modifying, not very relevantly.
It would be useful to understand why we think a calculator doesn’t “count” as self-modification. In particular, we don’t think calculators run into the Löb obstacle, so what is the difference between calculators and AIs?
As always in such matters, think of Turing Machines. If the transition function isn’t modified, the state of the Turing Machine may change. However, it’ll always be in an internal state prespecified in its transition function, and it won’t get unknown or unknowable new entries in its action table.
Universal Turing Machines are designed to change, to take their transition function from the input tape as input, a prime example of self-modification. But they as well—having read their new transition function from their input tape—will go about their business as usual without further changes to their transition function. (You can of course program them to later continue changing their action table, but the point is that such changes to its own action table—to its own behavior—are clearly delineated from mere contents in its memory / work tape.)
A calculator or a non-self-modifying AI will undergo changes in its memory, but it’ll never endeavor to define new internal states, with new rules, on its own. It’ll memorize whether you’ve entered “0.7734” in its display, but it’ll only perform its usual actions on that number. A game of tetris will change what blocks it displays on your screen, but that won’t modify its rules.
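The fixed-rules-versus-mutable-memory distinction above can be sketched in a few lines of Python; the toy machine and all names here are illustrative, not anything from the paper:

```python
# Minimal Turing-machine step loop: the transition table `delta` is fixed
# data, and only the tape contents and head position change as it runs.
def run(delta, tape, state="q0", head=0, max_steps=100):
    """Run a TM whose rules never change; only its memory (tape) does."""
    tape = dict(enumerate(tape))  # sparse tape, blank symbol is "_"
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, "_")
        # The machine can only follow entries already present in `delta`:
        # it cannot add new states or new rules to its own table.
        state, write, move = delta[(state, symbol)]
        tape[head] = write
        head += 1 if move == "R" else -1
    return state, [tape[i] for i in sorted(tape)]

# A two-state machine that flips bits until it reads a blank.
delta = {
    ("q0", "0"): ("q0", "1", "R"),
    ("q0", "1"): ("q0", "0", "R"),
    ("q0", "_"): ("halt", "_", "R"),
}
state, tape = run(delta, "0110")
```

Here the tape ends up rewritten (`0110` becomes `1001`), but `delta`—the machine's "rules"—is never touched; a self-modifying agent, by contrast, would be one that rewrites `delta` itself.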
There may be accidental modifications (bugs etc.) leading to unknown states and behavior, but I wouldn’t usefully call that an active act of self-modification. (It’s not a special case to guard against, other than by the usual redundancy / checksums; guarding against it is less FAI research than the same constraints that apply when working with e.g. real-time or mission-critical applications.)
I don’t think this is quite there. A UTM is itself a TM, and its transition function is fixed. But it emulates a TM, and it could instead emulate a TM-with-variable-transition-function, and that thing would be self-modifying in a deeper sense than an emulation of a standard TM.
But it’s still not obvious to me how to formalize this, because (among other problems) you can replace an emulated TMWVTF with an emulated UTM which in turn emulates a TMWVTF...
See the last paragraph of this comment highlighting my question about the relevance of the operationalization.
I feel like I’m not clear on what question you’re asking. Can you give an example of what a good answer would look like, maybe using Xs and Ys since I can hardly ask you to come up with an actual good argument?
There are many possible operationalizations of a self-modifying AI. For example,
One could model a self-improving AI as the Chinese economy (which is in some sense a self-improving powerful optimization process).
One could model a self-improving AI as a chess playing computer program which uses a positional weighting system to choose which moves to make, and which analyzes which weighting heuristics statistically lead to more winning games, in order to improve its positional weighting system.
My reaction to your paper is similar to what my reaction would be to a paper that studies ways to make sure that the Chinese economy doesn’t change in such a way that GDP starts dropping, or ways to make sure that the chess program doesn’t self-modify to get worse and worse at winning chess games rather than better and better.
It’s conceivable that such a paper would be useful for building a self-improving AI, but a priori I would bet very heavily that activities such as
- Working to increase rationality
- Spreading concern for global welfare
- Building human capital of people who are concerned about global welfare
are more cost-effective ways of reducing AI risk than doing such research.
I’m looking for an argument for why the operationalization in the paper is more likely to be relevant to creating safe AI than modeling a self-improving AI as the Chinese economy, or as the aforementioned chess program, or than a dozen other analogous operationalizations that I could make up.
If somebody wrote a paper showing how an economy could naturally build another economy while being guaranteed to have all prices derived from a constant set of prices on intrinsic goods, even as all prices were set by market mechanisms as the next economy was being built, I’d think, “Hm. Interesting. A completely different angle on self-modification with natural goal preservation.”
I’m surprised at the size of the apparent communications gap around the notion of “How to get started for the first time on a difficult basic question”—surely you can think of mathematical analogies to research areas where it would be significant progress just to throw out an attempted formalization as a base point?
There are all sorts of disclaimers plastered onto the paper about how this only works because logic is monotonic, probabilistic reasoning is not monotonic, etcetera. The point is to have a way, any way, of just getting started on stable self-modification even though we know the particular exact formalism doesn’t directly work for probabilistic agents. Once you do that you can at least state what it is you can’t do. A paper on a self-replicating economy with a stable set of prices on intrinsic goods would likewise be something you could look at and say, “But this formally can’t do X, because Y” and then you would know more about X and Y than you did previously. Being able to say, “But the verifier-suggester separation won’t work for expected utility agents because probabilistic reasoning is not monotonic” means you’ve gotten substantially further into FAI work than when you’re staring dumbly at the problem.
AIXI was conceptual progress on AGI, and especially public discussion of AGI, because it helped people like me say much more formally all the things that we didn’t like about AIXI, like the anvil problem or AIXI seizing control of its reward channel or AIXI only being able to represent utility functions of sensory data rather than environmental ontologies. Someone coming up with a list of 5 key properties the tiling architecture does not have would be significant progress, and I would like to specifically claim that as an intended, worthwhile, fully-pays-back-the-effort positive consequence if it happens—and this is not me covering all the bases in case of disappointment, the paper was presented in a way consonant with that goal and not in a way consonant with claiming one-trueness.
I don’t understand the model you have of FAI research where this is not the sort of thing that you do at the beginning.
Thanks for continuing to engage.
I described my position in another comment. To reiterate and elaborate:
My current best guess is that there are so many unrelated potential models for AI (relative to the information that we currently have) that the probability of FAI work on a single one of them ending up being relevant is tiny. In order to make a compelling argument for the relevance of MIRI’s work on the Löb problem, you have to argue that the model used isn’t just one of, say, 10^10 distinct models of AI with similar probability of being realized in practice.
One could argue that the problem is sufficiently important so that one should work on it even if the probability of the work being relevant is tiny. But there are other interventions on the table. You’ve made major contributions by spreading rationality and by creating a community for people who are interested in global welfare to network and collaborate with one another. These things probably substantially reduce astronomical waste (in expectation). In order to argue in favor of MIRI’s FAI research being optimal philanthropy, you have to argue that the probability of the research being relevant is sufficiently great so that its expected value outweighs the expected value of these other activities.
One could argue that if there are in fact so many models for AI then we’re doomed anyway, so we should assume that there aren’t so many models. But rather than trying to work on the models that we think most relevant now, we can wait until it becomes more clear what AGI will look like in practice, and then develop FAI for that type of AI. Whether or not this is feasible is of course related to the question of whether the world’s elites will navigate the creation of AI just fine. I think that there are good reasons to think that the probability of this is pretty high, and that the most leveraged efforts are getting good people in positions of future influence rather than doing FAI research now. Your work on rationality training and community building can help, and already has helped a lot with this.
Neither 2 nor 3 is the sort of argument I would ever make (there’s such a thing as an attempted steelman which by virtue of its obvious weakness doesn’t really help). You already know(?) that I vehemently reject all attempts to multiply tiny probabilities by large utilities in real-life practice, or to claim that the most probable assumption about a background epistemic question leads to a forgone doom and use this to justify its improbable negation. The part at which you lost me is of course part 1.
I still don’t understand what you could be thinking here, and feel like there’s some sort of basic failure to communicate going on. I could guess something along the lines of “Maybe Jonah is imagining that Friendly AI will be built around principles completely different from modern decision theory and any notion of a utility function...” (but really, is something like that one of just 10^10 equivalent candidates?) ”...and more dissimilar to that than logical AI is from decision theory” (that’s a lot of dissimilarity but we already demonstrated conceptual usefulness over a gap that size). Still, that’s the sort of belief someone might end up with if their knowledge of AI was limited to popular books extolling the wonderful chaos of neural networks, but that idea is visibly stupid so my mental model of Anna warns me not to attribute it to you. Or I could guess, “Maybe Jonah is Holden-influenced and thinks that all of this discussion is irrelevant because we’re going to build a Google Maps AGI”, where in point of fact it would be completely relevant, not a tiniest bit less relevant, if we were going to build a planning Oracle. (The experience with Holden does give me pause and make me worry that EA people may think they already know how to build FAI using their personal wonderful idea, just like vast numbers of others think they already know how to build FAI.) But I still can’t think of any acceptable steel version of what you mean, and I say again that it seems to me that you’re saying something that a good mainstream AI person would also be staring quizzically at.
What would be one of the other points in the 10^10-sized space? If it’s something along the lines of “an economic model” then I just explained why if you did something analogous with an economic model it could also be interesting progress, just as AIXI was conceptually important to the history of ideas in the field. I could explain your position by supposing that you think that mathematical ideas never generalize across architectures and so only analyzing the exact correct architecture of a real FAI could be helpful even at the very beginning of work, but this sounds like a visibly stupid position so the model of Anna in my head is warning me not to attribute it to you. On the other hand, some version of, “It is easier to make progress than Jonah thinks because the useful generalization of mathematical ideas does not require you to select correct point X out of 10^10 candidates” seems like it would almost have to be at work here somewhere.
I seriously don’t understand what’s going on in your head here. It sounds like any similar argument should Prove Too Much by showing that no useful work or conceptual progress could have occurred due to AI work in the last 60 years because there would be 10^10 other models for AI. Each newly written computer program is unique but the ideas behind them generalize, the resulting conceptual space can be usefully explored, that’s why we don’t start over with every new computer program. You can do useful things once you’ve collected enough treasure nuggets and your level of ability builds up, it’s not a question of guessing the one true password out of 10^10 tries with nothing being progress until then. This is true on a level of generality which applies across computer science and also to AI and also to FAI and also to decision theory and also to math. Everyone takes this for granted as an obvious background fact of doing research which is why I would expect a good mainstream AI person to also be staring quizzically at your statements here. I do not feel like the defense I’m giving here is in any way different from the defense I’d give of a randomly selected interesting AI paper if you said the same thing about it. “That’s just how research works,” I’d say.
Please amplify point 1 in much greater detail using concrete examples and as little abstraction as possible.
I continue to appreciate your cordiality.
A number of people have recently told me that they have trouble understanding me unless I elaborate further, because I don’t spell out my reasoning in sufficient detail. I think that this is more a matter of the ideas involved being complicated, and there being a lot of inferential distance, than it is lack of effort on my part, but I can see how it would be frustrating to my interlocutors. It seems that I’m subject to the illusion of transparency. I appreciate your patience.
I know that you’ve explicitly disavowed arguments of the type in my points 2 and 3. My reason for bringing them up is to highlight the importance of addressing point 1: to emphasize that it doesn’t suffice to say “the problem is important and we have to get started on it somehow.” I recognize that we have very different implicit assumptions on point 1, and that that’s where the core of the disagreement lies.
There’s essentially only one existing example of an entity with general intelligence: a human. I think that our prior should be that the first AGI will have internal structure analogous to that of a human. Here I’m not suggesting that an AGI will have human values by default: I’m totally on board with your points about the dangers of anthropomorphization in that context. Rather, what I mean is that I envisage the first AGI as having many interacting specialized modules, rather than a mathematically defined utility function.
There are serious dangers of such an entity having values that are orthogonal to humans, and serious dangers of value drift. (Your elegant article Why does power corrupt?, has some relevance to the latter point.) But it seems to me that the measures that one would want to take to prevent humans’ goals changing seem completely different from the sorts of measures that might emerge from MIRI’s FAI research.
I’ll also highlight a comment of Nick Beckstead, which you’ve already seen and responded to. I didn’t understand your response.
I should clarify that I don’t have high confidence that the first AGI will develop along these lines. But it’s my best guess, and it seems much more plausible to me than models of the type in your paper.
The difference that I perceive between the two scenarios is the nature of the feedback loops in each case.
When one is chipping away at a problem incrementally, one has the capacity to experiment and use the feedback generated from experimentation to help one limit the search space. Based on what I know about the history of science, general relativity is one of the only successful theories that was created without lots of empirical investigation.
The engineers who designed the first bridges had trillions of combinations of design features and materials to consider a priori, the vast majority of which wouldn’t work. But an empirical discovery like “material X is too weak to work within any design” greatly limits the search space, because you don’t have to think further about any of the combinations involving material X. Similarly if one makes a discovery of the type “material Y is so strong that it’ll work with any design.” By making a series of such discoveries, one can hone in on a few promising candidates.
This is how I predict that the development of AGI will go. I think that the search space is orders of magnitude too large to think about in a useful way without a lot of experimentation, and that a priori we can’t know what the first AGI will look like. I think that once it becomes more clear what the first AGI will look like, it will become much more feasible to make progress on AI safety.
It’ll take me a while to come up with a lot of concrete hypotheticals, but I’ll get back to you on this.
Okay. This sounds like you’re trying to make up your own FAI theory in much the same fashion as Holden (and it’s different from Holden’s, of course). Um, what I’d like to do at this point is take out a big Hammer of Authority and tell you to read “Artificial Intelligence: A Modern Approach” so your mind would have some better grist to feed on as to where AI is and what it’s all about. If I can’t do that… I’m not really sure where I could take this conversation. I don’t have the time to personally guide you to understanding of modern AI starting from that kind of starting point. If there’s somebody else you’d trust to tell you about AI, with more domain expertise, I could chat with them and then they could verify things to you. I just don’t know where to take it from here.
On the object level I will quickly remark that some of the first attempts at heavier-than-air flying-machines had feathers and beaks and they did not work very well, that ‘interacting specialized modules’ is Selling Nonapples, that there is an old discussion in cognitive science about the degree of domain specificity in human intelligence, and that the idea that ‘humans are the only example we have’ is generally sterile, for reasons I’ve already written about but I can’t remember the links offhand, hopefully someone else does. It might be in Levels of Organization in General Intelligence, I generally consider that pretty obsolete but it might be targeted to your current level.
Either of my best guess or Holden’s best guess could be right, and so could lots of other ideas that we haven’t thought of. My proposed conceptual framework should be viewed as one of many weak arguments.
The higher level point that I was trying to make is that [the conceptual framework implicit in the view that MIRI’s current FAI research has a non-negligible chance of being relevant to AI safety] seems highly conjunctive. I don’t mean this rhetorically at all – I genuinely don’t understand why you think that we can make progress given how great the unknown unknowns are. You may be right, but justification of your view requires further argumentation.
A more diplomatic way of framing this would be something like:
“The book Artificial Intelligence: A Modern Approach has a discussion of current approaches to artificial intelligence. Are you familiar with the ideas therein? If not, I’d suggest that you take a look.”
Putting that aside, based on conversations with a number of impressive people in machine learning, etc. who I know, my impression is that at the moment, there aren’t strong contenders for research programs that could plausibly lead to AGI. I largely accept Luke’s argument in his blog post on AI timelines, but this is based on the view that the speed of research is going to increase a lot over the coming years, rather than on the belief that any existing research programs have a reasonable chance of succeeding.
I’d be very interested in hearing about existing research programs that have a reasonable chance of succeeding.
Is it your view that no progress has occurred in AI generally for the last sixty years?
The field as a whole has been making perfectly good progress AFAICT. We know a bleepton more about cognition than we did in 1955 and are much less confused by many things. Has someone been giving you an impression otherwise and if so, what field were they in?
No, it’s clear that there have been many advances, for example in chess playing programs, auto-complete search technology, automated translation, driverless cars, and speech recognition.
But my impression is that this work has only made a small dent in the problem of general artificial intelligence.
Three graduate students in machine learning at distinct elite universities.
Scott Aaronson. Even though he works in theoretical computer science rather than AI, he’s in close proximity with many colleagues who work on artificial intelligence at MIT, and so I give a fair amount of weight to his opinion.
Also, the fraction of scientists who I know who believe that there’s a promising AGI research agenda on the table is very small, mostly consisting of people around MIRI. Few of the scientists who I know have subject matter expertise, but if there was a promising AGI research agenda on the table, I would expect news of it to have percolated to at least some of the people in question.
I think I may have been one of those three graduate students, so just to clarify, my view is:
Zero progress being made seems too strong a claim, but I would say that most machine learning research is neither relevant to, nor trying to be relevant to, AGI. I think that there is no real disagreement on this empirical point (at least, from talking to both Jonah and Eliezer in person, I don’t get the impression that I disagree with either of you on this particular point).
The model for AGI that MIRI uses seems mostly reasonable, except for the “self-modification” part, which seems to be a bit too much separated out from everything else (since pretty much any form of learning is a type of self-modification—current AI algorithms are self-modifying all the time!).
In this vein, I’m skeptical of both the need for and the feasibility of an AI providing an actual proof of safety of self-modification. I also think that using mathematical logic somewhat clouds the issues here, and that most of the issues that MIRI is currently working on are prerequisites for any sort of AI, not just friendly AI. I expect them to be solved as a side-effect of what I see as more fundamental outstanding problems.
However, I don’t have reasons to be highly confident in these intuitions, and as a general rule of thumb, having different researchers with different intuitions pursue their respective programs is a good way to make progress, so I think it’s reasonable for MIRI to do what it’s doing (note that this is different from the claim that MIRI’s research is the most important thing and is crucial to the survival of humanity, which I don’t think anyone at MIRI believes, but I’m clarifying for the benefit of onlookers).
Agreed, the typical machine learning paper is not AGI progress—a tiny fraction of such papers being AGI progress suffices.
I want to note that the general idea being investigated is that you can have a billion successive self-modifications with no significant statistically independent chance of critical failure. Doing proofs from axioms is an obvious entry point into this kind of lasting guarantee: the theorems are not perfectly strong, but they are at least as strong as the axioms, and conditionally independent failure probabilities do not significantly lower the conclusion strength below this as they stack. It also suggests to me that even if the actual solution doesn’t use theorems proved and adapted to the AI’s self-modification, it may have logic-like properties. The idea here may be more general than it looks at first glance.
Can you name some papers that you think constitute AGI progress? (Not a rhetorical question.)
I’m not sure if I parse this correctly, and may be responding to something that you don’t intend to claim, but I want to remark that if the probabilities of critical failure at each stage are
0.01, 0.001, 0.0001, 0.00001, etc.
then the total probability of critical failure is less than 2%. You don’t need the probability of failure at each stage to be infinitesimal; you only need the probabilities of failure to drop off fast enough.
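For concreteness, the arithmetic behind this claim can be checked in a few lines (a sketch of my own, using the stage probabilities quoted above):

```python
# Per-stage failure probabilities 0.01, 0.001, 0.0001, ... for 50 stages.
stage_probs = [10.0 ** -(i + 2) for i in range(50)]

# Probability of surviving every stage, assuming the failures are independent.
p_survive = 1.0
for p in stage_probs:
    p_survive *= 1.0 - p

p_total_failure = 1.0 - p_survive
# By the union bound, p_total_failure <= sum(stage_probs) ~= 0.0111,
# comfortably below 2% even though the first stage alone risks 1%.
```

The total comes out near 1.11%, dominated almost entirely by the first stage.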
How would they drop off if they’re “statistically independent”? In principle this could happen, given a wide separation in time, if humanity or lesser AIs somehow solve a host of problems for the self-modifier. But both the amount of help from outside and the time-frame seem implausible to me, for somewhat different reasons. (And the idea that we could know both of them well enough to have those subjective probabilities seems absurd.)
The Chinese economy was stagnant for a long time, but is now much closer to continually increasing GDP (on average) with high probability, and I expect that “goal” of increasing GDP will become progressively more stable over time.
The situation may be similar with AI, and I would expect it to be by default.
I’m aware of this argument, but I think there are other ways to get this. The first tool I would reach for would be a martingale (or more generally a supermartingale), which is a statistical process that somehow manages to correlate all of its failures with each other (basically by ensuring that any step towards failure is counterbalanced in probability by a step away from failure). This can yield bounds on failure probability that hold for extremely long time horizons, even if there is non-trivial stochasticity at every step.
Note that while martingales are the way that I would intuitively approach this issue, I’m trying to make the broader argument that there are ways other than mathematical logic to get what you are after (with martingales being one such example).
Please expand on this, because I’m having trouble understanding your idea as written. A martingale is defined as “a sequence of random variables (i.e., a stochastic process) for which, at a particular time in the realized sequence, the expectation of the next value in the sequence is equal to the present observed value even given knowledge of all prior observed values at a current time”, but what random variable do you have in mind here?
I can make some sense of this, but I’m not sure whether it is what Jacob has in mind because it doesn’t seem to help.
Imagine that you’re the leader of an intergalactic civilization that wants to survive and protect itself against external threats forever. (I’m spinning a fancy tale for illustration; I’ll make the link to the actual AI problem later, bear with me.) Your abilities are limited by the amount of resources in the universe you control. The variable X(t) says what fraction you control at time t; it takes values between 0 (none) and 1 (everything). If X(t) ever falls to 0, the game’s over and it will stay at 0 forever.
Suppose you find a strategy such that X(t) is a supermartingale; that is, E[X(t’) | I_t] >= X_t for all t’ > t, where I_t is your information at time t. [ETA: In discrete time, this is equivalent to E[X(t+1) | I_t] >= X_t, i.e., in expectation you have at least as many resources in the next round as you have in this round.] Now clearly we have E[X(t’) | I_t] <= P[X(t’) > 0 | I_t], since X(t’) <= 1, and therefore P[X(t’) > 0 | I_t] >= X_t. Therefore, given your information at time t, the probability that your resources will never fall to zero is at least X_t (this follows from the above by using the assumption that if they ever fall to 0, then they stay at 0). So if you start with a large share of the resources, there’s a large probability that you’ll never run out.
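A toy version of this bound is easy to simulate. The sketch below (my own illustration, not part of the original comment) uses a symmetric gambler’s-ruin walk, which is a martingale and hence in particular satisfies the condition above; the fraction of runs that are never ruined should come out close to the starting share X_0:

```python
import random

def never_ruined(k, n):
    # Symmetric random walk on {0, ..., n}, absorbing at both ends.
    # X_t = position / n is a martingale, so P(reach n before 0) = k / n,
    # which here is exactly P(resources never fall to 0).
    while 0 < k < n:
        k += random.choice((-1, 1))
    return k == n

random.seed(0)
trials = 20_000
n, start = 10, 7          # X_0 = 0.7
survived = sum(never_ruined(start, n) for _ in range(trials)) / trials
# survived should be close to 0.7, matching the bound P[never ruined] >= X_0
```

With 20,000 trials the empirical survival fraction lands within about a percentage point of 0.7.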
The link to AI is that we replace “share of resources” by some “quality” parameter describing the AI. I don’t know whether Jacob has ideas what such parameter might be, but it would be such that there is a catastrophe iff it falls to 0.
The problem with all of this is that it sounds mostly like a restatement of “we don’t want there to be an independent failure probability on each step; we want there to be a positive probability that there is never a failure”. The martingale condition is a bit more specific than that, but it doesn’t tell us how to make that happen. So, unless I’m completely mistaken about what Jacob intended to say (possible), it seems more like a different description of the problem rather than a solution to the problem...
Thank you Benja, for the very nice explanation! (As a technical point, what you are describing is a “submartingale”; a supermartingale has the inequality going in the opposite direction, and then of course you have to make 1 = failure and 0 = success instead of the other way around.)
Martingales may in some sense “just” be a rephrasing of the problem, but I think that’s quite important! In particular, they implicitly come with a framework of thought that suggests possible approaches—for instance, one could imagine a criterion for action in which risks must always be balanced by the expectation of acquiring new information that will decrease future risks—we can then imagine writing down a potential function encapsulating both risk to humanity and information about the world / humanity’s desires, and have as a criterion of action that this potential function never increase in expectation (relative to, e.g., some subjective probability distribution that we have reason to believe is well-calibrated).
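To illustrate what such a criterion of action might look like, here is a minimal sketch (all names are hypothetical illustrations of the idea above, not an actual proposal):

```python
def permitted_actions(state, actions, potential, expected_next_potential):
    # Hypothetical supermartingale criterion: permit an action only if the
    # potential function (risk to humanity minus expected information value)
    # does not increase in expectation under our subjective distribution.
    current = potential(state)
    return [a for a in actions
            if expected_next_potential(state, a) <= current]

# Toy usage: the potential is just the state itself; "wait" leaves it
# unchanged, while "gamble" raises it in expectation and is forbidden.
ok = permitted_actions(
    1.0,
    ["wait", "gamble"],
    lambda s: s,
    lambda s, a: s if a == "wait" else s + 0.5,
)
```

The real work, of course, is in writing down a potential function and a calibrated expectation; the sketch only shows the shape of the criterion.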
I second Wei’s question. I can imagine doing logical proofs about how your successor’s algorithms operate to try to maximize a utility function relative to a lawfully updated epistemic state, and would consider my current struggle to be how to expand this to a notion of a lawfully approximately updated epistemic state. If you say ‘martingale’ I have no idea where to enter the problem at all, or where the base statistical guarantees that form part of the martingale would come from. It can’t be statistical testing unless the problem is i.i.d. because otherwise every context shift breaks the guarantee.
I’m not sure how to parse your last sentence about statistical testing, but does Benja’s post and my response help to clarify?
You are aware that not all statistical tests require i.i.d. assumptions, right?
I’d be interested in your thoughts on the point about computational complexity in this comment.
It seems to me like relatively narrow progress on learning is likely to be relevant to AGI. It does seem plausible that e.g. machine learning research is not too much more relevant to AGI than progress in optimization or in learning theory or in type theory or perhaps a dozen other fields, but it doesn’t seem very plausible that it isn’t taking us closer to AGI in expectation.
Yes, reflective reasoning seems to be necessary to reason about the process of learning and the process of reflection, amongst other things. I don’t think any of the work that has been done applies uniquely to explicit self-modification vs. more ordinary problems with reflection (e.g. I think the notion of “truth” is useful if you want to think about thinking, and believing that your own behavior is sane is useful if you want to think about survival as an instrumental value).
This seems quite likely (or at least the weaker claim, that either these results are necessary for any AI or they are useless for any AI, seems very likely). But of course this is not enough to say that such work isn’t useful for better understanding and coping with AI impacts. If we can be so lucky as to find important ideas well in advance of building the practical tools that make those ideas algorithmically relevant, then we might develop a deeper understanding of what we are getting into and more time to explore the consequences.
In practice, even if this research program worked very well, we would probably be left with at least a few and perhaps a whole heap of interesting theoretical ideas. And we might have few clues as to which will turn out to be most important. But that would still give us some general ideas about what human-level AI might look like, and could help us see the situation more clearly.
Indeed, I would be somewhat surprised if interesting statements get proven often in the normal business of cognition. But this doesn’t mean that mathematical logic and inference won’t play an important role in AI—logic is by far the most expressive language that we are currently aware of, and therefore a natural starting point if we want to say anything formal about cognition (and as far as I can tell this is not at all a fringe view amongst folks in AI).
I’d be interested in your response to the following, which I wrote in another context. I recognize that I’m far outside of my domain of expertise, and what I write should be read as inquisitive rather than argumentative:
The impression that I’ve gotten is that to date, impressive applications of computers to do tasks that humans do are based around some combination of
Brute force computation
Task specific algorithms generated by humans
In particular, these don’t seem at all relevant to mimicking human inference algorithms.
As I said in my point #2 here: I find it very plausible that advances in narrow AI will facilitate the development of AGI by enabling experimentation.
The question that I’m asking is more: “Is it plausible that the first AGI will be based on filling in implementation details of current neural networks research programs, or current statistical inference research programs?”
Something worth highlighting is that researchers in algorithms have repeatedly succeeded in developing algorithms that solve NP-complete problems in polynomial time with very high probability, or that give very good approximations to solutions to problems in polynomial time where it would be NP-complete to get the solutions exactly right. But these algorithms can’t be ported from one NP-complete problem to another while retaining polynomial running time. One has to deal with each algorithmic problem separately.
From what I know, my sense is that one has a similar situation in narrow AI, and that humans (in some vague sense) have a polynomial time algorithm that’s robust across different algorithmic tasks.
I don’t really understand how “task specific algorithms generated by humans” differs from general intelligence. Humans choose a problem, and then design algorithms to solve the problem better. I wouldn’t expect a fundamental change in this situation (though it is possible).
I think this is off. A single algorithm currently achieves the best known approximation ratio on all constraint satisfaction problems with local constraints (this includes most of the classical NP-hard approximation problems where the task is “violate as few constraints as possible” rather than “satisfy all constraints, with as high a score as possible”), and is being expanded to cover increasingly broad classes of global constraints. You could say “constraint satisfaction is just another narrow task” but this kind of classification is going to take you all the way up to human intelligence and beyond. Especially if you think ‘statistical inference’ is also a narrow problem, and that good algorithms for planning and inference are more of the same.
All I’m saying here is that general intelligence can construct algorithms across domains, whereas my impression is that impressive human+ artificial intelligence to date hasn’t been able to construct algorithms across domains.
General artificial intelligence should be able to prove:
The Weil conjectures
The geometrization conjecture
Monstrous Moonshine
The classification of simple finite groups
The Atiyah Singer Index Theorem
The Virtual Haken Conjecture
and thousands of other such statements. My impression is that current research in AI is analogous to working on proving these things one at a time.
Working on the classification of simple finite groups could indirectly help you prove the Atiyah-Singer Index Theorem on account of leading to the discovery of structures that are relevant, but such work will only make a small dent on the problem of proving the Atiyah-Singer Index Theorem. Creating an algorithm that can prove these things (that’s not over-fitted to the data) is a very different problem from that of proving the theorems individually.
Do you think that the situation with AI is analogous or disanalogous?
I’m not sure if I follow. Is the algorithm that you have in mind the conglomeration of all existing algorithms?
If so, it’s entirely unclear how quickly the algorithm is growing relative to the problems that we’re interested in.
No, there is a single SDP rounding scheme that gets optimal performance on all constraint satisfaction problems (the best we know so far, and the best possible under the unique games conjecture).
Can you give a reference?
http://dl.acm.org/citation.cfm?id=1374414
PDF.
I’d be interested in your thoughts on this discussion post.
I would disagree with the statement that our algorithms are all domain-specific. Often some amount of domain-specific knowledge is needed to design a good algorithm, but it is often quite minimal. For instance, my office-mate is building a parser for interpreting natural language semantics, and has taken zero linguistics classes (but has picked up some amount of linguistics knowledge from talks, etc.). Of course, he’s following in the footsteps of people who do know linguistics, but the point is just that the methods people use tend to be fairly general despite requiring task-specific tuning.
I agree, of course, that there are few systems that work across multiple domains, but I’m not sure that that’s a fundamental issue so much as a symptom of broader issues that surface in this context (such as latent variables and complex features).
Thanks Jacob. I’d be interested in your thoughts on this discussion post.
You can’t do that? From random things like computer security papers, I was under the impression that you could do just that—convert any NP problem to a SAT instance and toss it at a high-performance commodity SAT solver with all its heuristics and tricks, and get an answer back.
You can do this. Minor caveat: this works for overall heuristic methods- like “tabu search” or “GRASP”- but many of the actual implementations you would see in the business world are tuned to the structure of the probable solution space. One of the traveling salesman problem solvers I wrote a while back would automatically discover groups of cities and move them around as a single unit- useful when there are noticeable clusters in the space of cities, not useful when there aren’t. Those can lead to dramatic speedups (or final solutions that are dramatically closer to the optimal solution) but I don’t think they translate well across reformulations of the problem.
I’m not a subject matter expert here, and just going based on my memory and what some friends have said, but according to http://en.wikipedia.org/wiki/Approximation_algorithm,
You can do that. But although such algorithms will produce correct answers to any NP problem when given correct answers to SAT, that does not mean that they will produce approximate answers to any NP problem when given approximate answers to SAT. (In fact, I’m not sure if the concept of an approximate answer makes sense for SAT, although of course you could pick a different NP-complete problem to reduce to.)
Edit: My argument only applies to algorithms that give approximate solutions, not to algorithms that give correct solutions with high probability, and reading your comment again, it looks like you may have been referring to the latter. You are correct that if you have a polynomial-time algorithm to solve any NP-complete problem with high probability, then you can get a polynomial-time algorithm to solve any NP problem with high probability. Edit 2: sort of; see discussion below.
Oh, I see. I confused probabilistic algorithms with ones bounding error from the true optimal solution.
Can you give a reference?
If a problem is NP-complete, then by definition, any NP problem can be solved in polynomial time by an algorithm which is given an oracle that solves the NP-complete problem, which it is allowed to use once. If, in place of the oracle, you substitute a polynomial-time algorithm which solves the problem correctly 90% of the time, the algorithm will still be polynomial-time, and will necessarily run correctly at least 90% of the time.
However, as JoshuaZ points out, this requires that the algorithm solve every instance of the problem with high probability, which is a much stronger condition than just solving a high proportion of instances. In retrospect, my comment was unhelpful, since it is not known whether there are any algorithms that solve every instance of an NP-complete problem with high probability. I don’t know how generalizable the known tricks for solving SAT are (although presumably they are much more generalizable than JoshuaZ’s example).
This is the key. If you had an algorithm that solved every instance of an NP-complete problem in polynomial time with high probability, you could generate a proof of the Riemann hypothesis with high probability! (Provided that the polynomial time algorithm is pretty fast, and that the proof isn’t too long)
It depends, I think, on what AlexMennen meant by this. If for example there is a single NP-complete problem in BPP then it is clear that NP is in BPP. Similar remarks apply to ZPP, and in both cases, almost the entire polynomial hierarchy will collapse. The proofs here are straightforward.
If, however, Alex meant that one is picking random instances of a specific NP-complete problem, and that they can be solved deterministically, then Alex’s claim seems wrong. Consider for example this problem: “If an input string of length n starts with exactly floor(n^(1/2)) zeros and then a 1, treat the remainder like it is an input string for 3-SAT. If the string starts with anything else, return instead the parity of the string.” This is an NP-complete problem where we can solve almost all instances with high probability, since most instances are really just a silly P problem. But we cannot use this fact to solve another NP-complete problem (say normal 3-SAT) with high probability.
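The padding trick in this construction is easy to make concrete; here is a sketch of the dispatcher as I read the problem statement above (`math.isqrt` computes floor(n^(1/2))):

```python
import math

def padded_problem(s):
    # JoshuaZ's construction: if a string of length n starts with exactly
    # floor(sqrt(n)) zeros followed by a '1', the remainder encodes a
    # 3-SAT instance; every other string is answered by its parity.
    # A random string almost never matches the prefix, so almost all
    # instances are trivial, yet the problem as a whole is NP-complete.
    n = len(s)
    k = math.isqrt(n)
    if s[: k + 1] == "0" * k + "1":
        return ("3-SAT", s[k + 1 :])  # the rare hard branch
    return ("parity", s.count("1") % 2)
```

For length 9, only strings beginning "0001" reach the hard branch; everything else is answered by a parity count in linear time.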
Why?
Well, in the easy case of ZPP, ZPP is contained in co-NP, so if NP is contained in ZPP then NP is contained in co-NP, in which case the hierarchy must collapse to the first level.
In the case of BPP, the details are slightly more subtle and require deeper results. If BPP contains NP, then Adleman’s theorem says that the entire polynomial hierarchy is contained in BPP. Since BPP is itself contained at a finite level of the hierarchy, this forces collapse to at least that level.
Point of order: Let A = “these results are necessary for any AI” and B = “they are useless for any AI”. It sounds like you’re weakening from A to (A or B) because you feel the probability of B is large, and therefore the probability of A isn’t all that large in absolute terms. But if much of the probability mass of the weaker claim (A or B) comes from B, then if at all possible, it seems more pragmatically useful to talk about (i) the probability of B and (ii) the probability of A given (not B), instead of talking about the probability of (A or B), since qualitative statements about (i) and (ii) seem to be what’s most relevant for policy. (In particular, even knowing that “the probability of (A or B) is very high” and “the probability of A is not that high”—or even “is low”—doesn’t tell us whether P(A|not B) is high or low.)
My impression from your above comments is that we are mostly in agreement except for how much we respectively like mathematical logic. This probably shouldn’t be surprising given that you are a complexity theorist and I’m a statistician, and perhaps I should learn some more mathematical logic so I can appreciate it better (which I’m currently working on doing).
I of course don’t object to logic in the context of AI, it mainly seems to me that the emphasis on mathematical logic in this particular context is unhelpful, as I don’t see the issues being raised as being fundamental to what is going on with self-modification. I basically expect whatever computationally bounded version of probability we eventually come up with to behave locally rather than globally, which I believe circumvents most of the self-reference issues that pop up (sorry if that is somewhat vague intuition).
Thanks Jacob.
I’d be interested in your thoughts on my comment here.
Hm. I’m not sure if Scott Aaronson has any weird views on AI in particular, but if he’s basically mainstream-oriented we could potentially ask him to briefly skim the Tiling Agents paper and say if it’s roughly the sort of paper that it’s reasonable for an organization like MIRI to be working on if they want to get some work started on FAI. At the very least if he disagreed I’d expect he’d do so in a way I’d have better luck engaging conversationally, or if not then I’d have two votes for ‘please explore this issue’ rather than one.
I feel again like you’re trying to interpret the paper according to a different purpose from what it has. Like, I suspect that if you described what you thought a promising AGI research agenda was supposed to deliver on what sort of timescale, I’d say, “This paper isn’t supposed to do that.”
This part is clearer and I think I may have a better idea of where you’re coming from, i.e., you really do think the entire field of AI hasn’t come any closer to AGI, in which case it’s much less surprising that you don’t think the Tiling Agents paper is the very first paper ever to come closer to AGI. But this sounds like a conversation that someone else could have with you, because it’s not MIRI-specific or FAI-specific. I also feel somewhat at a loss for where to proceed if I can’t say “But just look at the ideas behind Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, that’s obviously important conceptual progress because...” In other words, you see AI doing a bunch of things, we already mostly agree on what these sorts of surface real-world capabilities are, but after checking with some friends you’ve concluded that this doesn’t mean we’re less confused about AGI than we were in 1955. I don’t see how I can realistically address that except by persuading your authorities; I don’t see what kind of conversation we could have about that directly without being able to talk about specific AI things.
Meanwhile, if you specify “I’m not convinced that MIRI’s paper has a good chance of being relevant to FAI, but only for the same reasons I’m not convinced any other AI work done in the last 60 years is relevant to FAI” then this will make it clear to everyone where you’re coming from on this issue.
He wrote this about a year ago:
And later:
Without further context I see nothing wrong here. Superintelligences are Turing machines, check. You might need a 10^20 slowdown before that becomes relevant, check. It’s possible that the argument proves too much by showing that a well-trained high-speed immortal dog can simulate Mathematica and therefore a dog is ‘intellectually expressive’ enough to understand integral calculus, but I don’t know if that’s what Scott means and principle of charity says I shouldn’t assume that without confirmation.
EDIT: Parent was edited, my reply was to the first part, not the second. The second part sounds like something to talk with Scott about. I really think the “You’re just as likely to get results in the opposite direction” argument is on the priors overstated for most forms of research. Does Scott think that work we do today is just as likely to decrease our understanding of P/NP as increase it? We may be a long way off from proving an answer but that’s not a reason to adopt such a strange prior.
As it happens, I’ve been chatting with Scott about this issue recently, due to some comments he made in his recent quantum Turing machine paper:
I thought his second objection (“how could we know what to do about it?”) was independent of his first objection (“AI seems farther away than the singularitarians tend to think”), but when I asked him about it, he said his second objection just followed from the first. So given his view that AI is probably centuries away, it seems really hard to know what could possibly help w.r.t. FAI. And if I thought AI was several centuries away, I’d probably have mostly the same view.
I asked Scott: “Do you think you’d hold roughly the same view if you had roughly the probability distribution over year of AI creation as I gave in When Will AI Be Created? Or is this part of your view contingent on AI almost certainly being several centuries away?”
He replied: “No, if my distribution assigned any significant weight to AI in (say) a few decades, then my views about the most pressing tasks today would almost certainly be different.” But I haven’t followed up to get more specifics about how his views would change.
And yes, Scott said he was fine with quoting this conversation in public.
I think I’d be happy with a summary of persistent disagreement where Jonah or Scott said, “I don’t think MIRI’s efforts are valuable because we think that AI in general has made no progress on AGI for the last 60 years / I don’t think MIRI’s efforts are priorities because we don’t think we’ll get AGI for another 2-3 centuries, but aside from that MIRI isn’t doing anything wrong in particular, and it would be an admittedly different story if I thought that AI in general was making progress on AGI / AGI was due in the next 50 years”.
I think that your paraphrasing
is pretty close to my position.
I would qualify it by saying:
I’d replace “no progress” with “not enough progress for there to be a known research program with a reasonable chance of success.”
I have high confidence that some of the recent advances in narrow AI will contribute (whether directly or indirectly) to the eventual creation of AGI (contingent on this event occurring), just not necessarily in a foreseeable way.
If I discover that there’s been significantly more progress on AGI than I had thought, then I’ll have to reevaluate my position entirely. I could imagine updating in the direction of MIRI’s FAI work being very high value, or I could imagine continuing to believe that MIRI’s FAI research isn’t a priority, for reasons different from my current ones.
Agreed-on summaries of persistent disagreement aren’t ideal, but they’re more conversational progress than usually happens, so… thanks!
I’m doing some work for MIRI looking at the historical track record of predictions of the future and actions taken based on them, and whether such attempts have systematically done as much harm as good.
To this end, among other things, I’ve been reading Nate Silver’s The Signal and the Noise. In Chapter 5, he discusses how attempts to improve earthquake predictions have consistently yielded worse predictive models than the Gutenberg-Richter law. This has slight relevance.
Such examples notwithstanding, my current prior is that MIRI’s FAI research has positive expected value. I don’t think that the expected value of the research is zero or negative – only that it’s not competitive with the best of the other interventions on the table.
My own interpretation of Scott’s words here is that it’s unclear whether your research is actually helping in the “get Friendly AI before some idiot creates a powerful Unfriendly one” challenge. Fundamental progress in AI in general could just as easily benefit the fool trying to build an AGI without too much concern for Friendliness as it could benefit you. Thus, whether fundamental research helps avoid the UFAI catastrophe is unclear.
I’m not sure that interpretation works, given that he also wrote:
Since Scott was addressing steps taken to act on the conclusion that friendliness was supremely important, presumably he did not have in mind general AGI research.
Yes, I would welcome his perspective on this.
I think I’ve understood your past comments on this point. My questions are about the implicit assumptions upon which the value of the research rests, rather than about what the research does or doesn’t succeed in arguing.
As I said in earlier comments, the case for the value of the research hinges on its potential relevance to AI safety, which in turn hinges on how good the model is for the sort of AI that will actually be built. Here I don’t mean “Is the model exactly right?” — I recognize that you’re not claiming it to be — the question is whether the model is in the right ballpark.
A case for the model being a good one requires pointing to a potentially promising AGI research program to which the model is relevant. This is the point that I feel hasn’t been addressed.
Some things that I see as analogous to the situation under discussion are:
A child psychology researcher who’s never interacted with children could write about good child rearing practices without the research being at all relevant to how to raise children well.
An economist who hasn’t looked at real world data about politics could study political dynamics using mathematical models without the research being at all relevant to politics in practice.
A philosopher who hasn’t studied math could write about the philosophy of math without the writing being relevant to math.
A therapist who’s never had experience with depression could give advice to a patient on overcoming depression without the advice being at all relevant to overcoming depression.
Similarly, somebody without knowledge of the type of AI that’s going to be built could research AI safety without the research being relevant to AI safety.
Does this help clarify where I’m coming from?
I’m open to learning object level material if I learn new information that convinces me that there’s a reasonable chance that MIRI’s FAI research is relevant to AI safety in practice.
Yes, this is where I’m coming from.
Missing link suspected. Suggest verifying that the url includes ‘http://’.
Just wondering why you see Jonah Sinick of high enough status to be worth explaining to what’s been discussed on LW repeatedly. Or maybe I’m totally misreading this exchange.
Maybe something to do with Jonah being previously affiliated with GiveWell?
I’m puzzled as to what you think I’m missing: can you say more?
Matching “first AGI will [probably] have internal structure analogous to that of a human” and “first AGI [will probably have] many interacting specialized modules” in a literal (cough uncharitable cough) manner, as evidenced by “heavier-than-air flying-machines had feathers and beaks”. Your phrasing hints at an anthropocentric architectural bias, analogous to the one you specifically distance yourself from regarding values.
Maybe you should clarify that part; it’s crucial to the current misunderstanding, and it’s not clear whether by “interacting specialized modules” you’d also refer to “Java classes not corresponding to anything ‘human’ in particular”, or whether you’d expect a “thalamus-module”.
I think that people should make more of an effort to pay attention to the nuances of people’s statements rather than using simple pattern matching.
There’s a great deal to write about this, and I’ll do so at a later date.
To give you a small taste of what I have in mind: suppose you ask “How likely is it that the final digit of the Dow Jones will be 2 in two weeks?” I’ve never thought about this question. A priori, I have no Bayesian prior. What my brain does is amalgamate:
The Dow Jones index varies in a somewhat unpredictable way
The last digit is especially unpredictable.
Two weeks is a really long time for unpredictable things to happen in this context
The last digit could be one of 10 values between 0 and 9
The probability of a randomly selected digit between 0 and 9 being 2 is equal to 10%
Different parts of my brain generate the different pieces, and another part of my brain combines them. I’m not using a single well-defined Bayesian prior, nor am I satisfying a well-defined utility function.
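To make the amalgamation concrete, here is a toy simulation of the last-digit question. The random-walk model, the step size, and the starting value are illustrative assumptions of mine, not claims about the actual Dow; the point is just that when daily moves are noisy relative to the digit scale, the “~10%, by uniformity” heuristic drops out of the model.

```python
import random

def last_digit_prob(target=2, trials=100_000, steps=10, start=35_000):
    """Toy model: an index that takes `steps` noisy daily moves.
    If the moves are large and noisy relative to the digit scale,
    the final last digit is close to uniform over 0-9."""
    hits = 0
    for _ in range(trials):
        value = start
        for _ in range(steps):  # roughly ten trading days in two weeks
            value += random.randint(-300, 300)  # assumed noisy daily move
        if value % 10 == target:
            hits += 1
    return hits / trials

# Under these assumptions the estimate lands near the uniform value 0.1.
print(last_digit_prob())
```

The separate heuristics in the list above (unpredictability, time horizon, ten possible digits) correspond to the model’s parameters; the combination step is the simulation itself.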
I don’t want to comment on the details, as this is way outside my area of expertise, but I do want to point out that you appear to be a victim of the bright dilettante fallacy. You appear to think that your significant mathematical background makes you an expert in an unrelated field without having to invest the time and effort required to get up to speed in it.
I don’t claim to have any object level knowledge of AI.
My views on this point are largely based on what I’ve heard from people who work on AI, together with introspection as to how I and other humans reason, and the role of heuristics in reasoning.
Let me try from a different angle.
With humans, we see three broad clusters of modification: reproduction, education, and chemistry. Different people are physically constructed in different ways, and so we can see evolution of human civilization by biological evolution of the humans inside it. The environments that people find themselves in or choose leave imprints on those people. Chemicals people ingest can change those people, such as with caffeine, alcohol, morphine, or heroin. (I would include ‘changing your diet to change your thought processes’ under chemical changes, but the chemical changes from becoming addicted to heroin and from not being creatine deficient look very different.)
For AIs, most of the modification that’s interesting and new will look like the “chemistry” cluster. An AI modifying its source code will look a lot like a human injecting itself with a new drug that it just invented. (Nick_Beckstead’s example of modifying the code of the weather computer is more like education than it is like chemistry.)
This is great because some drugs dramatically improve performance, and so a person on caffeine could invent a super nootropic, and then on the super nootropic invent a cure for cancer and an even better nootropic, and so on. This is terrifying because any drug that adjusts your beliefs or your decision-making algorithm (think of ‘personality’ as a subset of this) dramatically changes how you behave, and might do so for the worse. This is doubly terrifying because these changes might be irreversible- you might take a drug that gets rid of your depression by making you incapable of feeling desire, and then not have any desire to restore yourself! This is triply terrifying because the effects of the drug might be unknown- you might not be able to determine what a drug will do to you until after you take it, and by then it might be too late.
For humans this problem is mostly solved by trial and error followed by pattern-matching – “coffee is okay, crack is not, because Colin is rich and productive and Craig is neither” – which is not useful for new drugs, not useful for misclassified old drugs, and not very safe for very powerful systems. The third problem – that the effects might be unknown – is the sort of thing that proofs might help with, except there are some technical obstacles to doing that. The Lobstacle is a prominent theoretical one, and while it looks like there are lots of practical obstacles as well, surmounting the theoretical obstacles should help with surmounting the practical ones.
Any sort of AGI that’s able to alter its own decision-making process will have the ability to ‘do chemistry on itself,’ and one with stable values will need to have solved the problem of how to do that while preserving its values. (I don’t think that humans have ‘stable’ values; I’d call them something more like ‘semi-stable.’ Whether or not this is a bug or feature is unclear to me.)
I understand where you’re coming from, and I think that you correctly highlight a potential source of concern, and one which my comment didn’t adequately account for. However:
I’m skeptical that it’s possible to create an AI based on mathematical logic at all. Even if an AI with many interacting submodules is dangerous, it doesn’t follow that working on AI safety for an AI based on mathematical logic is promising.
Humans can impose selective pressures on emergent AI’s so as to mimic the process of natural selection that humans experienced.
Eliezer’s position is that the default mode for an AGI is failure; i.e., if an AGI is not provably safe, it will almost certainly go badly wrong. In that context, if you accept that “an AI with many interacting submodules is dangerous,” then that’s more or less equivalent to believing that one of the horribly wrong outcomes will almost certainly be achieved if an AGI with many submodules is created.
Humans are not Friendly. They don’t even have the capability under discussion here, to preserve their values under self-modification; a human-esque singleton would likely be a horrible, horrible disaster.
No doubt. And as of now, for none of them are we able to tell whether they are safe or not. There’s insufficient rigor in the language; the formalizations aren’t standardized or pinned down (in this subject matter). MIRI’s work is creating and pinning down the milestones for how we’d even go about assessing self-modifying Friendly AI in terms of goal stability, in mathematical language.
To have any operationalization of how some specific model of self-modification provably maintains some invariant would be a large step forward, the existence of other models of self-modification notwithstanding. Safety cannot be proven for all approaches, because not all approaches are safe.
Even if that were so, that’s not MIRI’s (or EY’s) most salient comparative advantage (also: CFAR).
My claim is that there are sufficiently many possible models for AI that given what we (the Less Wrong community, not necessarily AI researchers) know now, the probability of a given model being developed is tiny.
The actionable safety issues that would come up if the AI is like the Chinese economy would be very different from the actionable safety issues that would come up if the AI is like a self-improving chess playing program, which would be very different from the actionable safety issues that would come up if the AI is of the type that Eliezer’s publication describes.
Given the paucity of information available about the design of the first AI, I don’t think that the probability of doing safety research on a particular model being actionable is sufficiently high for such research to be warranted (relative to other available activities).
Eliezer made a major contribution to increasing rationality with his How To Actually Change Your Mind sequence, which improved the rationality of many people who I know, including myself.
MIRI could engage in other AI safety activities, such as improving future forecasting.
If an organization doesn’t have a cost-effective activity to engage in, and the employees recognize this, then they can leave and do something else. Here I’m not claiming that this is in fact the case of MIRI, rather, I’m just responding to your argument.
MIRI’s staff could migrate to CFAR.
Out of all of the high impact activities that MIRI staff could do, it’s not clear to me that Friendly AI research is their comparative advantage.
Also, even if we accept that MIRI’s comparative advantage has to do with having a clearer view of the Friendliness vs. UnFriendliness distinction, why wouldn’t it be more effective for them to try to insure against an UnFriendly outcome by addressing the UnFriendliness already in the world today? For instance, corporate governance. Corporations’ optimization powers are a tremendous source of human happiness, but their UnFriendly tendencies are clear. For now, corporations have only parasitic intelligence, and don’t look particularly foomy, but if I had to bet on whether MIRI or Google/TenCent/Palantir/whatever was more likely to foom, there would be no contest.
[There are a bunch of assumptions embedded there. The principal ones are:
If a corporation, as currently constituted, somehow went foom it would be likely to be UnFriendly
If we were able to make it so corporations appeared more Friendly in their day-to-day actions, they would also become less likely to rush headlong into an UnFriendly foom.
I think 1 is pretty undeniable, but I could understand it if someone disagreed with 2.]
I am very glad to see MIRI taking steps to list open problems and explain why those problems are important for making machine intelligence benefit humanity.
I’m also struggling to see why this Lob problem is a reasonable problem to worry about right now (even within the space of possible AI problems). Basically, I’m skeptical that this difficulty or something similar to it will arise in practice. I’m not sure if you disagree, since you are saying you don’t think this difficulty will “block AI.” And if it isn’t going to arise in practice (or something similar to it), I’m not sure why this should be high on the priority list of general AI issues to think about (edited to add: or why working on this problem now should be expected to help machine intelligence develop in a way that benefits humanity).
Some major questions I have are:
What are some plausible concrete examples of self-modifications where Lob issues might cause you to stumble? I promise not to interpret your answer as “Eliezer says this is probably going to happen.”
Do you think that people building AGI in the future will stumble over Lob issues if MIRI doesn’t work on those issues? If so, why?
Part of where I’m coming from on the first question is that Lobian issues only seem relevant to me if you want to argue that one set of fundamental epistemic standards is better than another, not for proving that other types of software and hardware alterations (such as building better arms, building faster computers, finding more efficient ways to compress your data, finding more efficient search algorithms, or even finding better mid-level statistical techniques) would result in more expected utility. But I would guess that once you have an agent operating with minimally decent fundamental epistemic standards, you just can’t prove that altering the agent’s fundamental epistemic standards would result in an improvement. My intuition is that you can only do that when you have an inconsistent agent, and in that situation it’s unclear to me how Lobian issues apply.
Part of where I’m coming from on the second question is that evolutionary processes made humans who seem capable of overcoming putative Lobian obstacles to self-modification. See my other comment for more detail. The other part has to do with basic questions about whether people will adequately prepare for AI by default.
I think you are right that strategy work may be higher value. But I think you underestimate the extent to which (1) such goods are complements [granting for the moment the hypothesis that this kind of AI work is in fact useful], and (2) there is a realistic prospect of engaging in many such projects in parallel, and that getting each started is a bottleneck.
As Wei Dai observed, you seem to be significantly understating the severity of the problem. We are investigating conditions under which an agent can believe that its own operation will lawfully lead to good outcomes, which is more or less necessary for reasonably intelligent, reasonably sane behavior given our current understanding.
Compare to: “I’m not sure how relevant formalizing mathematical reasoning is, because evolution made humans who are pretty good at reasoning without any underlying rigidly defined formal systems.”
Is there an essential difference between these cases? Your objection is very common, but it looks to me like it is on the wrong end of a very strong empirical regularity; i.e., it seems like you would argue against some of the most compelling historical developments in mathematics on similar grounds, while basically never ending up on the right side of an argument.
Similarly, you would discourage the person who advocates studying mathematical logic with the goal of building a thinking machine [which as far as I can tell was one of the original aims, before the program of formalization took off]. I do think we can predictably say that such research is worthwhile.
This is without even getting into MIRI’s raison d’etre, namely that it may be possible for societies to produce AI given widely varying levels of understanding of the underlying formal frameworks, and that all things equal we expect a deeper understanding of the underlying theory to result in better outcomes (according to the values of AI designers).
This is an interesting point I wasn’t fully taking into consideration. As I said in another comment, where MIRI has the right kind of technical AI questions, it makes sense to write them up.
I think it would greatly help me understand the expected practical implications of this research if you could address the question I asked in the original comment: “What are some plausible concrete examples of self-modifications where Lob issues might cause you to stumble?” I think I get why it causes problems if, as Wei Dai said, the AI makes decisions purely based on proofs. I don’t see how the problem would be expected to arise in scenarios that seem more plausible. I think that the MIRI people working in this problem know a lot more about this than me, which is why I am asking for examples; I expect you have something to tell me that will make this make more sense.
The argument I was trying to make was of the form:
Creation process A [natural and cultural evolution] led to agents who don’t stumble over problem B [Lobian issues].
By analogy, creation process C [people making AGI] will lead to agents who don’t stumble over problem B [Lobian issues], even if MIRI does not take special precautions to prevent this from happening.
Therefore, it is not necessary to take special precautions to make sure creation process C doesn’t stumble over problem B.
I don’t think this type of reasoning will lead to the conclusion that formalizing mathematics and doing mathematical logic are not worthwhile. Perhaps you interpreted my argument another way.
The opportunity to kill yourself in exchange for $10 is a prototypical case. It’s well and good to say “this is only a problem for an agent who uses proofs,” but that’s not a scenario, that’s a type of agent. Yes, real agents will probably not use mathematical logic in the naive way. But as I pointed out in response to Wei Dai, probabilism doesn’t make the issues go away. It softens the impossibility proofs, but we still lack possibility proofs (which is what MIRI is working on). So this seems like a weak objection. If you want to say “future agents will have different reasoning frameworks than the ones we currently understand,” that’s well and good, but see below. (That seems a lot like discouraging someone from trying to develop logic because their proto-logic doesn’t resemble the way that humans actually reason.)
This is what I thought you meant; it seems analogous to:
Creation process A [natural and cultural evolution] led to agents who don’t require a formalized deductive system
By analogy, creation process C [people making AGI] will lead to agents who don’t require a formalized deductive system
Therefore, it is not necessary to take special precautions to ensure that deductive systems are formalized.
Do you object to the analogy?
No one thinks that the world will be destroyed because people built AI’s that couldn’t handle the Lobian obstruction. That doesn’t seem like a sensible position, and I think Eliezer explicitly disavows it in the writeup. The point is that we have some frameworks for reasoning about reasoning. Those formalisms don’t capture reflective reasoning, i.e. they don’t provide a formal account of how reflective reasoning could work in principle. The problem Eliezer points to is an obvious problem that any consistent framework for reflective reasoning must resolve.
Working on this problem directly may be less productive than just trying to understand how reflective reasoning works in general—indeed, folks around here definitely try to understand how reflective reasoning works much more broadly, rather than focusing on this problem. The point of this post is to state a precise problem which existing techniques cannot resolve, because that is a common technique for making progress.
Thank you for the example. I do want to say “this is only a problem for an agent who uses proofs” if that’s indeed true. It sounds like you agree, but are saying that some analogous but more complicated problem might arise for probabilistic agents, and that it might not be resolved by whoever else is making AI unless this research is done by MIRI. If you have an example of a complication that you think would plausibly arise in practice and have further thoughts on why we shouldn’t expect this complication to be avoided by default in the course of the ordinary development of AI, I would be interested in hearing more. These do seem like crucial questions to me if we want to argue that this is an important line of research for the future of AI. Do you agree that these questions are crucial?
I do object to this analogy, though I now have a better idea of where you are coming from. Here’s a stab at how the arguments are different (first thing that came to mind):
My argument says that if creation process A led to agents who overcome obstacle X to doing Z, then the ordinary development of AGI will lead to agents who overcome obstacle X to doing Z.
Your argument says that if creation process A led to agents who overcome obstacle X to doing Z in way Y, then the ordinary development of AGI will lead to agents who overcome obstacle X to doing Z in way Y.
We might want to insert some qualifiers like “obstacle X needs to be essential to the proper functioning of the agent” or something along those lines, and other conditions I haven’t thought of may be relevant as well (often the case with analogies). But, basically, though I think the analogy suggests that the ordinary development of AI will overcome Lobian obstacles, I think it is much less supported that AGIs will overcome these obstacles in the same way as humans overcome them. Likewise, humans overcome obstacles to reasoning effectively in certain ways, and I don’t think there is much reason to suspect that AGIs will overcome these obstacles in the same ways. Therefore, I don’t think that the line of argument I was advancing supports the view that formalizing math and doing mathematical logic will be unhelpful in developing AI.
I think what you’re saying is that getting a good framework for reasoning about reasoning could be important for making AGI go well. This is plausible to me. And then you’re also saying that working on this Lobian stuff is a reasonable place to start. This is not obvious to me, but this seems like something that could be subtle, and I understand the position better now. I also don’t think that however you’re doing it should necessarily seem reasonable to me right now, even if it is.
Big picture: the big questions I had about this were:
What are some plausible concrete examples of self-modifications where Lob issues might cause you to stumble?
Do you think that people building AGI in the future will stumble over Lob issues if MIRI doesn’t work on those issues? If so, why?
I would now ask those questions differently:
What are some plausible concrete examples of cases where machines might fail to reason about self-modification properly if this research isn’t done? Why do you think it might happen in these cases?
Do you think that people building AGI in the future will fail to do this research, if it is in fact necessary for building well-functioning AIs? If so, why?
I now have a better understanding of what your answer to the first question might look like, though I’m still struggling to imagine what might plausibly go wrong in practice if the research isn’t done. As far as I can tell, there hasn’t been any effort directed at addressing the second question in this specific context so far. Maybe that’s because it’s thought that it’s just part of the general question of whether future elites will handle AI development just fine. I’m not sure it is, because it sounds like this may be part of making an AGI work at all, and the arguments I’ve heard for future elites not navigating it properly seem to turn on safety issues rather than basic functionality issues.
That’s not it, rather:
Yep. We have reasoning frameworks like the currently dominant forms of decision theory, but they don’t handle reflectivity well.
The Lob Problem isn’t a top-priority scary thing that is carved upon the tombstones of worlds, it’s more like, “Look! We managed to crisply exhibit something very precise that would go wrong with standard methods and get started on analyzing and fixing it! Before we just saw in a more intuitive sense that something would go wrong when we applied standard theories to reflective problems but now we can state three problems very precisely!” (Lob and coherent quantified belief sec. 3, nonmonotonicity of probabilistic reasoning sec. 5.2 & 7, maximizing / satisficing not being good-enough idioms for bounded agents sec. 8.) Problems with reflectivity in general are expectedly carved upon the tombstones of worlds because they expectedly cause problems with goal stability during self-modification. But to make progress on that you need crisp problems to provide fodder for getting started on finding a good shape for a reflective decision theory / tiling self-improving agent.
(As usual, I have somewhat less extreme views here than Eliezer.)
There is a problem here, we have an impossibility proof for a broad class of agents, and we know of no agents that don’t have the problem. Indeed, this limits the relevance of the impossibility proof, but it doesn’t limit the realness of the problem.
I don’t quite see where you are coming from here. It seems like the situation is:
There are problems that reflective reasoners would be expected to solve, which we don’t understand how to resolve in current frameworks for general reasoning (of which mathematical logic is the strongest).
If you think that reflective reasoning may be an important part of AGI, then having formal frameworks for reflective reasoning is an important part of having formal frameworks for AGI.
If you think that having formal frameworks is likely to improve our understanding of AGI, then having formal frameworks that support reflective reasoning is a useful step towards improving our understanding of AGI.
The sort of complication I imagine is: it is possible to build powerful AGI without having good frameworks for understanding its behavior, and then people do that. It seems like all things equal understanding a system is useful, not only for building it but also for having reasonable expectations about its behavior (which is in turn useful for making further preparations, solving safety problems, etc.). To the extent that understanding things at a deep level ends up being necessary to building them at all, then what we’re doing won’t matter (except insofar as people who care about safety making modest technical contributions is indirectly useful).
Same answer. It may be that understanding reasoning well is necessary to building powerful agents (indeed, that would be my modal guess). But it may be that you can influence the relative development of understanding vs. building, in which case pushing on understanding has a predictable effect. For example, if people didn’t know what proofs or probabilities were, it isn’t out of the question that they could build deep belief nets by empirical experimentation. But I feel safe saying that understanding proof and probability helps you better reason about the behavior of extremely powerful deep belief nets.
I agree that the cases differ in many ways. But this distinction doesn’t seem to get at the important thing. To someone working on logic you would say “I don’t know whether deduction systems will be formalized in the future, but I know that agents will be able to reason. So this suggests to me that your particular approach for defining reasoning, via formalization, is unnecessary.” In some sense this is true—if I’m an early mathematician and I don’t do logic, someone else will—but it has relatively little bearing on whether logic is likely to be mathematically productive to work on. If the question is about impact rather than productivity as a research program, then see the discussion above.
OK, helpful. This makes more sense to me.
This reply would make more sense if I was saying that knowing how to overcome Lobian obstacles would never be necessary for building well-functioning AI. But I was making the weaker claim that either it would never be necessary OR it would be solved in the ordinary development of AI. So if someone is formalizing logic a long time ago with the aim of building thinking machines AND they thought that when thinking machines were built logic wouldn’t be formalized properly and the machines wouldn’t work, then I might have complained. But if they had said, “I’d like to build a thinking machine and I think that formalizing logic will help get us there, whether it is done by others or me. And maybe it will go a bit better or come a bit sooner if I get involved. So I’m working on it.” then I wouldn’t have had anything to say.
Anyway, I think we roughly understand each other on this thread of the conversation, so maybe there is no need to continue.
I don’t think it’s the highest-priority issue to think about, but my impression is that among the issues that Eliezer has identified as worth thinking about, it could be the one closest to being completely mathematically formalized, so it’s a good one to focus on for the purpose of getting mathematicians interested in MIRI.
I do appreciate arguments in favor of focusing your effort on tractable problems, even if they are not the most important problems to solve.
It’s certainly hard to answer the question, “Why is this the best project to work on within AI?” since it implicitly requires comparisons with all kinds of alternatives. It’s probably unreasonable to ask Eliezer to answer this question in a comment. However, it is reasonable to ask, “Why will this research help make machine intelligence develop in a way that will benefit humanity?” Most of the other questions in my comment are also relevant to that question.
I also question the importance of working on this problem now, but for a somewhat different reason.
My understanding is that Lobian issues make it impossible for a proof-based AI to decide to not immediately commit suicide, because it can’t prove that it won’t do something worse than nothing in the future. (Let’s say it will have the option to blow up Earth in the future. Since it can’t prove that its own proof system is consistent, it can’t prove that it won’t prove that blowing up Earth maximizes utility at that future time.) To me this problem looks more like a problem with making decisions based purely on proofs, and not much related to self-modification.
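To make the worry concrete, here is a hedged sketch of the argument above (my own formalization, not the paper's; the predicate Safe and the agent schema are illustrative):

```latex
% Suppose the agent takes action a only when its proof system T proves a is safe:
%   (A \text{ takes } a) \;\rightarrow\; T \vdash \mathrm{Safe}(a).
% For the agent to trust its future self acting on such proofs, it needs
%   \Box_T\, \mathrm{Safe}(a) \;\rightarrow\; \mathrm{Safe}(a)
% as a theorem of T for arbitrary a. But this is the reflection schema, and
% by L\"ob's theorem,
%   T \vdash (\Box_T \varphi \rightarrow \varphi)
%   \quad\Longrightarrow\quad T \vdash \varphi,
% so T can endorse the schema only for sentences it already proves outright.
```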
Using probabilities instead of proofs seems to eliminate the old obstructions, but it does leave a sequence of challenging problems (hence the work on probabilistic reflection). E.g., we’ve proved that there is an algorithm P using a halting oracle such that:
(Property R): Intuitively, we “almost” have a < P(X | a < P(X) < b) < b. Formally:
For each sentence X, each a, and each b: P(X ∧ a < P(X) < b) < b · P(a ≤ P(X) ≤ b).
For each sentence X, each a, and each b: P(X ∧ a ≤ P(X) ≤ b) > a · P(a < P(X) < b).
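As a sanity check on the algebra, here is a toy numeric illustration (my own made-up numbers, ignoring the open/closed endpoint mismatch) of why the two inequalities deliver the "almost" self-knowledge property:

```python
# Made-up values for a single sentence X and interval (a, b) that satisfy
# both halves of Property R. The point is just the algebra: dividing
# through by P(a < P(X) < b) sandwiches the conditional into (a, b).
a, b = 0.3, 0.6
p_event = 0.2    # P(a < P(X) < b); endpoint subtleties ignored
p_joint = 0.09   # P(X and a < P(X) < b)

assert p_joint < b * p_event    # first inequality of Property R
assert p_joint > a * p_event    # second inequality of Property R

conditional = p_joint / p_event # plays the role of P(X | a < P(X) < b)
assert a < conditional < b      # 0.45 lands inside (0.3, 0.6)
```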
But this took a great deal of work, and we can’t exhibit any algorithm that simultaneously satisfies Property R and has P(Property R) = 1. Do you think this is not an important question? It seems to me that we don’t yet know how many of the Godelian obstructions carry over to the probabilistic setting, and there are still real problems that will require ingenuity to resolve.
Putting the dangers of AI progress aside, we probably ought to first work on understanding logical uncertainty in general, and start with simpler problems. I find it unlikely that we can solve “probabilistic reflection” (or even correctly specify what the problem is) when we don’t yet know what principles allow us to say that P!=NP is more likely to be true than false. Do we even know that using probabilities is the right way to handle logical uncertainty? (People assumed that using probabilities is the right way to handle indexical uncertainty and that turned out to be wrong.)
We have coherent answers at least. See e.g. here for a formalism (and similarly the much older stuff by Gaifman, which didn’t get into priors). MIRI is working much more directly on this problem as well. Can you think of concrete open questions in that space? Basically we are just trying to develop the theory, but having simple concrete problems would surely be good. (I have a bucket of standard toy problems to resolve, and don’t have a good approach that handles all of them, but it’s pretty easy to hack together a solution to them so they don’t really count as open problems.)
I agree that AI progress is probably socially costly (highly positive for currently living folks, modestly negative for the average far future person). I think work with a theoretical bias is more likely to be helpful, and I don’t think it is very bad on net. Moreover, as long as safety-concerned folks are responsible for a very small share of all of the good AI work, the reputation impacts of doing good work seem very large compared to the social benefits or costs.
We don’t know that probabilities are the right way to handle logical uncertainty, nor that our problem statements are correct. I think that the kind of probabilistic reflection we are working on is fairly natural though.
I agree with both you and Nick that the strategic questions are very important, probably more important than the math. I don’t think that is inconsistent with getting the mathematical research program up and going. I would guess that all told the math will help on the strategy front via building the general credibility of AI safety concern (by 1. making it clear that there are concrete policy-relevant questions here, and 2. building status and credibility for safety-concerned communities and individuals), but even neglecting that I think it would still be worth it.
I read that paper before but it doesn’t say why its proposed way of handling logical uncertainty is the correct one, except that it “seem[s] to have some good properties”. It seems like we’re still at a stage where we don’t understand logical uncertainty at a deep level and can’t offer solutions based on fundamental principles, but are just trying out various ideas to see what sticks.
I’m not entirely clear on your position. Are you saying that theoretical AI work by safety-concerned folks has a net social cost, accounting for reputation impacts, or excluding reputation impacts?
Maybe I’m just being dense but I’m still not really getting why you think that (despite your past attempts to explain it to me in conversation). The current paper doesn’t seem to make a strong attempt to explain it either.
This is basically the same as the situation with respect to indexical probabilities. There are dominance arguments for betting odds etc. that don’t quite go through, but it seems like probabilities are still distinguished as a good best guess, and worth fleshing out. And if you accept probabilities prior specification is the clear next question.
I think it’s plausible there are net social costs, excluding reputational impacts, and would certainly prefer to think more about it first. But with reputational impacts it seems like the case is relatively clear (of course this is potentially self-serving reasoning), and there are similar gains in terms of making things seem more concrete etc.
Well, the first claim was that without the epsilons (i.e. with closed instead of open intervals) it would be exactly what you wanted (you would have an inner symbol that exactly corresponded to reality), and the second claim was that the epsilons aren’t so bad (e.g. because exact comparisons between floats are kind of silly anyway). Probably those could be more explicit in the writeup, but it would be helpful to know which steps seem shakiest.
Why do you say “exactly corresponded to reality”? You’d have an inner symbol which corresponded to the outer P, but P must be more like subjective credence than external reality, since in reality each logical statement is presumably either true or false, not a probabilistic mixture of both?
Intuitively, what I’d want is a “math intuition module” which, if it was looking at a mathematical expression denoting the beliefs that a copy of itself would have after running for a longer period of time or having more memory, would assign high probability that those beliefs would better correspond to reality than its own current beliefs. This would in turn license the AI using this MIM to build a more powerful version of itself, or just to believe that “think more” is generally a good idea aside from opportunity costs. I understand that you are not trying to directly build such an MIM, just to do a possibility proof. But your formalism looks very different from my intuitive requirement, and I don’t understand what your intuitive requirement might be.
P is intended to be like objective reality, exactly analogously with the predicate “True.” So we can adjoin P as a symbol and the reflection principle as an axiom schema, and thereby obtain a more expressive language. Depending on architecture, this also may increase the agent’s ability to formulate or reason about hypotheses.
Statements without P’s in them are indeed either true or false with probability 1. I agree it is a bit odd for statements with P in them to have probabilities, but I don’t see a strong argument that it shouldn’t happen. In particular, it seems irrelevant to anything meaningful we would like to do with a truth predicate. In subsequent versions of this result, the probabilities have been removed and the core topological considerations exposed directly.
The relationship between a truth predicate and the kind of reasoning you discuss (a MIM that believes its own computations are trustworthy) is that truth is useful or perhaps necessary for defining the kind of correspondence that you want the MIM to accept, about a general relationship between the algorithm it is running and what is “true”. So having a notion of “truth” seems like the first step.
Also, by attracting thinkers who can initially only be attracted by crisp technical problems, but as they get involved, will turn their substantial brainpower toward the strategic questions as well.
For three additional reasons for MIRI to focus on math for now, see the bullet points under “strategic research will consume a minority of our research budget in 2013” in MIRI’s Strategy for 2013.
Maybe we use the same principle that allows me to say “I guess I left my wallet at home” after I fail to find the wallet in the most likely places it could be, like my pockets. In other words, maybe we do Bayesian updating about the location of the “true” proof or disproof, as we check some a priori likely locations (attempted proofs and disproofs) and fail to find it there. This idea is still very vague, but looks promising to me because it doesn’t assume logical omniscience, unlike Abram’s and Benja’s ideas...
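A toy sketch of how such updating might look (entirely my own illustration, not a worked-out proposal; the "locations" are proof-search depths and all the priors are made up):

```python
# Hypotheses about a statement S: if S is true, its shortest proof sits at
# some search depth d with a prior favoring shallow depths; if S is false,
# we have no idea where a disproof would be, so we use a uniform prior.
N = 10        # search horizon
p_true = 0.5  # prior that S is true

proof_depth = [2.0 ** -d for d in range(1, N + 1)]
total = sum(proof_depth)
proof_depth = [w / total for w in proof_depth]   # shallow proofs more likely
disproof_depth = [1.0 / N] * N                   # uniform over depths

def posterior_after_failed_search(k):
    """P(S true) after searching depths 1..k for proofs and disproofs and
    finding neither -- analogous to checking your pockets for the wallet."""
    miss_if_true = sum(proof_depth[k:])      # the proof is deeper than k
    miss_if_false = sum(disproof_depth[k:])  # the disproof is deeper than k
    num = p_true * miss_if_true
    return num / (num + (1 - p_true) * miss_if_false)

# Failing to find the proof in the a priori likely places drags belief in S
# down, without ever assuming logical omniscience.
p5 = posterior_after_failed_search(5)
assert p5 < p_true
```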
I think I was implicitly assuming that you wouldn’t have an agent making decisions based purely on proofs.
Layman’s answer: we want to predict what some self-modifying AI will do, so we want a decision theory that can ask about the effect of adopting a new decision theory or related processes. (The paper’s issues could easily come up.) The one alternative I can see involves knowing in advance, as humans, how any modification that a super-intelligence could imagine will affect its goals. This seems like exactly what humans are bad at.
Speaking of, you say we “seem capable of overcoming putative Lobian obstacles to self-modification.” But when I think about CEV, this appears dubious. We can’t express exactly what ‘extrapolation’ means, save by imagining a utility function that may not exist. And without a better language for talking about goal stability, how would we even formalize that question? How could we formally ask if CEV is workable?
Another aspect of where I’m coming from is that there should be a high standard of proof for claiming that something is an important technical problem in future AI development because it seems so hard to predict what will and won’t be relevant for distant future technologies. My feeling is that paragraphs like this one, while relevant, don’t provide strong enough arguments to overcome the prior:
I would greatly appreciate further elaboration on why this is the right problem to be working on right now.
On the other hand, trying to solve many things that have a significant probability of being important so that you’re likely to eventually solve something that actually is important as a result, seems like a better idea than not doing anything because you can’t prove that any particular sub-problem is important.
I agree with this principle but think my claims are consistent with it. Doing stuff other than “technical problems in the future of AI” is an alternative worth considering.
I disagree. I would go in the other direction: if it seems relatively plausible that something is relevant, then I’m happy to have someone working on it. This is why I am happy to have MIRI work on this and related problems (to the extent that I even attended one of the Lob workshops), even if I do not personally think they are likely to be relevant.
ETA: My main reason for believing this is that, among scientific fields, willingness to entertain such research programs seems to correlate with the overall health of the field.
Are there other problems that you think it would be better to be working on now?
My question was more of a request for information than a challenge; Eliezer could say some things that would make doing mathematics on the Lob problem look more promising to me. It seems likely to me that I am missing some important aspect of the situation.
If I’m not missing anything major, then I think that, within the realm of AI risk, general strategic work addressing questions like, “Will the world’s elites navigate the creation of AI just fine?” would be preferable. That’s just one example; I do not mean to be claiming that this is the best thing to do. As I said in another comment, it’s very hard to argue that one course is optimal.
Thanks, that, at least for me, provides more context for your questions.
All that said, I do think that where MIRI has technical FAI questions to work on now, it is very reasonable to write up:
what the question is
why answering the question is important for making machine intelligence benefit humanity
why we shouldn’t expect the question to be answered by default by whoever makes AGI
In this particular case, I am asking for more info about the second two questions.
For convenience: “SuperBenefit” = increasing the probability that advanced machine intelligence has a positive impact.
I agree that MIRI has a lot left to explain with respect to questions #2 and #3, but it’s easier to explain those issues when we’ve explained #1 already, and we’ve only just begun to do that with AI forecasting, IEM, and Tiling Agents.
Presumably the relevance of AI forecasting and IEM to SuperBenefit is clear already?
In contrast, it does seem like the relevance of the Tiling Agents work to SuperBenefit is unclear to many people, and that more explanation is needed there. Now that Tiling Agents has been published, Eliezer has begun to explain its relevance to SuperBenefit in various places on this page, but it will take a lot of trial and error for us to discover what is and isn’t clear to people.
As for question #3, we’ve also only just begun to address that issue in detail.
So, MIRI still has a lot of explaining to do, and we’re working on it. But allow me a brief reminder that this gap isn’t unique to MIRI at all. Arguing for the cost effectiveness of any particular intervention given the overwhelming importance of the far future is extremely complicated, whether it be donating to AMF, doing AI risk strategy, spreading rationality, or something else.
E.g. if somebody accepts the overwhelming importance of the far future and is donating to AMF, they have roughly as much explaining to do as MIRI does, if not more.
Yes.
I basically agree with these comments, with a couple of qualifications.
I think it’s unique to MIRI in the sense that it makes sense for MIRI to be expected to explain how its research is going to accomplish its mission of making machine intelligence benefit humanity, whereas it doesn’t make sense for global health charities to be expected to explain why improving global health makes the far future go better. This means MIRI has an asymmetrically hard job, but I do think it’s a reasonable division of labor.
I think it makes sense for other people who care about the far future to evaluate how the other strategies you mentioned are expected to affect the far future, and try to find the best ones. There is an overwhelming amount of work to do.
Right. Very few charities are even claiming to be good for the far future. So there’s an asymmetry between MIRI and other charities w.r.t. responsibility to explain plausible effects on the far future. But among parties (including MIRI) who care principally about the far future and are trying to do something about it, there seems to be no such asymmetry — except for other reasons, e.g. asymmetry in resource use.
Yes.
I agree with this. Typically people justify their research on other grounds than this, for instance by identifying an obstacle to progress and showing how their approach might overcome it in a way that previously tried approaches were not able to. My impression is that one reason for doing this is that it is typically much easier to communicate along these lines, because it brings the discourse towards much more familiar technical questions while still correlating well with progress more generally.
Note that under this paradigm, the main thing MIRI needs to do to justify their work is to explain why the Lob obstacle is insufficiently addressed by other approaches (for instance, statistical learning theory). I would actually be very interested in understanding the relationship of statistics to the Lob obstacle, so look forward to any writeup that might exist in the future.
(These were some comments I had on a slightly earlier draft than this, so the page numbers and such might be slightly off.)
Page 4, footnote 8: I don’t think it’s true that only stronger systems can prove weaker systems consistent. It can happen that system A can prove system B consistent and A and B are incomparable, with neither stronger than the other. For example, Gentzen’s proof of the consistency of PA uses a system which is neither stronger nor weaker than PA.
Page 6: the hypotheses of the second incompleteness theorem are a little more restrictive than this (though not much, I think).
Page 11, problem c: I don’t understand the sentence containing “highly regular and compact formula.” Looks like there’s a typo somewhere.
I think there are more trivial counterexamples to the statement also. Take Robinson arithmetic and throw in an axiom asserting the consistency of PA. This system can trivially prove that PA is consistent, and is much weaker than PA.
Your post confused me for a moment, because Robinson + Con(PA) is of course not weaker than PA. It proves Con(PA), and PA doesn’t.
I see now that your point is that Robinson arithmetic is sufficiently weak compared to PA that PA should not be weaker than Robinson + Con(PA). Is there an obvious proof of this?
(For example, if Robinson + Con(PA) proved all theorems of PA, would this contradict the fact that PA is not finitely axiomatizable?)
Yes, finite axiomatizability is the obvious way of seeing this. You are correct that strictly speaking Robinson + Con(PA) is not weaker than PA, but rather is another incomparable example (which was the intended point). Note that there are other ways of seeing that Robinson + Con(PA) is not stronger than PA without using the finite axiomatization of PA, if one is willing to be slightly non-rigorous. For example, one can note that Robinson arithmetic has Z[x]+ as a model, so any theorem of Robinson + Con(PA) should be a theorem of Z[x]+ + Con(PA) (this step requires some details).
Ah, so my question was more along the line: does finite axiomatizability of a stronger (consistent) theory imply finite axiomatizability of the weaker theory? (This would of course imply Q + Con(PA) is not stronger than PA, Q being the usual symbol for Robinson arithmetic.)
On the model theoretic side, I think I can make something work, but it depends on distorting the specific definition of Con(PA) in a way that I’m not really happy about. In any case, I agree that your example is trivial to state and trivial to believe correct, but maybe it’s less trivial to prove correct.
Here’s what I was thinking:
Consider the predicate P(x) which says “if x != Sx, then x does not encode a PA-proof of 0=1”, and let ConMinus(PA) say for all x, P(x). Now, I think one could argue that ConMinus is a fair definition of (or substitute for?) Con, in that qualifying a formula with “if x != Sx” does not change its meaning in the standard model. Alternately, you could push this “if x != Sx” clause deeper, into basically every formula you would use to define the primitive recursive functions needed to talk about consistency in the first place, and you would not change the meanings of these formulas in the standard model. (I guess what I’m saying is that “the” formula asserting the consistency of PA is poorly specified.)
Also, PA is smart enough to prove that numbers are not their own successors, so PA believes in the equivalence of Con and ConMinus. In particular, PA does not prove ConMinus(PA), so PA is not stronger than Q + ConMinus(PA).
On the other hand, let M be the non-negative integers, together with one additional point omega. Put S(omega) = omega, put omega + anything = omega = anything + omega, and similarly for multiplication (except 0 * omega = omega * 0 = 0). I am pretty sure this is a model of Q.
Q is smart enough about its standard integers that it knows none of them encode PA-proofs of 0=1 (the “proves” predicate being Delta_0). Thus the model M satisfies Q + ConMinus(PA). But now we can see that Q + ConMinus(PA) is not stronger than PA, because PA proves “for all x, x is not equal to Sx”, yet this statement fails in a model of Q + ConMinus(PA).
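The model M described above is easy to spot-check computationally. Here is a hedged sketch (my own code, following the construction in the comment): the non-negative integers plus an extra point omega with S(omega) = omega and omega absorbing under + and * (except that 0 * omega = omega * 0 = 0). We check a few Robinson arithmetic axioms on sample elements and confirm that "x != Sx" fails at omega:

```python
OMEGA = "omega"  # the extra point adjoined to the non-negative integers

def S(x):
    return OMEGA if x == OMEGA else x + 1

def add(x, y):
    if x == OMEGA or y == OMEGA:
        return OMEGA
    return x + y

def mul(x, y):
    if x == 0 or y == 0:
        return 0
    if x == OMEGA or y == OMEGA:
        return OMEGA
    return x * y

domain = list(range(5)) + [OMEGA]

# Spot-check some Q axioms on sample elements (not an exhaustive verification).
for x in domain:
    assert S(x) != 0       # successor is never zero
    assert add(x, 0) == x  # x + 0 = x
    assert mul(x, 0) == 0  # x * 0 = 0
    for y in domain:
        assert add(x, S(y)) == S(add(x, y))       # x + Sy = S(x + y)
        assert mul(x, S(y)) == add(mul(x, y), x)  # x * Sy = x*y + x

# PA proves "for all x, x != Sx", but that fails in this model of Q:
assert S(OMEGA) == OMEGA
```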
If I’m not mistaken, NBG and ZFC are a counterexample to this: NBG is a conservative extension of ZFC (and therefore stronger than ZFC), but NBG is finitely axiomatizable while ZFC is not.
Yeah, the details of actually proving this are looking like they contain more subtleties than I expected, but I tentatively agree with your analysis. Here’s what may be another proof. Not only is PA not finitely axiomatizable, but any consistent extension of PA isn’t (I think this is true; the same proof that works for PA should go through here, but I haven’t checked the details). So PA + ConMinus(PA) still isn’t finitely axiomatizable. So now, pick any of the axioms created in the axiom schema of induction that are needed in PA + ConMinus(PA); Q + ConMinus(PA) can’t prove any of those (since it is strictly weaker than PA + ConMinus(PA)), but all of those statements are theorems of PA (since they are in fact axioms). Does this work?
Overall, this is requiring a lot more subtlety than I initially thought was involved which may make Qiaochu Yuan’s example a better one.
Well if we had this, we would know immediately that Q + Con(PA) is not an extension of PA (which is what we originally wanted), because it certainly is finitely axiomatizable. I know there are several proofs that PA is not finitely axiomatizable, but I have not read any of them, so can’t comment on the strengthened statement, though it sounds true.
Page 4 footnote 8 in the version you saw looks like footnote 9 in mine.
I don’t see how ‘proof-of-bottom → bottom’ makes a system inconsistent. This kind of formula appears all the time in Type Theory, and is interpreted as “not(proof-of-bottom)”.
The ‘principle of explosion’ says ‘forall A, bottom → A’. We can instantiate A to get ‘bottom → not(proof-of-bottom)’, then compose this with “proof-of-bottom → bottom” to get “proof-of-bottom → not(proof-of-bottom)”. This is an inconsistency iff we can show proof-of-bottom. If our system is consistent, we can’t construct a proof of bottom so it remains consistent. If our system is inconsistent then we can construct a proof of bottom and derive bottom, so our system remains inconsistent.
Have I misunderstood this footnote?
[EDIT: Ignore me for now; this is of course Lob’s theorem for bottom. I haven’t convinced myself of the existence of modal fixed points yet though]
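For reference, the theorem the bracketed edit alludes to, instantiated at bottom (standard statement; the derivability conditions on T are assumed):

```latex
% L\"ob's theorem: for any sentence A,
%   T \vdash (\Box_T A \rightarrow A) \quad\Longrightarrow\quad T \vdash A.
% With A := \bot this gives
%   T \vdash (\Box_T \bot \rightarrow \bot) \quad\Longrightarrow\quad T \vdash \bot,
% so a consistent T cannot prove ``proof-of-bottom \rightarrow bottom''
% about its own provability predicate, even though the implication is true
% (for consistent T) in the standard model.
```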
That is strictly correct, but not relevant for self-improving AI. You don’t want a parent AI that cannot prove everything that the child AI can prove. Maybe the footnote should be edited in this sense.
Well, if A can prove everything B can, except for con(A), and B can prove everything A can, except for con(B), then you’re relatively happy.
ETA: retracted (thanks to Joshua Z for pointing out the error).
I don’t think this can happen: since A has proven Con(B), it can now reason using system B for consistency purposes, and go from the fact that B proves Con(A) to A proving Con(A), which is bad.
Thanks for pointing this out. My mathematical logic is rusty.
Why frame this problem as about tiling/self-modification instead of planning/self-prediction? If you do the latter though, the problem looks more like an AGI (or AI capability) problem than an FAI (or AI safety) problem, which makes me wonder if it’s really a good idea to publicize the problem and invite more people to work on it publicly.
Regarding section 4.3 on probabilistic reflection, I didn’t get a good sense from the paper of how Christiano et al’s formalism relates to the concrete problem of AI self-modification or self-prediction. For example what are the functions P and p supposed to translate to in terms of an AI and its descendent or future self?
One argument in favor of this being relevant specifically to FAI is that evolution kludged up us, so there is no strong reason to think that AGI projects with an incomplete understanding of the problem space won’t eventually kludge up an AGI that is able to solve these problems itself and successfully navigate an intelligence explosion—and then paperclip the universe, since the incomplete understanding of the human researchers creating the seed AI wouldn’t suffice for giving the seed AI stable goals. I.e., solving this in some way looks probably necessary for reaching AI safety at all, but only possibly helpful for AI capability.
I’m not entirely unworried about that concern, but I’m less worried about it than about making AGI more interesting by doing interesting in-principle work on it, and I currently feel that even the latter danger is outweighed by the danger of not tackling the object-level problems early enough to actually make progress before it’s too late.
This sentence has kind of a confusing structure and I’m having trouble understanding the logic of your argument. Could you rewrite it? Also, part of my thinking, which I’m not sure if you’ve addressed, is that an AGI that fails the Lobian obstacle isn’t just unable to stably self-modify, it’s unable to do even the simplest kind of planning because it can’t predict that its future selves won’t do something crazy. A “successful” (ETA: meaning one that FOOMs) AGI project has to solve this planning/self-prediction problem somehow. Why wouldn’t that solution also apply to the self-modification problem?
Sorry for being confusing, and thanks for giving me a chance to try again! (I did write that comment too quickly due to lack of time.)
So, my point is, I think that there is very little reason to think that evolution somehow had to solve the Löbstacle in order to produce humans. We run into the Löbstacle when we try to use the standard foundations of mathematics (first-order logic + PA or ZFC) in the obvious way to make a self-modifying agent that will continue to follow a given goal after having gone through a very large number of self-modifications. We don’t currently have any framework not subject to this problem, and we need one if we want to build a Friendly seed AI. Evolution didn’t have to solve this problem. It’s true that evolution did have to solve the planning/self-prediction problem, but it didn’t have to solve it with extremely high reliability. I see very little reason to think that if we understood how evolution solved the problem it solved, we would then be really close to having a satisfactory Löbstacle-free decision theory to use in a Friendly seed AI—and thus, conversely, I see little reason to think that an AGI project must solve the Löbstacle in order to solve the planning/prediction problem as well as evolution did.
I can more easily conceive of the possibility (but I think it rather unlikely, too) that solving the Löbstacle is fundamentally necessary to build an agent that can go through millions of rewrites without running out of steam: perhaps without solving the Löbstacle, each rewrite step will have an independent probability of making the machine wirehead (for example), so an AGI doing no better than evolution will almost certainly wirehead during an intelligence explosion. But in this scenario, since evolution built us, an AGI project might build an AI that solves the planning/self-prediction problem as well as we do, and that AI might then go and solve the Löbstacle and go through a billion self-modifications and take over the world. (The human operators might intervene and un-wirehead it every 50,000 rewrites or so until it’s figured out a solution to the Löbstacle, for example.) So even in this scenario, the Löbstacle doesn’t seem a barrier to AI capability to me; but it is a barrier to FAI, because if it’s the AI that eventually solves the Löbstacle, the superintelligence down the line will have the values of the AI at the time it’s solved the problem. This was what I intended to say by saying that the AGI would “successfully navigate an intelligence explosion—and then paperclip the universe”.
(On the other hand, while I only think of the above as an outside possibility, I think there’s more than an outside possibility that a clean reflective decision theory could be helpful for an AGI project, even if I don’t think it’s a necessary prerequisite. So I’m not entirely unsympathetic to your concerns.)
Does the above help to clarify the argument I had in mind?
So you think that humans do not have a built-in solution to the Löbstacle, and you must also think we are capable of building an FAI that does have a built-in solution to the Löbstacle. That means an intelligence without a solution to the Löbstacle can produce another intelligence that shares its values and does have a solution to the Löbstacle. But then why is it necessary for us to solve this problem? (You said earlier “solving this in some way looks probably necessary for reaching AI safety at all, but only possibly helpful for AI capability.”) Why can’t we instead built an FAI without solving this problem, and depend on the FAI to solve the problem while it’s designing the next generation FAI?
Also earlier you said
I’ve been arguing with Eliezer and Paul about this recently, and thought that I should get the details of your views too. Have you been following the discussions under my most recent post?
Sorry for the long-delayed reply, Wei!
Yup.
I have two main reasons in mind. First, if you are willing to grant that (a) this is a problem that would require humans years of serial research to solve and (b) that it looks much easier to build this into an AI designed from scratch rather than bolting it on to an existing AI design that was created without taking these considerations into account, but you still think that (c) it would be a good plan to have the first-generation FAI solve this problem when building the next-generation FAI, then it seems that you need to assume that the FAI will be much better at AGI design than its human designers before it executes its first self-rewrite, since the human team would by assumption still need years to solve the problem at that point and the plan wouldn’t be particularly helpful if the first-generation FAI would need a similar amount of time or longer. But it seems unlikely to me that we first need to build ultraintelligent machines a la I.J. Good, far surpassing humans, before we can get an intelligence explosion: it seems to me that most of the probability mass should be in the required level of AGI research ability being ≤ the level of the human research team working on the AGI. I admit that one possible strategy could be to continue having humans improve the initial FAI until it is superintelligent and then ask it to write a successor from scratch, solving the Löbstacle in the process, but it doesn’t seem particularly likely that this is cheaper than solving the problem beforehand.
Second, if we followed this plan, when building the initial FAI we would be unable to use mathematical logic (or other tools sufficiently similar to be subject to the same issues) in a straightforward way when having it reason about its potential successor. This cuts off a large part of design-space that I’d naturally be looking at. Yes, if we can do it then it’s possible in principle to get an FAI to do it, but mimicking human reasoning doesn’t seem likely to me to be the easiest way to build a safe AGI.
I agree with you that relying on an FAI team to solve a large number of philosophical problems correctly seems dangerous, although I’m sympathetic to Eliezer’s criticism of your outside-view arguments—I essentially agree with your conclusions, but I think I use more inside-view reasoning to arrive at them (would need to think longer to tease this apart). I agree with Paul that something like CEV for philosophy in addition to values should probably be part of an FAI design. I agree with you that progress in metaphilosophy would be very valuable, but I do not have any concrete leads to follow. But I think that having good solutions to some of these problems is not unlikely to be helpful for FAI design (and more helpful to FAI than uFAI), so I still think that some amount of work allocated to these philosophical problems looks like a good thing; and I also think that working on these problems does on average reduce the probability of making a bad mistake even if we manage to have the FAI do philosophy itself and have it checked by “coherent extrapolated philosophy”.
You quoted my earlier comment that I think that making object-level progress is important enough that it seems a net positive despite making AGI research more interesting, but I don’t really feel that your post or the discussion below it contains much in the way of arguments about that—could you elaborate on the connection?
(I endorse essentially all of Benja’s reply above.)
Thanks, that’s very helpful. (I meant to write a longer reply but haven’t gotten around to it yet. Didn’t want you to feel ignored in the meantime.)
Please augment footnote 4 or otherwise provide a more complete summary of symbols.
I appreciate that some of these symbols have standard meanings in the relevant fields. But different subfields use these with subtle differences.
Note, for example, this list of logic symbols in which single-arrow → and double-arrow ⇒ are said to carry the same meaning, but you use them distinctly. (You also use horizontal line (like division) to indicate implication on page 7. I think you mean the same as single-arrow. Again, that’s standard notation, but it should be described along with the others.)
On page 3 you specify a special meaning for double turnstile ||- . I’d like to see an explanation of your meaning for the symbol, “an agent has cognitively concluded a belief,” instead of the entailment usually meant by the double turnstile, but in any case, please add it to the summary.
It’s good that you keep the paper technical, while using the footnotes to enlighten the rest of us. A more complete list of symbols would be helpful for that.
Congrats to Eliezer and Marcello on the writeup! It has helped me understand Benja’s “parametric polymorphism” idea better.
There’s a slightly different angle that worries me. What happens if you ask an AI to solve the AI reflection problem?
1) If an agent A_1 generates another agent A_0 by consequentialist reasoning, possibly using proofs in PA, then future descendants of A_0 also count as consequences. So at least A_0 should not face the problem of “telomere shortening”, because PA can see the possible consequences of “telomere shortening” already. But what will A_0 look like? That’s a mystery.
2) To figure out the answer to (1), it’s natural to try devising a toy problem where we could test different implementations of A_1. Benja made a good attempt, then Wei came up with an interesting quining solution to that. Eliezer has now formalized his objection to quining solutions as the “Vingean principle” (no, actually “naturalistic principle”, thx Eliezer), which is a really nice step. Now I just want a toy problem where we’re forced to apply that principle :-) Why such problems are hard to devise is another mystery.
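To make the toy-problem hunt a bit more concrete, here is a minimal sketch of the approval criterion under discussion: a parent agent approves constructing a successor only if it can verify, for every action the successor might emit, that the action leads to the goal or to doing nothing. All names here are hypothetical, and `verify_safe` is a trivial stand-in for proving a theorem in PA; this is an illustration of the criterion’s shape, not an implementation of anything in the paper.

```python
# Toy sketch of a "tiling" approval criterion (all names hypothetical).
# A parent agent approves a successor only if every action the successor
# might take is verified safe. Here "verified" is a trivial stand-in for
# "the parent proves in PA that executing the action implies GOAL or NULL".

GOAL_ACTIONS = {"achieve_goal", "do_nothing"}  # the NULL-or-GOAL criterion

def verify_safe(action: str) -> bool:
    """Stand-in for: parent proves 'executing `action` implies GOAL or NULL'."""
    return action in GOAL_ACTIONS

def approves(successor_actions: list[str]) -> bool:
    """Parent approves constructing the successor iff every possible
    action of the successor passes verification."""
    return all(verify_safe(a) for a in successor_actions)

print(approves(["achieve_goal", "do_nothing"]))    # True: safe successor
print(approves(["achieve_goal", "launch_nukes"]))  # False: unsafe successor
```

The Löbian trouble enters exactly where this sketch cheats: a real parent cannot check each action directly (the successor may face situations the parent cannot enumerate), so it must instead trust the successor’s own proofs, and “whatever my successor proves is true” is the step its proof system resists.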
(Quick note: Wei’s quining violates the naturalistic principle, not the Vingean principle. Wei’s actions were still inside quantifiers but had separate forms for self-modification and action. So did Benja’s original proposal in the Quirrell game, which Wei modified—I was surprised and impressed when Benja’s polymorphism approach carried over to a naturalistic system.)
Was it UDT1.1 (as a solution to this problem) that violates the Vingean principle?
Also, I’m wondering if Benja’s polymorphism approach solves the “can’t decide whether or not to commit suicide” problem that I described here. Your paper doesn’t seem to address this problem: since the criteria of action you use all talk about “NULL or GOAL” and suicide leads to NULL, an AI using your criterion of action has trouble deciding whether or not to commit suicide for an even more immediate reason. Do you have any ideas how your framework might be changed to allow this problem to be addressed?
As I remarked in that thread, there are many possible designs that violate the Vingean principle, AFAICT UDT 1.1 is one of them.
Suicide being permitted by the NULL option is a different issue from suicide being mandated by self-distrust. Benja’s TK gets rid of distrust of offspring. Work on reflective/naturalistic trust is ongoing.
Thanks! Corrected.
Page 14, Remarks. Typo:
This should be “T_0 can prove certain exact theorems which T_1 cannot”.
If ever I wanted to upvote something twice, it’s this.
As a non-expert fan of AI research, I simply wanted to mention that this and other recent papers seem to go a fair ways toward addressing one of Karnofsky’s criticisms of SI that I remember agreeing with:
Hopefully more interesting stuff will follow, even if I am not in a position to evaluate its validity.
Cool.
I’ll get my math team working on this right away, and we eagerly await more of these. ;)
EDIT: slight sarcasm on the “my math team”.
(Your math team?)
(Friends that are studying math for the purpose of solving FAI problems and parts of myself that can be similarly described.)
I am guessing nyan_sandwich is referring to a team that participates in the various nation-/world-wide mathematics competitions that exist. It’s an interesting way for math-talented college students and the like to exercise their mathematical abilities.
nope. Unfortunately nothing so formal.
I use “my math team” the same way people refer to lawyer-friends as “my legal team”; slightly sarcastically.
This post does answer some questions I had regarding the relevance of mathematical proof to AI safety, and the motivations behind using mathematical proof in the first place. I don’t believe I’ve seen this bit before:
...I’ve actually said it many, many times before but there’s a lot of people out there depicting that particular straw idea (e.g. Mark Waser).
I don’t read a lot of other people’s stuff about your ideas (e.g. Mark Waser) but I have read most of the things you’ve published. I’m surprised to hear you’ve said it many times before.
For the record, I personally think this statement is too strong. Instead, I would say something more like what Paul said:
Probably a stupid question but… on the issue of goal stability more generally, might the Lyapunov stability theorems (see sec. 3) be of use? For more mathematical detail, see here.
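For readers unfamiliar with the reference: a Lyapunov argument certifies stability by exhibiting a function V that is positive away from equilibrium and strictly decreases along every trajectory of the system. A numeric sketch on a toy linear system (everything below is illustrative, not taken from the paper, and says nothing about whether the analogy to goal stability actually goes through):

```python
# Illustrative numeric check of a candidate Lyapunov function for the
# stable linear system x' = -x (equilibrium at 0). The candidate
# V(x) = x^2 is positive away from 0 and should strictly decrease
# along every simulated trajectory.

def V(x: float) -> float:
    return x * x

def step(x: float, dt: float = 0.01) -> float:
    return x + dt * (-x)  # forward-Euler step of x' = -x

def decreases_along_trajectory(x0: float, steps: int = 1000) -> bool:
    x = x0
    for _ in range(steps):
        nxt = step(x)
        if V(nxt) >= V(x):  # V failed to strictly decrease
            return False
        x = nxt
    return True

print(all(decreases_along_trajectory(x0) for x0 in (-2.0, 0.5, 3.0)))  # True
```

The open question raised above is whether anything plays the role of V for a goal system under self-modification; the sketch only shows what the certificate looks like in the classical dynamical-systems setting.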
I like to think of this paper as being at the philosophical edge of AI research, which doesn’t yet have its own subfield, a la a quote from Boden in Bobrow & Hayes (1985):
I like the idea of a “philosophical edge”, but what it brings to mind is more the Dennett quote (don’t remember whether the idea originates with him, would expect that it doesn’t but don’t know) to the effect that philosophy (as opposed to science) is what you do when you haven’t yet figured out what the right questions to ask are. (Not 100% right match for the tiling paper, but going in the right direction.)
On the other hand, I never liked the famous “it stops being called AI as soon as people start using it” meme you’re quoting, because that always struck me as a completely reasonable position to take. Surely pattern recognition, image processing and rule-based systems aren’t obviously huge steps towards passing the Turing test, and although I’m willing to call narrow AI “narrow artificial intelligence” because I see no reason to embark on the fool’s errand of trying to change that terminology, I can’t really blame people for measuring “AI” research against the standard of general intelligence. And yes, it’s quite possible that pattern recognition, image processing and rule-based systems are necessary baby steps on the road to AGI, but if someone in their best judgment thinks that they’re probably not, I don’t see why they’re obviously wrong. And just because your research into alchemy led to important insights into chemistry, you don’t get to call all chemistry research “alchemy” (with the obvious analogy caveat that the metal-to-gold-by-magic-symbols goal of alchemy is bunk and we have an existence proof of AGI).
For another take on why some people expect this kind of work on the Lobian obstacle to be useful for FAI, see this interview with Benja Fallenstein.
I read the first two pages of the publication, and wasn’t convinced that the problem that the paper attempts to address has non-negligible relevance to AI safety. I would be interested in seeing you spell out your thoughts on this point in detail.
[Edit: Thinking this over a little bit, I realize that maybe my comment isn’t so helpful in isolation. I corresponded with Luke about this, and would be happy to flesh out my thinking either publicly or in private correspondence.]
The LW post may address some of your concerns. The idea here is that we need a tiling decision criterion, and the paper isn’t supposed to be an AI design, it’s supposed to get us a little conceptually closer to a tiling decision criterion. If you don’t understand why a tiling decision criterion is a good thing in a self-improving AI which is supposed to have a stable goal system, then I’m not quite sure what issue needs addressing.
Thanks for your courtesy, and again, sorry for not being more specific in my original comment.
Yes, I’m questioning why a self-improving AI which is intended to have a stable goal system needs a tiling decision criterion. In your publication, you wrote
I don’t see why the model of the sequence of agents is a good operationalization. My intuition is that
A self modifying AI would modify itself by modifying its modules one by one.
It would reconstruct a given module whole-cloth, rather than doing so by incrementally changing the module in small steps.
To elaborate, and for concreteness, I’ll comment on
I haven’t read the technical portions of the paper, but my surface impression is that the operationalization in the paper is analogous to modifying your arms by successively shaving slivers of tissue off of them, and grafting slivers of tissue onto them, with a view toward making them really long. Another way to go would be to grow the long arms in a lab, chop off your current arms, and then graft the newly created long arms onto yourself. In the context of self-modifying AIs, the latter possibility seems to me to be significantly more likely than the former possibility.
Is my surface impression of the operationalization right? If so, what do you think about the points that I raise in the previous paragraphs?
Jonah, some self-modifications will potentially be large, but others might be smaller. More importantly we don’t want each self-modification to involve wrenching changes like altering the mathematics you believe in, or even worse, your goals. Most of the core idea in this paper is to prevent those kinds of drastic or deleterious changes from being forced by a self-modification.
But it’s also possible that there’ll be many gains from small self-modifications, and it would be nicer not to need a special case for those, and for this it is good to have (in theoretical principle) a highly regular bit of cognition/verification that needs to be done for the change (e.g. for logical agents the proof of a certain theorem) so that small local changes only call for small bits of the verification to be reconsidered.
Another way of looking at it is that we’re trying to have the AI be as free as possible to self-modify while still knowing that it’s sane and stable, and the more overhead is forced or the more small changes are ruled out, the less free it is.
Thanks for engaging.
I’m very sympathetic to this in principle, but don’t see why there would be danger of these things in practice.
Humans constantly perform small self-modifications, and this doesn’t cause serious problems. People’s goals do change, but not drastically, and people who are determined can generally keep their goals pretty close to their original goals. Why do you think that AI would be different?
To ensure that one gets a Friendly AI, it suffices to start with good goal system, and to ensure that the goal system remains pretty stable over time. It’s not necessary that the AI be as free as possible.
You might argue that a limited AI wouldn’t be able to realize as good a future as one without limitations.
But if this is the concern, why not work to build a limited AI that can itself solve the problems about having a stable goal system under small modifications? Or, if it’s not possible to get a superhuman AI subject to such limitations, why not build a subhuman AI and then work in conjunction with it to build Friendly AI that’s as free as possible?
Many things in AI that look like they ought to be easy have hidden gotchas which only turn up once you start trying to code them, and we can make a start on exposing some of these gotchas by figuring out how to do things using unbounded computing power (albeit this is not a reliable way of exposing all gotchas, especially in the hands of somebody who prefers to hide difficulties, or even someone who makes a mistake about how a mathematical object behaves, but it sure beats leaving everything up to verbal arguments).
Human beings don’t make billions of sequential self-modifications, so they’re not existence proofs that human-quality reasoning is good enough for that.
I’m not sure how to go about convincing you that stable-goals self-modification is not something which can be taken for granted to the point that there is no need to try to make the concepts crisp and lay down mathematical foundations. If this is a widespread reaction beyond yourself then it might not be too hard to get a quote from Peter Norvig or a similar mainstream authority that, “No, actually, you can’t take that sort of thing for granted, and while what MIRI’s doing is incredibly preliminary, just leaving this in a state of verbal argument is probably not a good idea.”
Depending on your math level, reading Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference by Judea Pearl might present you with a crisper idea of why it can be a good idea to formalize certain types of AI problems in general, and it would be a life-enriching experience, but I anticipate that’s more effort than you’d want to put into this exact point.
I don’t disagree (though I think that I’m less confident on this point than you are).
Why do you think that an AI would need to make billions of sequential self-modifications when humans don’t need to?
I agree that it can’t be taken for granted. My questions are about the particular operationalization of a self-modifying AI that you use in your publication. Why do you think that the particular operationalization is going to be related to the sorts of AIs that people might build in practice?
The paper is meant to be interpreted within an agenda of “Begin tackling the conceptual challenge of describing a stably self-reproducing decision criterion by inventing a simple formalism and confronting a crisp difficulty”; not as “We think this Godelian difficulty will block AI”, nor “This formalism would be good for an actual AI”, nor “A bounded probabilistic self-modifying agent would be like this, only scaled up and with some probabilistic and bounded parts tacked on”. If that’s not what you meant, please clarify.
Ok, that is what I meant, so your comment has helped me better understand your position.
Why do you think that
is cost-effective relative to other options on the table?
For “other options on the table,” I have in mind things such as spreading rationality, building the human capital of people who care about global welfare, increasing the uptake of important information into the scientific community, and building transferable skills and connections for later use.
Personally, I feel like that kind of metawork is very important, but that somebody should also be doing something that isn’t just metawork. If there’s nobody making concrete progress on the actual problem that we’re supposed to be solving, there’s a major risk of the whole thing becoming a lost purpose, as well as of potentially-interested people wandering off to somewhere where they can actually do something that feels more real.
From inside MIRI, I’ve been able to feel this one viscerally as genius-level people come to me and say “Wow, this has really opened my eyes. Where do I get started?” and (until now) I’ve had to reply “Sorry, we haven’t written down our technical research agenda anywhere” and so they go back to machine learning or finance or whatever because no, they aren’t going to learn 10 different fields and become hyper-interdisciplinary philosophers working on important but slippery meta stuff like Bostrom and Shulman.
Yes, that’s a large de-facto part of my reasoning.
I think that in addition to this being true, it is also how it looks from the outside—at least, it’s looked that way to me, and I imagine many others who have been concerned about SI focusing on rationality and fanfiction are coming from a similar perspective. It may be the case that without the object-level benefits, the boost to MIRI’s credibility from being seen to work on the actual technical problem wouldn’t justify the expense of doing so, but whether or not it would be enough to justify the investment by itself, I think it’s a really significant consideration.
[ETA: Of course, in the counterfactual where working on the object problem actually isn’t that important, you could try to explain this to people and maybe that would work. But since I think that it is actually important, I don’t particularly expect that option to be available.]
Yes. I’ve had plenty of conversations with people who were unimpressed with MIRI, in part because the organization looked like it was doing nothing but idle philosophy. (Of course, whether that was the true rejection of the skeptics in question is another matter.)
I understand your position, but believe that your concerns are unwarranted, though I don’t think that this is obvious.
If I gave you a list of people who in fact expressed interest but then, when there were no technical problems for them to work on, “wandered off to somewhere where they can actually do something that feels more real,” would you change your mind? (I may not be able to produce such a list, because I wasn’t writing down people’s names as they wandered away, but I might be able to reconstruct it.)
Sounds like me two years ago, before I committed to finishing my doctorate. Oops.
Well, I’m not sure the “oops” is justified, given that two years ago, I really couldn’t help you contribute to a MIRI technical research program, since it did not exist.
No, the oops is on me for not realizing how shallow “working on something that feels more real” would feel after the novelty of being able to explain what I work on to laypeople wore off.
Ah, I see.
I don’t doubt you: I have different reasons for believing Kaj’s concerns to be unwarranted:
It’s not clear to me that offering people problems in mathematical logic is a good way to get people to work on Friendly AI problems. I think that the mathematical logic work is pretty far removed from the sort of work that will be needed for friendliness.
I believe that people who are interested in AI safety will not forget about AI safety entirely, independently of whether they have good problems to work on now.
I believe that people outside of MIRI will organically begin to work on AI safety without MIRI’s advocacy when AI is temporally closer.
Mathematical logic problems are FAI problems. How are we going to build something self-improving that can reason correctly without having a theory of what “reasoning correctly” (ie logic) even looks like?
Based on what?
I’ll admit I don’t know that I would settle on mathematical logic as an important area of work, but EY being quite smart, working on this for ~10 years, and being able to convince quite a few people who are in a position to judge on this is good confirmation of the plausible idea that work in reflectivity of formal systems is a good place to be.
If you do have some domain knowledge that I don’t have that makes stable reflectivity seem less important and puts you in a position to disagree with an expert (EY), please share.
People can get caught in other things. Maybe without something to work on now, they get deep into something else and build their skills in that and then the switching costs are too high to justify it. Mind you there is a steady stream of smart people, but opportunity costs.
Also, MIRI may be burning reputation capital by postponing actual work such that there may be less interested folks in the future. This could go either way, but it’s a risk that should be accounted for.
(I for one (as a donor and wannabe contributor) appreciate that MIRI is getting these (important-looking) problems available to the rest of us now)
How will they tell? What if it happens too fast? What if the AI designs that are furthest along are incompatible with stable reflection? Hence MIRI working on strategic questions like “how close are we, how much warning can we expect” (Intelligence Explosion Microecon), and “What fundamental architectures are even compatible with friendliness” (this Lob stuff).
See my responses to paper-machine on this thread for (some reasons) why I’m questioning the relevance of mathematical logic.
I don’t see this as any more relevant than Penrose’s views on consciousness, which I recently discussed. Yes, there are multiple people who are convinced, but there may be spurious correlations which are collectively driving their interests. Some that come to mind are
Subject-level impressiveness of Eliezer.
Working on these problems offering people a sense of community.
Being interested in existential risk reduction and not seeing any other good options on the table for reducing existential risk.
Intellectual interestingness of the problems.
Also, I find Penrose more impressive than all of the involved people combined. (This is not intended as a slight – rather, the situation is that Penrose’s accomplishments are amazing.)
The idea isn’t plausible to me, again, for reasons that I give in my responses to paper-machine (among others).
No, my reasons are at a meta-level rather than an object level, just as most members of the Less Wrong community (rightly) believe that Penrose’s views on consciousness are very likely wrong without having read his arguments in detail.
This is possible, but I don’t think that it’s a major concern.
Note this as a potential source of status quo bias.
Place yourself in the shoes of the creators of the early search engines, online book store, and social network websites. If you were in their positions, would you feel justified in concluding “if we don’t do it then no one else will”? If not, why do you think that AI safety will be any different?
I agree that it’s conceivable that it could happen too fast, but I believe that there’s strong evidence that it won’t happen within the next 20 years, and 20 years is a long time for people to become interested in AI safety.
People keep saying that. I don’t understand why “planning fallacy” is not a sufficient reply. See also my view on why we’re still alive.
I agree that my view is not a priori true and requires further argumentation.
Why?
I say something about this here.
Okay; why specifically isn’t mathematical logic the right domain?
EDIT: Or, to put it another way, there’s nothing in the linked comment about mathematical logic.
The question to my mind is why is mathematical logic the right domain? Why not game theory, or solid state physics, or neural networks? I don’t see any reason to privilege mathematical logic – a priori it seems like a non sequitur to me. The only reason that I give some weight to the possibility that it’s relevant is that other people believe that it is.
AI’s do Reasoning. If you can’t see the relevance of logic to reasoning, I can’t help.
Further, do you have some other domain of inquiry that has higher expected return? I’ve seen a lot of stated meta-level skepticism, but no strong arguments either on the meta level (why should MIRI be as uncertain as you) or the object level (are there arguments against studying logic, or arguments for doing something else).
Now I imagine it seems to you that MIRI is privileging the mathematical logic hypothesis, but as above, it looks to me rather obviously relevant such that it would take some evidence against it to put me in your epistemic position.
(Though strictly speaking given strong enough evidence against MIRI’s strategy I would go more towards “I don’t know what’s going on here, everything is confusing” rather than your (I assume) “There’s no good reason one way or the other”)
You seem to be taking a position of normative ignorance (I don’t know and neither can you), in what looks like the face of plenty of information. I would expect rational updating exposed to such information to yield a strong position one way or the other or epistemic panic, not calm (normative!) ignorance.
Note that to take a position of normative uncertainty you have to believe that not only have you seen no evidence, there is no evidence. I’m seeing normative uncertainty and no strong reason to occupy a position of normative uncertainty, so I’m confused.
Humans do reasoning without mathematical logic. I don’t know why anyone would think that you need mathematical logic to do reasoning.
See each part of my comment here as well as my response to Kawoobma here.
I want to hedge because I find some of the people involved in MIRI’s Friendly AI research to be impressive, but putting that aside, I think that the likelihood of the research being useful for AI safety is vanishingly small, at the level of the probability of a random conjunctive statement of similar length being true.
Right. Humans do reasoning, but don’t really understand reasoning. Since ancient times, when people try to understand something they try to formalize it, hence the study of logic.
If we want to build something that can reason we have to understand reasoning or we basically won’t know what we are getting. We can’t just say “humans reason based on some ad-hoc kludgy nonformal system” and then magically extract an AI design from that. We need to build something we can understand or it won’t work, and right now, understanding reasoning in the abstract means logic and its extensions.
It’s a double need, though, because not only do we need to understand reasoning, self-improvement means the created thing needs to understand reasoning. Right now we don’t have a formal theory of reasoning that can handle understanding its own reasoning without losing power. So we need to solve that.
There is no viable alternate path.
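The specific “losing power” failure being alluded to is Löb’s theorem, which is a standard result and can be stated in one line (T here is any recursively axiomatizable theory extending PA, with provability predicate □_T):

```latex
% Löb's theorem: a theory that fully trusts its own proofs of P already proves P.
\text{If } T \vdash \Box_T \ulcorner P \urcorner \rightarrow P,
\quad \text{then } T \vdash P.
```

Taking P to be an arbitrary falsehood shows that T cannot assert “whatever I prove is true” as a schema without becoming inconsistent; this is why a parent agent reasoning in T cannot blanket-trust a successor that also proves its theorems in T, which is the obstacle the paper’s technical constructions work around.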
Note that this is different from what you were saying before, and that commenting along the lines of “AI’s do Reasoning. If you can’t see the relevance of logic to reasoning, I can’t help” without further explanation doesn’t adhere to the principle of charity.
I’m very familiar with the argument that you’re making, and have discussed it with dozens of people. The reason why I didn’t respond to the argument before you made it is because I wanted to isolate our core point(s) of disagreement, rather than making presumptions. The same holds for my points below.
This argument has the form “If we want to build something that does X, we have to understand X, or we won’t know what we’re getting.” But this isn’t true in full generality. For example, we can build a window shade without knowing how the window shade blocks light, and still know that we’ll be getting something that blocks light. Why do you think that AI will be different?
Why do you think that it’s at all viable to create an AI based on a formal system? (For the moment putting aside safety considerations.)
As to the rest of your comment — returning to my “Chinese economy” remarks — the Chinese economy is a recursively self-improving system with the “goal” of maximizing GDP. It could be that there’s goal drift, and that the Chinese economy starts optimizing for something random. But I think that the Chinese economy does a pretty good job of keeping this “goal” intact, and that it’s been doing a better and better job over time. Why do you think that it’s harder to ensure that an AI keeps its goal intact than it is to ensure that the Chinese economy keeps its “goal” intact?
AIs have to come to conclusions about the state of the world, where “world” also includes their own being. Model theory is the field that deals with such things formally.
These could be relevant, but it seems to me that “mind of an AI” is an emergent phenomenon of the underlying solid state physics, where “emergent” here means “technically explained by, but intractable to study as such.” Game theory and model theory are intrinsically linked at the hip, and no comment on neural networks.
But the most intelligent beings that we know of are humans, and they don’t use mathematical logic.
Did humans have another choice in inventing the integers? (No. The theory of integers has only one model, up to isomorphism and cardinality.) In general, the ontology a mind creates is still under the aegis of mathematical logic, even if that mind didn’t use mathematical logic to invent it.
Sure, but that’s only one perspective. You can say that it’s under the aegis of particle physics, or chemistry, or neurobiology, or evolutionary psychology, or other things that I’m not thinking of. Why single out mathematical logic?
Going back to humans, getting an explanation of minds out of any of these areas requires computational resources that don’t currently exist. (In the case of particle physics, one might rather say “cannot exist.”)
Because we can prove theorems that will apply to whatever ontology AIs end up dreaming up. Unreasonable effectiveness of mathematics, and all that. But now I’m just repeating myself.
I’m puzzled by your remark. It sounds like a fully general argument. One could equally well say that one should use mathematical logic to build a successful marriage, or fly an airplane, or create a political speech. Would you say this? If not, why do you think that studying mathematical logic is the best way to approach AI safety in particular?
No, a fully general argument is something like “well, that’s just one perspective.” Mathematical logic will not tell you anything about marriage, other than the fact that it is a relation of variable arity (being kind to the polyamorists for the moment).
I have no idea why a reasonable person would say any of these things.
I’d call it the best currently believed way with a chance of developing something actionable without probably requiring more computational power than a matryoshka brain. That’s because it’s the formal study of models and theories in general. Unless you’re willing to argue that AIs will have neither cognitive feature? That’s kind of rhetorical, though—I’m growing tired.
Given that the current Lob paper is non-constructive (invoking the axiom of choice) and hence is about as uncomputable as possible, I don’t understand why you think mathematical logic will help with computational concerns.
The paper on probabilistic reflection in logic is non-constructive, but that’s only sec. 4.3 of the Lob paper. Nothing non-constructive about T-n or TK.
I believe one of the goals of this particular avenue of research is to make this result constructive. Also, he was talking about the study of mathematical logic in general, not just this paper.
I have little patience for people who believe invoking the axiom of choice in a proof makes the resulting theorem useless.
That was rather rude. I certainly don’t claim that proofs involving choice are useless, merely that they don’t address the particular criterion of computational feasibility.
What do you mean by “something actionable” ?
That sounds like a very long conversation if we’re supposed to be giving quantitative estimates on everything. The qualitative version is just that this sort of thing can take a long time, may not parallelize easily, and can potentially be partially factored out to academia, and so it is wise to start work on it as soon as you’ve got enough revenue to support even a small team, so long as you can continue to scale your funding while that’s happening.
This reply takes for granted that all astronomical benefits bottleneck through a self-improving AI at some point.
Thanks for clarifying your position.
My understanding based on what you say is that the research in your paper is intended to spearhead a field of research, rather than to create something that will be directly used for friendliness in the first AI. Is this right?
If so, our differences are about the sociology of the scientific, technological and political infrastructure rather than about object level considerations having to do with AI.
Sounds about right. You might mean a different thing from “spearhead a field of research” than I do, my phrasing would’ve been “Start working on the goddamned problem.”
From your other comments I suspect that you have a rather different visualization of object-level considerations to do with AI and this is relevant to your disagreement.
Ok. I think that MIRI could communicate more clearly by highlighting this. My previous understanding had been that MIRI staff think that by default, one should expect to need to solve the Lob problem in order to build a Friendly AI. Is there anything in the public domain that would have suggested otherwise to me? If not, I’d suggest writing this up and highlighting it.
AFAIK, the position is still “need to ‘solve’ Lob to get FAI”, where ‘solve’ means find a way to build something that doesn’t have that problem, given that all the obvious formalisms do have such problems. Did EY suggest otherwise?
See my response to EY here.
By default, if you can build a Friendly AI you can solve the Lob problem. That working on the Lob Problem gets you closer to being able to build FAI is neither obvious nor certain, but everything has to start somewhere...
EDIT: Moved the rest of this reply to a new top-level comment because it seemed important and I didn’t want it buried.
http://lesswrong.com/lw/hmt/tiling_agents_for_selfmodifying_ai_opfai_2/#943i
For readers who want to read more about this point, see FAI Research as Effective Altruism.
BTW, I spent a large fraction of the first few months of 2013 weighing FAI research vs. other options before arriving at MIRI’s 2013 strategy (which focuses heavily on FAI research). So it’s not as though I think FAI research is obviously the superior path, and it’s also not as though we haven’t thought through all these different options, and gotten feedback from dozens of people about those options, and so on.
Also note that MIRI did, in fact, decide to focus on (1) spreading rationality, and (2) building a community of people who care about rationality, the far future, and x-risk, before turning its head to FAI research: see (in chronological order) the Singularity Summit, Less Wrong and CFAR.
But the question of which interventions are most cost effective (given astronomical waste) is a huge and difficult topic, one that will require thousands of hours to examine properly. Building on Beckstead and Bostrom, I’ve tried to begin that examination here. Before jumping over to that topic, I wonder: do you now largely accept the case Eliezer made for this latest paper as an important first step on an important sub-problem of the Friendly AI problem? And if not, why not?
My comments were addressed at Eliezer’s paper specifically, rather than MIRI’s general strategy, or your own views.
Sure – what I’m thinking about is cost-effectiveness at the margin.
Based on Eliezer’s recent comments, my impression is that Eliezer is not making such a case, and is rather making a case for the paper being of sociological/motivational value. Is your understanding different?
No, that’s not what I’ve been saying at all.
I’m sorry if this seems rude in some sense, but I need to inquire after your domain knowledge at this point. What is your level of mathematical literacy and do you have any previous acquaintance with AI problems? It may be that, if we’re to proceed on this disagreement, MIRI should try to get an eminent authority in the field to briefly confirm basic, widespread, and correct ideas about the relevance of doing math to AI, rather than us trying to convince you of that via object-level arguments that might not be making any sense to you.
By ‘the relevance of math to AI’ I don’t mean mathematical logic, I mean the relevance of trying to reduce an intuitive concept to a crisp form. In this case, like it says in the paper and like it says in the LW post, FOL is being used not because it’s an appropriate representational fit to the environment… though as I write this, I realize that may sound like random jargon on your end… but because FOL has a lot of standard machinery for self-reflection of which we could then take advantage, like the notion of Godel numbering or ZF proving that every model entails every tautology… which probably doesn’t mean anything to you either. But then I’m not sure how to proceed; if something can’t be settled by object-level arguments then we probably have to find an authority trusted by you, who knows about the (straightforward, common) idea of ‘crispness is relevant to AI’ and can quickly skim the paper and confirm ‘this work crispifies something about self-modification that wasn’t as crisp before’ and testify that to you. This sounds like a fair bit of work, but I expect we’ll be trying to get some large names to skim the paper anyway, albeit possibly not the Early Draft for that.
Quick Googling suggests someone named “Jonah Sinick” is a mathematician in number theory. It appears to be the same person.
I really wish Jonah had mentioned that some number of comments ago; there are a lot of arguments I don’t even try to use unless I know I’m talking to someone mathematically literate.
It’s mentioned explicitly at the beginning of his post Mathematicians and the Prevention of Recessions, strongly implied in The Paucity of Elites Online, and the website listed under his username and karma score is http://www.mathisbeauty.org.
Ok, I look forward to better understanding :-)
I have a PhD in pure math, I know the basic theory of computation and of computational complexity, but I don’t have deep knowledge of these domains, and I have no acquaintance with AI problems.
Yes, this could be what’s most efficient. But my sense is that our disagreement is at a non-technical level rather than at a technical level.
My interpretation of
was that you were asserting only very weak confidence in the relevance of the paper to AI safety, and that you were saying “Our purpose in writing this was to do something that could conceivably have something to do with AI safety, so that people take notice and start doing more work on AI safety.” Thinking it over, I realize that you might have meant “We believe that this paper is an important first step on a technical level.” Can you clarify here?
If the latter interpretation is right, I’d recur to my question about why the operationalization is a good one, which I feel that you still haven’t addressed, and which I see as crucial.
...
Do you not see that what Luke wrote was a direct response to your question?
There are really two parts to the justification for working on this paper: 1) Direct FAI research is a good thing to do now. 2) This is a good problem to work on within FAI research. Luke’s comment gives context explaining why MIRI is focusing on direct FAI research, in support of 1. And it’s clear from what you list as other options that you weren’t asking about 2.
It sounds like what you want is for this problem to be compared on its own to every other possible intervention. In theory that would be the rational thing to do to ensure you were always doing the most cost-effective work on the margin. But that only makes sense if it’s computationally practical to do that evaluation at every step.
What MIRI has chosen to do instead is to invest some time up front coming up with a strategic plan, and then follow through on that. This seem entirely reasonable to me.
If the probability is too small, then it isn’t worth it. The activities that I mention plausibly reduce astronomical waste to a nontrivial degree. Arguing that you can do better than them requires an argument that establishes the expected impact of MIRI Friendly AI research on AI safety above a nontrivial threshold.
Which question?
Sure, I acknowledge this.
I don’t think that it’s computationally intractable to come up with better alternatives. Indeed, I think that there are a number of concrete alternatives that are better.
I wasn’t disputing this. I was questioning the relevance of MIRI’s current research to AI safety, not saying that MIRI’s decision process is unreasonable.
The one I quoted: “Why do you think that … is cost-effective relative to other options on the table?”
Yes, you have a valid question about whether this Lob problem is relevant to AI safety.
What I found frustrating as a reader was that you asked why Eliezer was focusing on this problem as opposed to other options such as spreading rationality, building human capital, etc. Then when Luke responded with an explanation that MIRI had chosen to focus on FAI research, rather than those other types of work, you say, no I’m not asking about MIRI’s strategy or Luke’s views, I’m asking about this paper. But the reason Eliezer is working on this paper is because of MIRI’s strategy!
So that just struck me as sort of rude and/or missing the point of what Luke was trying to tell you. My apologies if I’ve been unnecessarily uncharitable in interpreting your comments.
I read Luke’s comment differently, based on the preliminary “BTW.” My interpretation was that his purpose in making the comment was to give a tangentially related contextual remark rather than to answer my question. (I wasn’t at all bothered by this – I’m just explaining why I didn’t respond to it as if it were intended to address my question.)
Ah, thanks for the clarification.
The way I’m using these words, my “this latest paper as an important first step on an important sub-problem of the Friendly AI problem” is equivalent to Eliezer’s “begin tackling the conceptual challenge of describing a stably self-reproducing decision criterion by inventing a simple formalism and confronting a crisp difficulty.”
Ok. I disagree that the paper is an important first step.
Because Eliezer is making an appeal based on psychological and sociological considerations, spelling out my reasoning requires discussion of what sorts of efforts are likely to impact the scientific community, and whether one can expect such research to occur by default. Discussing these requires discussion of psychology, sociology and economics, partly as related to whether the world’s elites will navigate the creation of AI just fine.
I’ve described a little bit of my reasoning, and will be elaborating on it in detail in future posts.
I look forward to it! Our models of how the scientific community works may be substantially different. To consider just one particularly relevant example, consider what the field of machine ethics looks like without the Yudkowskian line.
I agree that Eliezer has substantially altered the field of machine ethics. My view here is very much contingent on the belief that elites will navigate the creation of AI just fine, which is highly nonobvious.
Other options on the table are not mutually exclusive. There is a lot of wealth and intellectual brain power in the world, and a lot of things to work on. We can’t and shouldn’t all work on one most important problem. We can’t all work on the thousand most important problems. We can’t even agree on what those problems are.
I suspect Eliezer has a comparative advantage in working on this type of AI research, and he’s interested in it, so it makes sense for him to work on this. It especially makes sense to the extent that this is an area no one else is addressing. We’re only talking about an expenditure of several careers and a few million dollars. Compared to the world economy, or even compared to the non-profit sector, this is a drop in the bucket.
Now if instead Eliezer was the 10,000th smart person working on string theory, or if there was an Apollo-style government-funded initiative to develop an FAI by 2019, then my estimate of the comparative advantage of MIRI would shift. But given the facts as they are, MIRI seems like a plausible use of the limited resources it consumes.
If Eliezer feels that this is his comparative advantage then it’s fine for him to work on this sort of research — I’m not advocating that such research be stopped. My own impression is that Eliezer has comparative advantage in spreading rationality and that he could have a bigger impact by focusing on doing so.
I’m not arguing that such research shouldn’t be funded. The human capital question is genuinely more dicey, insofar as I think that Eliezer has contributed substantial value through his work on spreading rationality, and my best guess is that the opportunity cost of not doing more is large.
For starters, humans aren’t able to make changes as easily as an AI can. We don’t have direct access to our source code that we can change effortlessly; any change we make costs time, money, or both.
That doesn’t address the question. It says that an AI could more easily make self-modifications. It doesn’t suggest that an AI needs to make such self-modifications. Human intelligence is an existence proof that human-level intelligence does not require “billions of sequential self-modifications”. Whether greater than human intelligence requires it, in fact whether greater than human intelligence is even possible, is still an open question.
So I reiterate, “Why do you think that an AI would need to make billions of sequential self-modifications when humans don’t need to?”
Human intelligence required billions of sequential modifications (though not self-modifications). An AI in general would not need self-modifications, but for an AGI it seems that it would be necessary. I don’t doubt a formal reasoning for the latter statement has been written by someone smarter than me before, but a very informal argument would be something like this:
If an AGI doesn’t need to self-modify, then that AGI is already perfect (or close enough that it couldn’t possibly matter). Since practically no software humans ever built was ever perfect in all respects, that seems exceedingly unlikely. Therefore, the first AGI would (very likely) need to be modified. Of course, at the beginning it might be modified by humans (thus, not self-modified), but the point of building AGI is to make it smarter than us. Thus, once it is smarter than us by a certain amount, it wouldn’t make sense for us (stupider intellects) to improve it (smarter intellect). Thus, it would need to self-modify, and do it a lot, unless by some ridiculously fortuitous accident of math (a) human intelligence is very close to the ideal, or (b) human intelligence will build something very close to the ideal on the first try.
It would be nice if those modifications would be things that are good for us, even if we can’t understand them.
″...need to make billions of sequential self-modifications when humans don’t need to” to do what? Exist, maximize utility, complete an assignment, fulfill a desire...? Some of those might be better termed as “wants” than “needs” but that info is just as important in predicting behavior.
FWIW, Jonah has a PhD in math and has probably read Pearl or a similar graphical models book.
(Not directly relevant to the conversation, but just trying to lower your probability estimate that Jonah’s objections are naive.)
I don’t see it as a decisive point, just one of “many weak arguments,” but I think the analogy with human self-modification is relevant. I would like to see more detailed discussion of the issue.
Aspects of this that seem relevant to me:
Genetic and cultural modifications to human thinking patterns have been extremely numerous. If you take humanity as a whole as an entity doing self-modification on itself, there have been an extremely large number of successful self-modifications.
Genetic and cultural evolution have built humans individually capable of self-modification without stumbling over Lobian obstacles. Evolution and culture likely used relatively simple and easy search processes to do this, rather than ones that rely on very sophisticated mathematical insights. Analogously, one might expect that people will develop AGI in a way that overcomes these problems as well.
Self-modification is to be interpreted to include ‘directly editing one’s own low-level algorithms using high-level deliberative process’ but not include ‘changing one’s diet to change one’s thought processes’. If you are uncomfortable using the word ‘self-modification’ for this please substitute a new word ‘fzoom’ which means only that and consider everything I said about self-modification to be about fzoom.
Humans wouldn’t look at their own source code and say, “Oh dear, a Lobian obstacle”, on this I agree, but this is because humans would look at their own source code and say “What?”. Humans have no idea under what exact circumstances they will believe something, which comes with its own set of problems. The Lobian obstacle shows up when you approach things from the end we can handle, namely weak but well-defined systems which can well-define what they will believe, whereas human mathematicians are stronger than ZF plus large cardinals but we don’t know how they work or what might go wrong or what might change if we started editing neural circuit #12,730,889,136.
As Christiano’s work shows, allowing for tiny finite variances of probability might well dissipate the Lobian obstacle, but that’s the sort of thing you find out by knowing what a Lobian obstacle is.
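For readers who want the precise statement behind the obstacle: Löb’s theorem says that for a theory T extending Peano Arithmetic and any sentence P,

```latex
% Löb's theorem: if T proves that provability of P implies P,
% then T already proves P outright:
T \vdash (\Box_T P \rightarrow P) \;\Longrightarrow\; T \vdash P
```

So an agent reasoning in T cannot adopt the blanket soundness schema (□_T P → P for every P) without thereby proving every sentence, which is roughly why an agent cannot straightforwardly trust a successor on the grounds “it only acts on what T proves.”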
Very helpful. This seems like something that could lead to a satisfying answer to my question. And don’t worry, I won’t engage in a terminological dispute about “self-modification.”
Can you clarify a bit what you mean by “low-level algorithms”? I’ll give you a couple of examples related to what I’m wondering about.
Suppose I am working with a computer to make predictions about the weather, and we consider the operations of the computer along with my brain as a single entity for the purposes of testing whether the Lobian obstacles you are thinking of arise in practice. Now suppose I make basic modifications to the computer, expecting that the joint operation of my brain with the computer will yield improved output. This will not cause me to trip over Lobian obstacles. Why does whatever concern you have about the Lob problem predict that it would not, but also predict that future AIs might stumble over the Lob problem?
Another example. Humans learn different mental habits without stumbling over Lobian obstacles, and they can convince themselves that adopting the new mental habits is an improvement. Some of these are more derivative (“Don’t do X when I have emotion Y”) and others are perhaps more basic (“Try to update through explicit reasoning via Bayes’ Rule in circumstances C”). Why does whatever concern you have about the Lob problem predict that humans can make these modifications without stumbling, but also predict that future AIs might stumble over the Lob problem?
If the answer to both examples is “those are not cases of directly editing one’s low-level algorithms using high-level deliberative processes,” can you explain why your concern about Lobian issues only arises in that type of case? This is not me questioning your definition of “fzoom,” it is my asking why Lobian issues only arise when you are worrying about fzoom.
The first example is related to what I had in mind when I talked about fundamental epistemic standards in a previous comment:
Well, part of this is because modern humans are monstrous in the eyes of many pre-modern humans. To them, the future has been lost because they weren’t using a self-modification procedure that provably preserved their values.
Here is my general response to your concern.
I’m confused by this sentence. There are many statistical testing methods that output what are essentially proofs; e.g. statements of the form “probability of a failure existing is at most 10^(-100)”. Why would this not be sufficient?
More generally, as I’ve said in another comment, I would really like to understand how the Lob obstacle relates to statistical learning methods, especially since those seem like our best guess as to what an AI paradigm would look like.
What sort of statistical testing method would output a failure probability of at most 10^(-100) for generic optimization problems without trying 10^100 examples? You can get this in some mathematical situations, but only because if X doesn’t have property Y then it has an independent 50% chance of showing property Z on many different trials of Z. For more generic optimization problems, if you haven’t tested fitness on 10^100 occasions you can’t rule out a >10^(-100) probability of any sort of possible blowup. And even if you test 10^100 samples, the guarantee is only as strong as your belief that the samples were taken from a probability distribution exactly the same as real-world contexts likely to be encountered, down to the 100th decimal place.
It depends on the sort of guarantee you want. Certainly I can say things of the form “X and Y differ from each other in mean by at most 0.01” with a confidence that high, without 10^100 samples (as long as the samples are independent or at least not too dependent).
If your optimization problem is completely unstructured then you probably can’t do better than the number of samples you have, but if it is completely unstructured then you also can’t prove anything about it, so I’m not sure what point you’re trying to make. It seems a bit unimaginative to think that you can’t come up with any statistical structure to exploit, especially if you think there is enough mathematical structure to prove strong statements about self-modification.
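For concreteness, the mean-difference claim above can be checked with a standard Hoeffding bound; the sketch below (the function name and parameters are my own, not from the thread) shows that guaranteeing the means agree to within 0.01 at confidence 1 − 10^(-100) takes on the order of a million independent samples of [0, 1]-bounded variables, not 10^100:

```python
import math

# Hoeffding's inequality for i.i.d. variables bounded in [0, 1]:
#   P(|empirical mean - true mean| >= t) <= 2 * exp(-2 * n * t**2)
# Solving for n gives the sample count needed to push the
# failure probability below delta.
def samples_needed(t, delta):
    return math.ceil(math.log(2 / delta) / (2 * t ** 2))

# Confidence 1 - 10^-100 that two estimated means are within 0.01:
n = samples_needed(t=0.01, delta=1e-100)
print(n)  # on the order of 10^6 samples
```

Of course, this only delivers what Eliezer asks about insofar as the independence and boundedness assumptions actually hold of the real-world distribution, which is the “true assumptions” sticking point.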
If you can get me a conditionally independent failure probability of 10^-100 per self-modification by statistical techniques whose assumptions are true, I’ll take it and not be picky about the source. It’s the ‘true assumptions’ part that seems liable to be a sticking point. I understand how to get probabilities like this by doing logical-style reasoning on transistors with low individual failure probabilities and proving a one-wrong-number assumption over the total code (i.e., total code functions if any one instruction goes awry) but how else would you do that?
It seems as though they would involve a huge number of trials.
“Evolutionary” algorithms aren’t typically used to change fitness functions anyway. They are more usually associated with building representations of the world to make predictions with. This complaint would seem to only apply to a few “artificial life” models—in which all parts of the system are up for grabs.
(Approximate orders of magnitude:)
Number of atoms in universe : 10^80
Number of atoms in a human being: 10^28
Number of humans that have existed: 10^10
Number of AGI-creating-level inventions expected to be made by humans: 10^0–10^1
Number of AGI-creating-level inventions expected to be made by 1% (10^-2) of the universe turned into computronium, with no more than human-level thought-to-matter efficiency, extrapolating linearly: 10^(80 − 2 − 10 − 28) = 10^40.
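The estimate is just addition of exponents; a quick check using only the rough figures from the list above:

```python
# Orders of magnitude (exponents of base 10) from the list above:
atoms_in_universe = 80
computronium_fraction = -2   # 1% of the universe
humans_so_far = 10
atoms_per_human = 28

# Inventions at human-level thought-to-matter efficiency:
inventions = atoms_in_universe + computronium_fraction \
             - humans_so_far - atoms_per_human
print(inventions)            # 40, i.e. ~10^40 inventions

# A per-event failure probability of 10^-100 then gives an expected
# 10^(40 - 100) = 10^-60 failures across all of them.
print(inventions - 100)      # -60
```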
Hmm, that doesn’t sound that bad, but we got from 10^(-100) to 10^(-60) really fast. Also, I don’t think Eliezer was talking about that kind of statistical method.
I mean, I could easily make the 100 into a 400, so I don’t think this is that relevant.
Yes, the last sentence is probably my real “objection”. (Well, I don’t object to your statements, I just don’t think that’s what Eliezer meant. Even if you run a non-statistical, deterministic theorem prover, using current hardware the probability of failure is much above 10^-100.)
The silly part of the comment was just a reminder (partly to myself) that AGI problems can span orders of magnitude so ridiculously outside the usual human scale that one can’t quite approximate (the number of atoms in the universe)^-1 as zero without thinking carefully about it.
Possible typo: the subscript in Equation 4.2 reads T-n+1.
Should this be T-(n+1)?
On page 12, when you talk about the different kinds of trust, it seems like tiling trust is just a subtype of naturalistic trust. If something running on T can trust some arbitrary physical system if that arbitrary physical system implements T, then it should be able to trust its successor if that successor is a physical system that implements T. Not sure if that means anything.
This is correct; naturalistic trust subsumes indefinitely tiling trust.
Can’t you require that the agents you swap to spend at least some fraction of their effort on meliorizing? Each swap could lower that fraction, based on how much the expected value had increased (the closer we are to the goal, the less we need to search more) and how much effort had already been expended (if we’ve searched enough, we can be pretty sure that there’s not a better solution). More formally, you would want to spend meliorizing effort relative to the optimality gap you’re facing (or whatever crude approximation to it you have), and the cost of spending more effort relative to your current best plan (you might have another day you can spend looking, or it might be that if you don’t stop planning and start doing now, you lose everyone).
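As a purely illustrative sketch of that proposal (the decay rule, constants, and names below are hypothetical, not anything from the paper):

```python
def meliorizing_fraction(prev_fraction, value_gain, effort_spent,
                         gain_scale=1.0, effort_scale=100.0):
    """Fraction of effort a swapped-in successor must spend searching
    for better plans. The fraction shrinks as expected value rises
    (less room left to improve) and as cumulative search effort grows
    (diminishing returns). The specific functional form is made up."""
    shrink = 1.0 / (1.0 + value_gain / gain_scale
                    + effort_spent / effort_scale)
    return prev_fraction * shrink

# Each swap lowers the required meliorizing fraction, but never to zero:
f = 0.5
for gain, effort in [(0.2, 10), (0.5, 30), (1.0, 80)]:
    f = meliorizing_fraction(f, gain, effort)
    print(round(f, 4))
```

The point of the lower bound staying positive is that the agent can never swap to a successor that abandons the search entirely, which is the failure mode the underspecified swapping criterion allows.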
All optimization involves a generate-and-test procedure. Insisting on proofs is a lot like insisting on testing every variant in a given set. It’s a constraint on the optimization processes used—and such constraints seem at least as likely to lead to worse results as they do to better ones.
By analogy, being able to prove there’s no mate in three doesn’t rule out a mate in four—that a more sensible and less constrained algorithm might easily have found.
Basically, optimizing using proofs (in this way) is like trying to fight with both of your hands tied. Yes, that stops you from hitting yourself in the face—but that isn’t the biggest problem in the first place.