This is all assuming an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried.
However, in practice, I expect that powerful AI systems will not look like they are explicitly maximizing some utility function. In this scenario, if you change some component of the system for the worse, you are likely to degrade its performance, but not likely to drastically change its behavior to cause human extinction. For example, even in RL (which is the closest thing to expected utility maximization), you can have serious bugs and still do relatively well on the objective. A public example of this is in OpenAI Five (https://blog.openai.com/openai-five/), but I also hear this expressed when talking to RL researchers (and see this myself).
While you still want to be very careful with self-modification, it seems generally fine not to have a formal proof before making the change, and instead to evaluate the change after it has taken place. (This would fail dramatically if the change drastically changed behavior, but if it only degrades performance, I expect the AI would still be competent enough to notice and undo the change.)
You could worry about daemons exploiting these bugs under this view. I think this is a reasonable worry, but don’t expect formalizing zero-shot reasoning to help with it. It seems to me that daemons occur by falling into a local optimum when you are trying to optimize for doing some task—the daemon does that task well in order to gain influence, and then backstabs you. This can arise both in ideal zero-shot reasoning, and when introducing approximations to it (as we will have to do when building any practical system).
In particular, the one context where we’re most confident that daemons arise is Solomonoff induction, which is one of the best instances of formalizing zero-shot reasoning that we have. Solomonoff gives you strong guarantees, of the sort you can use in proofs—and yet, daemons arise.
I would be very surprised if we were able to handle daemons without some sort of daemon-specific research.
you can have serious bugs and still do relatively well on the objective. A public example of this is in OpenAI Five (https://blog.openai.com/openai-five/), but I also hear this expressed when talking to RL researchers (and see this myself).
My impression is that most of these ‘serious bugs’ are something like “oops, our gradient descent is actually gradient ascent, but it worked out alright because our utility function is also the negative of what it should be” which is not particularly heartening.
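(As an aside, it may be worth spelling out why this particular pair of bugs is survivable. Assuming a standard gradient-descent update on a loss $L$ with step size $\alpha$, the doubly-buggy version is ascent on $-L$:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta\bigl(-L(\theta_t)\bigr) = \theta_t - \alpha \nabla_\theta L(\theta_t),$$

which is exactly the intended step, so the two errors cancel. Fixing only one of them gives $\theta_{t+1} = \theta_t + \alpha \nabla_\theta L(\theta_t)$, which actively increases the loss and would be hard to miss in the training curves.)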
While you still want to be very careful with self-modification, it seems generally fine not to have a formal proof before making the change, and instead to evaluate the change after it has taken place. (This would fail dramatically if the change drastically changed behavior, but if it only degrades performance, I expect the AI would still be competent enough to notice and undo the change.)
Even to changes to how it performs or evaluates self-modification? Eurisko comes to mind as a program that could and did give itself cancer, requiring its programmer to notice that it had died and restart it, and it’s the sort of thing that AI programmers would do by default.
My impression is that most of these ‘serious bugs’ are something like “oops, our gradient descent is actually gradient ascent, but it worked out alright because our utility function is also the negative of what it should be” which is not particularly heartening.
Even if this were true I would not update much. If you actually had only one of those and not the other, you would notice _really fast_, so it’s not going to harm you.
The bugs I’m imagining are more like “we did a bunch of math and got out an equation, but missed a minus sign in one of the terms that’s usually quite small, resulting in a small error in the value calculated, making learning less efficient”. OpenAI had a bug where their bots would get a negative reward for reaching level 25. If you introduce these kinds of bugs with a change, you’ll notice less efficient learning, and hopefully correct it, and it only leads to degraded performance, not catastrophic outcomes.
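As a minimal sketch of the kind of benign bug described here (a toy example assumed for illustration, not OpenAI’s actual bug), a wrongly-signed small auxiliary term shifts the optimum slightly and wastes some learning, but the optimizer still converges rather than doing anything catastrophic:

```python
# Toy illustration (assumed example, not OpenAI's actual bug): a sign error in a
# *small* auxiliary term degrades the result slightly instead of reversing learning.

def train(reg_sign, steps=500, lr=0.1, reg=0.05):
    """Minimize f(x) = (x - 3)^2 plus a small L2 term whose sign may be flipped."""
    x = 10.0
    for _ in range(steps):
        grad_main = 2 * (x - 3)        # gradient of the real objective
        grad_reg = 2 * reg * x         # gradient of the small auxiliary term
        x -= lr * (grad_main + reg_sign * grad_reg)
    return x, (x - 3) ** 2             # final parameter and loss on the real objective

print("correct sign:", train(+1))      # ends near x = 2.86, tiny loss
print("flipped sign:", train(-1))      # ends near x = 3.16, loss still tiny, just worse
```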
Even to changes to how it performs or evaluates self-modification? Eurisko comes to mind as a program that could and did give itself cancer, requiring its programmer to notice that it had died and restart it, and it’s the sort of thing that AI programmers would do by default.
I agree that you want to be extra careful with self-modification, but there are lots of easy steps you can take to in fact be extra careful, e.g. creating a copy of yourself with the modification and seeing what it tends to do on a suite of problems where you expect the modification to be helpful/harmful.
We may also have different pictures of what self-modification looks like. Under your view, it seems like AI researchers are going to add a self-modification routine to the AI, which can unilaterally rewrite the source code of the AI as it wants. Under my view, AI researchers don’t really think much about self-modification, and just build an AI system capable of learning and performing general tasks, one of which could be the task of improving the AI system with very high confidence that the proposed improvement will work.
Do you generally trust that you personally could be handed the key to human self-modification? I feel reasonably confident that such a tool would help me (or at least, not harm me, in that I might decide not to use it). Since it’s much easier for an AI to run experiments on copies of itself, it should be a much easier task for the AI to use such a tool well.
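For concreteness, here is a minimal sketch of the “try it on a copy first” step suggested above. Every name in it (`consider_modification`, `eval_suite`, and so on) is hypothetical, and it is a cartoon of the protocol rather than anyone’s actual design:

```python
# Illustrative sketch only: evaluate a self-modification on a sandboxed copy and
# adopt it only if it doesn't regress on a held-out suite. Names are hypothetical.
import copy

def consider_modification(agent, modification, eval_suite, margin=0.0):
    """Apply `modification` to a copy, compare against the current agent, and
    only adopt the change if it does at least as well on the evaluation suite."""
    candidate = copy.deepcopy(agent)       # sandboxed copy; the original is untouched
    modification(candidate)                # mutate the copy, never the original

    baseline_score = sum(task(agent) for task in eval_suite)
    candidate_score = sum(task(candidate) for task in eval_suite)

    if candidate_score >= baseline_score + margin:
        return candidate                   # adopt the modified version
    return agent                           # revert: keep the current version
```

The point of the sketch is just that “evaluate after the change, revert if it regresses” is a cheap policy that doesn’t require a formal proof up front.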
Under your view, it seems like AI researchers are going to add a self-modification routine to the AI, which can unilaterally rewrite the source code of the AI as it wants. Under my view, AI researchers don’t really think much about self-modification, and just build an AI system capable of learning and performing general tasks, one of which could be the task of improving the AI system with very high confidence that the proposed improvement will work.
I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?
I think it’s likely that there will be some safeguards in place, much in the way that you don’t get robust multicellular life without some mechanisms of correcting cancers when they develop. The root of my worry here is that I don’t expect this problem to be solved well if researchers aren’t thinking much about self-modification (and thus how to solve it well).
Do you generally trust that you personally could be handed the key to human self-modification?
I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don’t have enough transparency for the consequences. If I have a dial with my IQ on it, probably; likewise if I have a set of dials for the strength of various motivations. But here I would still feel like there are significant risks associated with moving outside normal bounds, risks I would be accepting because we live in weird times. [For example, it seems likely that some genes that increase intelligence also increase brain cancer risk, and it seems possible that ‘turning the IQ dial’ with this key would similarly increase my chance of having brain cancer.]
Similarly, being able to print the genome for potential children rather than rolling randomly or selecting from a few options seems like it would be useful and I would use it, but is not making the situation significantly safer and could easily lead to systematic problems because of correlated choices.
The thing I’m worried about is fixing only one of them (see Reason as Memetic Immune Disorder).
Right, I’m arguing that if you only fixed one of them, you would notice _immediately_, and either revert back to the version with both bugs, or find the other bug. I’m also claiming that this should be what happens in general, assuming sufficient caution around self-modification (on the AI’s part, not the researcher’s part).
I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?
I don’t think of gradient descent as self-modification. If an AI system were able to choose (i.e. learned a policy for) when to run gradient descent on itself, and what training data it should use for that, that might be self-modification. Meta-learning feels similar to me—the AI doesn’t get to choose training data or what to run gradient descent on. The only learned part in current meta-learning approaches is how to perform a task, not how to learn.
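To make the “what is learned vs. what is fixed” distinction concrete, here is a toy Reptile-style meta-learning loop (a simplified sketch of my own, not any particular paper’s code). The loop structure, learning rates, task distribution, and update rules are all hard-coded by the researcher; the only learned quantity is the parameter initialization, i.e. how to perform tasks:

```python
# Toy Reptile-style meta-learning loop (simplified sketch). The system never
# chooses when to run gradient descent or what data to train on; it only learns w_meta.
import random

def task_grad(w, a, xs):
    # gradient of mean squared error for the task "predict y = a * x" at parameter w
    return sum(2 * (w * x - a * x) * x for x in xs) / len(xs)

w_meta = 0.0                                   # learned: the meta-initialization
inner_lr, outer_lr = 0.1, 0.05                 # fixed: chosen by the researcher
for _ in range(2000):                          # fixed: the outer loop itself
    a = random.uniform(-2.0, 2.0)              # fixed: the task distribution
    xs = [random.uniform(-1.0, 1.0) for _ in range(10)]
    w_task = w_meta - inner_lr * task_grad(w_meta, a, xs)  # fixed: the inner update rule
    w_meta += outer_lr * (w_task - w_meta)                  # fixed: the Reptile-style outer update
```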
I don’t expect this problem to be solved well if researchers aren’t thinking much about self-modification (and thus how to solve it well).
This might be all of our disagreement actually. Like you, I’m quite pessimistic about any system where researchers put in place a protocol for self-modification, which seems to be what you are imagining. Either the protocol is too lax and we get the sort of issues you’re talking about, or it’s too strict and self-modification never happens. However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have “thoughts” of the form “Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox”.
I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don’t have enough transparency for the consequences.
But in that case, you would simply choose not to use it, or to do a lot of research before trying to use it.
However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have “thoughts” of the form “Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox”.
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification). It seems likely that it could be in a situation like some of the Cake or Death problems, where it views a change to itself as impacting only part of its future behavior (like affecting actions but not values, such that it suspects that a future it that took path A would be disappointed in itself and fix that bug, without realizing that the change it’s making will cause future it to not be disappointed by path A), or is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).
I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes “sufficiently complex”. I’m not sure if humans perform sufficiently complex self-modifications, I think our first AGIs might perform sufficiently complex self-modifications, and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.
is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
+100. This is why I feel queasy about “OK, I judge this self-modification to be fine” when the self-modifications are sufficiently complex, if this judgment isn’t based off something like zero-shot reasoning (in which case we’d have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).
I agree this seems like a crux for me as well, subject to the caveat that I think we have different ideas of what “self-modification” is (though I’m not sure it matters that much).
Both of the comments feel to me like you’re making the AI system way dumber than humans, and I don’t understand why I should expect that. I think I could make a better human with high confidence/robustness if you give me a human-modification-tool that I understand reasonably well and I’m allowed to try and test things before committing to the better human.
This is all assuming an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried.
I’m mostly concerned with daemons, not utility functions changing in random directions. If I knew that corrigibility were robust and that a corrigible AI would never encounter daemons, I’d feel pretty good about it recursively self-improving without formal zero-shot reasoning.
You could worry about daemons exploiting these bugs under this view. I think this is a reasonable worry, but don’t expect formalizing zero-shot reasoning to help with it. It seems to me that daemons occur by falling into a local optimum when you are trying to optimize for doing some task—the daemon does that task well in order to gain influence, and then backstabs you. This can arise both in ideal zero-shot reasoning, and when introducing approximations to it (as we will have to do when building any practical system).
I’m imagining the AI zero-shot reasoning about the correctness and security of its source code (including how well it’s performing zero-shot reasoning), making itself nigh-impossible for daemons to exploit.
In particular, the one context where we’re most confident that daemons arise is Solomonoff induction, which is one of the best instances of formalizing zero-shot reasoning that we have. Solomonoff gives you strong guarantees, of the sort you can use in proofs—and yet, daemons arise.
I think of Solomonoff induction less as a formalization of zero-shot reasoning, and more as a formalization of some unattainable ideal of rationality that will eventually lead to better conceptual understandings of bounded rational agents, which will in turn lead to progress on formalizing zero-shot reasoning.
I would be very surprised if we were able to handle daemons without some sort of daemon-specific research.
In my mind, there’s no clear difference between preventing daemons and securing complex systems. For example, I think there’s a fundamental similarity between the following questions:
How can we build an organization that we trust to optimize for its founders’ original goals for 10,000 years?
How can we ensure a society of humans flourishes for 1,000,000,000 years without falling apart?
How can we build an AGI which, when run for 1,000,000,000 years, still optimizes for its original goals with > 99% probability? (If it critically malfunctions, e.g. if it “goes insane”, it will not be optimizing for its original goals.)
How can we build an AGI which, after undergoing an intelligence explosion, still optimizes for its original goals with > 99% probability?
I think of AGIs as implementing miniature societies teeming with subagents that interact in extraordinarily sophisticated ways (for example they might play politics or Goodhart like crazy). On this view, ensuring the robustness of an AGI entails ensuring the robustness of a society at least as complex as human society, which seems to me like it requires zero-shot reasoning.
It seems like a simpler task would be building a spacecraft that can explore distant galaxies for 1,000,000,000 years without critically malfunctioning (perhaps with the help of self-correction mechanisms). Maybe it’s just a failure of my imagination, but I can’t think of any way to accomplish even this task without delegating it to a skilled zero-shot reasoner.
I’m imagining the AI zero-shot reasoning about the correctness and security of its source code (including how well it’s performing zero-shot reasoning), making itself nigh-impossible for daemons to exploit.
This seems like it’s using a bazooka to kill a fly. I’m not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?
I think of Solomonoff induction less as a formalization of zero-shot reasoning, and more as a formalization of some unattainable ideal of rationality that will eventually lead to better conceptual understandings of bounded rational agents, which will in turn lead to progress on formalizing zero-shot reasoning.
Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?
In my mind, there’s no clear difference between preventing daemons and securing complex systems.
(I can’t seem to blockquote the bullet points, imagine I had done that.)
My reason for not having high confidence is that the time spans are incredibly long and many things could happen that I can’t predict. But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is “the AGI was incompetent at some point, made a mistake, and bad things happened”. I don’t know how to evaluate the probability of this and so become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to “want” to do that.
However, I expect that the way we actually get high-confidence answers to those questions is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course-correcting in order to stay on the path.
It seems like a simpler task would be building a spacecraft that can explore distant galaxies for 1,000,000,000 years without critically malfunctioning (perhaps with the help of self-correction mechanisms).
If you’re trying to do this without putting some general intelligence into it, this sounds way harder to me, because you can’t build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)
This seems like it’s using a bazooka to kill a fly. I’m not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?
I agree that zero-shot reasoning doesn’t save us from daemons by itself, and I think there’s important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.
Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?
The daemons I’m focusing on here mostly arise from embedded agency, which Solomonoff induction doesn’t capture at all. (It’s worth noting that I consider there to be a substantial difference between Solomonoff induction daemons and “internal politics”/“embedded agency” daemons.) I’m interested in hashing this out further, but probably at some future point, since this doesn’t seem central to our disagreement.
But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is “the AGI was incompetent at some point, made a mistake, and bad things happened”. I don’t know how to evaluate the probability of this and so become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to “want” to do that.
However, I expect that the way we actually get high-confidence answers to those questions is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course-correcting in order to stay on the path.
....
If you’re trying to [build the spacecraft] without putting some general intelligence into it, this sounds way harder to me, because you can’t build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)
I’m surprised by how much we seem to agree about everything you’ve written here. :P Let me start by clarifying my position a bit:
When I imagine the AGI making a “plan that will work in one go”, I’m not imagining it going like “OK, here’s a plan that will probably work for 1,000,000,000 years! Time to take my hands off the wheel and set it in motion!” I’m imagining the plan looking more like “set a bunch of things in motion, reevaluate and update it based on where things are, and repeat”. So the overall shape of this AGI’s cognition will look something like “execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.”, happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects). The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn’t introduce any critical errors. (A rough sketch of this loop is below.)
I think an AGI competently optimizing for our values should almost certainly be exploring distant galaxies for billions of years (given the availability of astronomical computing resources). On this view, building a spacecraft that can explore the universe for 1,000,000,000 years without critical malfunctions is strictly easier than building an AGI that competently optimizes for our values for 1,000,000,000 years.
Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours. So undergoing a safe intelligence explosion seems at least as difficult as getting an earthbound AGI doing 1,000,000 years’ worth of human cognition without any catastrophic failures.
I’m less concerned about the AGI killing its operators than I am about the AGI failing to capture a majority of our cosmic endowment. It’s plausible that the latter usually leads to the former (particularly if there’s a fast takeoff on Earth that completes in a few hours), but that’s mostly not what I’m concerned about.
In terms of actual disagreement, I suspect I’m much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it’s doing something like 1,000,000 years’ worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What’s your take on this?
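A rough sketch of the execute/reevaluate/update loop described above, with every name standing in for a concept from this thread rather than any real system (`verify_update` is where something like zero-shot reasoning would have to do its work):

```python
# Sketch of the execute / reevaluate / update loop. All objects and methods here
# are placeholders for concepts in the discussion, not real APIs.
def run(agent, plan, environment):
    while not environment.done():
        observations = environment.execute(plan)             # act on the current plan for a while
        proposed_plan = agent.reevaluate(plan, observations)  # course-correct based on what happened

        # The step the zero-shot reasoning is meant to protect: only adopt the
        # revision if the agent can establish it introduces no critical errors.
        if agent.verify_update(plan, proposed_plan):
            plan = proposed_plan
        # otherwise keep the old plan and try again on the next pass
    return plan
```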
I agree that zero-shot reasoning doesn’t save us from daemons by itself, and I think there’s important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.
You must be really pessimistic about our chances.
The daemons I’m focusing on here mostly arise from embedded agency, which Solomonoff induction doesn’t capture at all.
Huh, okay. I still don’t know the mechanism by which zero-shot reasoning helps us avoid x-risk, so it might be useful to describe these daemons in more detail. I continue to think that zero-shot reasoning does not seem necessary for e.g. ensuring a flourishing human civilization for a billion years.
So the overall shape of this AGI’s cognition will look something like “execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.”, happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects)
Agreed that this is a control mechanism that course-corrects.
The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn’t introduce any critical errors.
But why is the probability of introducing critical errors so high without zero-shot reasoning? Perhaps over a billion years even a 1-in-a-million chance is way too high, but couldn’t we spend some fraction of the first 10,000 years figuring that out?
Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours.
This seems extraordinarily unlikely to me (feels like ~0%), if we’re talking about the first AGI that we build. If we’re not talking about that, then I’d want to know what “intelligence explosion” means—how intelligent was the most intelligent thing before the intelligence explosion? (I think this is tangential though, so feel free to ignore it.)
building an AGI that competently optimizes for our values for 1,000,000,000 years.
Just, don’t build that. No one is trying to build that. It’s a hard problem; we don’t know what our values are; it’s difficult. We can instead build an AGI that wants to help humans do whatever it is they want to do, help them figure out what they want to do, assist human development and flourishing at any given point in time, and continually learn from humans to figure out what should be done. In this version, humans are a control mechanism for the AI, and so we should expect the problem to be a lot easier.
I suspect I’m much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it’s doing something like 1,000,000 years’ worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What’s your take on this?
I was talking about the AI as a control mechanism for the task that needs to be done, not a control mechanism that course-corrects the AI. I don’t expect there to be a particular subsystem of the AI that is responsible for course correction, just as there isn’t a particular subsystem in the human brain responsible for thinking “Huh, I guess now that condition X has arisen, I should probably take action Y in order to deal with it”.
I have no intuition for why there should be daemons, what they look like, how they take over the AI, etc. especially if they are different in kind from Solomonoff daemons. This basically sounds to me like positing the existence of a bad thing, and so concluding that we get bad things. I’m sure there’s more to your intuition, but I don’t know what it is and don’t share the intuition.
Executing and reevaluating plans many times seems like it’s fine, ignoring cases where the environment gets too difficult for the AI to deal with (i.e. the AI is incompetent). I expect the evaluation process to be robust by constantly getting human input.
I should clarify a few more background beliefs:
I think zero-shot reasoning is probably not very helpful for the first AGI, and will probably not help much with daemons in our first AGI.
I agree that right now, nobody is trying to (or should be trying to) build an AGI that’s competently optimizing for our values for 1,000,000,000 years. (I’d want an aligned, foomed AGI to be doing that.)
I agree that if we’re not doing anything as ambitious as that, it’s probably fine to rely on human input.
I agree that if humanity builds a non-fooming AGI, they could coordinate around solving zero-shot reasoning before building a fooming AGI in a small fraction of the first 10,000 years (perhaps with the help of the first AGI), in which case we don’t have to worry about zero-shot reasoning today.
Conditioning on reasonable international coordination around AGI at all, I give 50% to coordination around intelligence explosions. I think the likelihood of this outcome rises with the amount of legitimacy zero-shot reasoning has at coordination time, which is my main reason for wanting to work on it today. (If takeoff is much slower I’d give something more like 80% to coordination around intelligence explosions, conditional on international coordination around AGIs.)
Let me now clarify what I mean by “foomed AGI”:
A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. (“Optimally optimized optimizer” is another way of putting it.)
You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it’s more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
In this “nuclear explosion” of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.
In this comment thread, I was referring primarily to foomed AGIs, not the first AGIs we build. I imagine you either having a different picture of takeoff, or thinking something like “Just don’t build a foomed AGI. Just like it’s way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it’s way too hard to build a safe foomed AGI, so let’s just not do it”. And my position is something like “It’s probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let’s do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right.”
I’m happy to delve into your individual points, but before I do so, I’d like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.
A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. (“Optimally optimized optimizer” is another way of putting it.)
I have a strong intuition that “optimal algorithms for intelligence per relevant unit of computation” don’t exist. There are lots of no-free-lunch theorems around this (one standard form is stated below). Intelligence is contextual; as a concrete example, children are better than adults in novel situations with unusual causal factors (https://cocosci.berkeley.edu/tom/papers/LabPublications/GopnicketalYoungLearners.pdf). In AI, the explore-exploit tradeoff is quite fundamental and it seems unlikely that you can find a fully general solution to it.
I still don’t know what “intelligence explosion within a year” means; is it relative to human intelligence? The intelligence of the previous AGI? Along what metric are you measuring intelligence? If I consider the “reasonable view” of what these terms mean, I expect that there will never be an intelligence explosion that would be considered “fast” (in the way that AGI intelligence explosion in a year would be “fast” to us) by the next-most intelligent system that exists.
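(For reference, the no-free-lunch intuition being appealed to, roughly in the Wolpert–Macready form for black-box search over a finite domain: for any two algorithms $a_1$ and $a_2$,

$$\sum_{f} P(d_m^y \mid f, m, a_1) = \sum_{f} P(d_m^y \mid f, m, a_2),$$

where the sum ranges over all objective functions $f\colon \mathcal{X} \to \mathcal{Y}$ and $d_m^y$ is the sequence of the first $m$ objective values the algorithm observes. Averaged uniformly over all possible objectives, no search algorithm outperforms any other, so any notion of “optimal intelligence” has to be relative to some restricted class of environments.)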
You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it’s more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
I’m not sure what I’m supposed to get out of the analogy. If you’re saying that a foomed AGI is way more powerful than the first AGI, sure. If you’re saying they can do qualitatively different things, sure.
I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
I don’t know if I’d say I expect this, but I do consider this scenario often so I’m happy to talk about it, and I have been assuming that during this discussion.
In this “nuclear explosion” of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.
I’m still very unclear on how you’re operationalizing an intelligence explosion. If an intelligence explosion happens only after a million iterations of AGI systems improving themselves, then this seems true to me, but also the humans will have AGI systems that are way smarter than them to assist them during this time.
I imagine you either having a different picture of takeoff, or thinking something like “Just don’t build a foomed AGI. Just like it’s way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it’s way too hard to build a safe foomed AGI, so let’s just not do it”.
I think it’s the first. I’m much more sympathetic to the picture of “slow” takeoff in Will AI See Sudden Progress? and Takeoff speeds. I don’t imagine ever building a very capable AI that explicitly optimizes a utility function, since a multiagent system (i.e. humanity) is unlikely to have a utility function. However, I can imagine building a safe foomed AGI.
And my position is something like “It’s probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let’s do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right.”
It would be quite surprising to me if the right thing to do to ensure that nation states and individual actors understand this point would be to formalize zero-shot reasoning.
In addition, I could imagine building a safe foomed AGI that is corrigible and so does not require a solution to metaphilosophy; but I’m happy to consider the case where that is necessary (which seems decently likely to me), and in those worlds I expect that we are able to use the first AGI systems to help us figure out metaphilosophy.
I’m happy to delve into your individual points, but before I do so, I’d like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.
What takeoff looks like, what the notion of “intelligence” is, what an “intelligence explosion” consists of, the usefulness of initial AI systems in aligning future, more powerful AI systems, what daemons are.
Also, on a more epistemic note, how much weight to put on long chains of reasoning that rely on soft, intuitive concepts, and how much to trust intuitions about tasks longer than ~100 years.