This seems like it’s using a bazooka to kill a fly. I’m not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?
I agree that zero-shot reasoning doesn’t save us from daemons by itself, and I think there’s important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.
Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?
The daemons I’m focusing on here mostly arise from embedded agency, which Solomonoff induction doesn’t capture at all. (It’s worth noting that I consider there to be a substantial difference between Solomonoff induction daemons and “internal politics”/”embedded agency” daemons.) I’m interested in hashing this out further, but probably at some future point, since this doesn’t seem central to our disagreement.
But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is “the AGI was incompetent at some point, made a mistake, and bad things happened”. I don’t know how to evaluate the probability of this, and so I become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to “want” to do that.
However, I expect that the way we actually get high confidence answers to those questions, is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course correcting in order to stay on the path.
....
If you’re trying to [build the spacecraft] without putting some general intelligence into it, this sounds way harder to me, because you can’t build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)
I’m surprised by how much we seem to agree about everything you’ve written here. :P Let me start by clarifying my position a bit:
When I imagine the AGI making a “plan that will work in one go”, I’m not imagining it going like “OK, here’s a plan that will probably work for 1,000,000,000 years! Time to take my hands off the wheel and set it in motion!” I’m imagining the plan to look more like “set a bunch of things in motion, reevaluate and update it based on where things are, and repeat”. So the overall shape of this AGI’s cognition will look something like “execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.”, happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects). The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn’t introduce any critical errors.
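To make the shape of that loop concrete, here is a minimal sketch in Python-style pseudocode. Every name here is hypothetical; the point is only the structure: act for a while, reevaluate, and gate each update on a zero-shot check rather than on trial and error.

```python
# Hypothetical sketch of the cognition loop described above. None of these
# objects exist anywhere; they just name the roles in the loop.

def run_agent(initial_plan, world, zero_shot_verifier, num_iterations):
    plan = initial_plan
    for _ in range(num_iterations):              # millions or billions of rounds
        observations = world.execute(plan)       # act on the current plan for a while
        candidate = plan.updated(observations)   # reevaluate and propose a revision
        # Zero-shot step: accept the revision only if we can argue up front,
        # without trial and error, that it introduces no critical errors.
        if zero_shot_verifier.certifies(plan, candidate):
            plan = candidate
        # otherwise keep the current plan and propose a more conservative revision
    return plan
```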
I think an AGI competently optimizing for our values should almost certainly be exploring distant galaxies for billions of years (given the availability of astronomical computing resources). On this view, building a spacecraft that can explore the universe for 1,000,000,000 years without critical malfunctions is strictly easier than building an AGI that competently optimizes for our values for 1,000,000,000 years.
Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours. So undergoing a safe intelligence explosion seems at least as difficult as getting an earthbound AGI to do 1,000,000 years’ worth of human cognition without any catastrophic failures.
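For a sense of scale, here is how that could pencil out; every number below is made up purely for illustration, not a claim about any particular system.

```python
# Illustrative arithmetic only; all numbers are assumptions for this example.
copies      = 10_000_000   # assumed number of parallel AGI instances
speedup     = 10_000       # assumed serial speed advantage over a human
clock_hours = 10           # assumed wall-clock duration of the explosion
hours_per_human_work_year = 2_000

subjective_hours = copies * speedup * clock_hours
print(subjective_hours / hours_per_human_work_year)  # ~5e8 human work-years
```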
I’m less concerned about the AGI killing its operators than I am about the AGI failing to capture a majority of our cosmic endowment. It’s plausible that the latter usually leads to the former (particularly if there’s a fast takeoff on Earth that completes in a few hours), but that’s mostly not what I’m concerned about.
In terms of actual disagreement, I suspect I’m much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it’s doing something like 1,000,000 years’ worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What’s your take on this?
I agree that zero-shot reasoning doesn’t save us from daemons by itself, and I think there’s important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.
You must be really pessimistic about our chances.
The daemons I’m focusing on here mostly arise from embedded agency, which Solomonoff induction doesn’t capture at all.
Huh, okay. I still don’t know the mechanism by which zero-shot reasoning helps us avoid x-risk, so it might be useful to describe these daemons in more detail. I continue to think that zero-shot reasoning does not seem necessary for, e.g., ensuring a flourishing human civilization for a billion years.
So the overall shape of this AGI’s cognition will look something like “execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.”, happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects)
Agreed that this is a control mechanism that course-corrects.
The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn’t introduce any critical errors.
But why is the probability of introducing critical errors so high without zero-shot reasoning? Perhaps over a billion years even a 1-in-a-million chance is way too high, but couldn’t we spend some fraction of the first 10,000 years figuring that out?
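For concreteness, the arithmetic behind “even a 1-in-a-million chance is way too high”: the numbers are illustrative, and this assumes failures are independent across steps.

```python
# Chance of at least one critical error over n reevaluation steps, if each step
# independently introduces one with probability p. Numbers are illustrative.
p = 1e-6
for n in (10_000, 1_000_000, 1_000_000_000):
    print(n, 1 - (1 - p) ** n)
# 10000         ~0.01
# 1000000       ~0.63
# 1000000000    ~1.00
```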
Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours.
This seems extraordinarily unlikely to me (feels like ~0%), if we’re talking about the first AGI that we build. If we’re not talking about that, then I’d want to know what “intelligence explosion” means—how intelligent was the most intelligent thing before the intelligence explosion? (I think this is tangential though, so feel free to ignore it.)
building an AGI that competently optimizes for our values for 1,000,000,000 years.
Just, don’t build that. No one is trying to build that. It’s a hard problem, we don’t know what our values are, it’s difficult. We can instead build an AGI that wants to help humans do whatever it is they want to do, help them figure out what they want to do, assist human development and flourishing at any given point in time, and continually learn from humans to figure out what should be done. In this version, humans are a control mechanism for the AI, and so we should expect the problem to be a lot easier.
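As a sketch of the contrast with the loop above (again, every name is hypothetical): the gate on each update is continual human feedback rather than a zero-shot certificate.

```python
# Hypothetical sketch: humans as the course-correcting control mechanism.
def run_assistant(initial_plan, world, humans, num_iterations):
    plan = initial_plan
    for _ in range(num_iterations):
        observations = world.execute(plan)
        candidate = plan.updated(observations)
        feedback = humans.review(candidate, observations)  # continual human input
        plan = candidate if feedback.approved else plan.revised(feedback)
    return plan
```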
I suspect I’m much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it’s doing something like 1,000,000 years’ worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What’s your take on this?
I was talking about the AI as a control mechanism for the task that needs to be done, not a control mechanism that course-corrects the AI. I don’t expect there to be a particular subsystem of the AI that is responsible for course correction, just as there isn’t a particular subsystem in the human brain responsible for thinking “Huh, I guess now that condition X has arisen, I should probably take action Y in order to deal with it”.
I have no intuition for why there should be daemons, what they look like, how they take over the AI, etc., especially if they are different in kind from Solomonoff daemons. This basically sounds to me like positing the existence of a bad thing, and so concluding that we get bad things. I’m sure there’s more to your intuition, but I don’t know what it is and don’t share the intuition.
Executing and reevaluating plans many times seems like it’s fine, ignoring cases where the environment gets too difficult for the AI to deal with (i.e. the AI is incompetent). I expect the evaluation process to be robust by constantly getting human input.
I should clarify a few more background beliefs:
I think zero-shot reasoning is probably not very helpful for the first AGI, and will probably not help much with daemons in our first AGI.
I agree that right now, nobody is trying to (or should be trying to) build an AGI that’s competently optimizing for our values for 1,000,000,000 years. (I’d want an aligned, foomed AGI to be doing that.)
I agree that if we’re not doing anything as ambitious as that, it’s probably fine to rely on human input.
I agree that if humanity builds a non-fooming AGI, they could coordinate around solving zero-shot reasoning before building a fooming AGI in a small fraction of the first 10,000 years (perhaps with the help of the first AGI), in which case we don’t have to worry about zero-shot reasoning today.
Conditioning on reasonable international coordination around AGI at all, I give 50% to coordination around intelligence explosions. I think the likelihood of this outcome rises with the amount of legitimacy zero-shot reasoning has at coordination time, which is my main reason for wanting to work on it today. (If takeoff is much slower I’d give something more like 80% to coordination around intelligence explosions, conditional on international coordination around AGIs.)
Let me now clarify what I mean by “foomed AGI”:
A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. (“Optimally optimized optimizer” is another way of putting it.)
You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it’s more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
In this “nuclear explosion” of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.
In this comment thread, I was referring primarily to foomed AGIs, not the first AGIs we build. I imagine you either having a different picture of takeoff, or thinking something like “Just don’t build a foomed AGI. Just like it’s way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it’s way too hard to build a safe foomed AGI, so let’s just not do it”. And my position is something like “It’s probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let’s do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right.”
I’m happy to delve into your individual points, but before I do so, I’d like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.
A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. (“Optimally optimized optimizer” is another way of putting it.)
I have a strong intuition that “optimal algorithms for intelligence per relevant unit of computation” don’t exist. There are lots of no-free-lunch theorems around this. Intelligence is contextual; as a concrete example, children are better than adults in novel situations with unusual causal factors (https://cocosci.berkeley.edu/tom/papers/LabPublications/GopnicketalYoungLearners.pdf). In AI, the explore-exploit tradeoff is quite fundamental and it seems unlikely that you can find a fully general solution to it.
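For reference, one standard formalization in this family is (if I’m recalling it correctly) the Wolpert–Macready no-free-lunch theorem for search/optimization: for any two algorithms, performance averaged over all possible objective functions is identical. Roughly, for any pair of algorithms $a_1, a_2$ and any number of evaluations $m$,

$$\sum_{f} P(d_m^y \mid f, m, a_1) = \sum_{f} P(d_m^y \mid f, m, a_2),$$

where the sum ranges over all objective functions $f : \mathcal{X} \to \mathcal{Y}$ and $d_m^y$ is the sequence of observed objective values.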
I still don’t know what “intelligence explosion within a year” means; is it relative to human intelligence? The intelligence of the previous AGI? Along what metric are you measuring intelligence? If I consider the “reasonable view” of what these terms mean, I expect that there will never be an intelligence explosion that would be considered “fast” (in the way that an AGI intelligence explosion in a year would be “fast” to us) by the next-most intelligent system that exists.
You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it’s more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
I’m not sure what I’m supposed to get out of the analogy. If you’re saying that a foomed AGI is way more powerful than the first AGI, sure. If you’re saying they can do qualitatively different things, sure.
I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
I don’t know if I’d say I expect this, but I do consider this scenario often so I’m happy to talk about it, and I have been assuming that during this discussion.
In this “nuclear explosion” of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.
I’m still very unclear on how you’re operationalizing an intelligence explosion. If an intelligence explosion happens only after a million iterations of AGI systems improving themselves, then this seems true to me, but also the humans will have AGI systems that are way smarter than them to assist them during this time.
I imagine you either having a different picture of takeoff, or thinking something like “Just don’t build a foomed AGI. Just like it’s way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it’s way too hard to build a safe foomed AGI, so let’s just not do it”.
I think it’s the first. I’m much more sympathetic to the picture of “slow” takeoff in Will AI See Sudden Progress? and Takeoff speeds. I don’t imagine ever building a very capable AI that explicitly optimizes a utility function, since a multiagent system (i.e. humanity) is unlikely to have a utility function. However, I can imagine building a safe foomed AGI.
And my position is something like “It’s probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let’s do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right.”
It would be quite surprising to me if the right thing to do to ensure that nation-states and individual actors understand this point would be to formalize zero-shot reasoning.
In addition, I could imagine building a safe foomed AGI that is corrigible and so does not require a solution to metaphilosophy; but I’m happy to consider the case where that is necessary (which seems decently likely to me). In those worlds, I expect that we’ll be able to use the first AGI systems to help us figure out metaphilosophy.
I’m happy to delve into your individual points, but before I do so, I’d like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.
What takeoff looks like, what the notion of “intelligence” is, what an “intelligence explosion” consists of, the usefulness of initial AI systems in aligning future, more powerful AI systems, what daemons are.
Also, on a more epistemic note, how much weight to put on long chains of reasoning that rely on soft, intuitive concepts, and how much to trust intuitions about tasks longer than ~100 years.