Thanks for recording this conversation! Some thoughts:
AI development will be relatively gradual and AI researchers will correct safety issues that come up.
I was pretty surprised to read the above—most of my intuitions about AI come down to repeatedly hearing the point that safety issues are very unpredictable and high variance, and that once a major safety issue happens, it’s already too late. The arguments I’ve seen for this (many years of Eliezer-ian explanations of how hard it is to come out on top against superintelligent agents who care about different things than you) also seem pretty straightforward. And Rohin Shah isn’t a stranger to them. So what gives?
Well, look at the summary at the top of the full transcript link. Here are some quotes reflecting the point of Rohin’s that I find most interesting:
From the summary:
Shah doesn’t believe that any sufficiently powerful AI system will look like an expected utility maximizer.
and, in more detail, from the transcript:
Rohin Shah: … I have an intuition that AI systems are not well-modeled as, “Here’s the objective function and here is the world model.” Most of the classic arguments are: Suppose you’ve got an incorrect objective function, and you’ve got this AI system with this really, really good intelligence, which maybe we’ll call it a world model or just general intelligence. And this intelligence can take in any utility function, and optimize it, and you plug in the incorrect utility function, and catastrophe happens.
This does not seem to be the way that current AI systems work. It is the case that you have a reward function, and then you sort of train a policy that optimizes that reward function, but… I explained this the wrong way around. But the policy that’s learned isn’t really… It’s not really performing an optimization that says, “What is going to get me the most reward? Let me do that thing.”
If I was very convinced of this perspective, I think I’d share Rohin’s impression that AI Safety is attainable. This is because I also do not expect highly strategic and agential actions focused on a single long-term goal to be produced by something that “has been given a bunch of heuristics by gradient descent that tend to correlate well with getting high reward and then it just executes those heuristics.” To elaborate on some of this with my own perspective:
If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning
If our superintelligent AI gets punished based on any proxy for “misleading humans” and it can’t do super-long-term planning, it is unlikely to come up with a good reward-attaining strategy that involves misleading humans
If our superintelligent AI does somehow develop a heuristic that misleads humans, it is yet more unlikely that the heuristic will be immediately well-developed enough to mislead humans long enough to cause an extinction level event. Instead, it will probably mislead the humans for more short-term gains at first—which will allow us to identify safety measures in advance
So I agree that we have a good chance of ensuring that this kind of AI is safe—mainly because I don’t think the level of heuristics involved invoke an AI take-off slow enough to clearly indicate safety risks before they become x-risks.
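To make the distinction from the transcript concrete for myself, here is a toy sketch of the two pictures: a system that takes in any objective and searches for the best action, versus one that just executes whatever heuristics training left behind. Everything in it is made up for illustration (the line world, `plan`, `train_heuristics`, the two reward functions), and the one-step-lookahead table is only a crude stand-in for what gradient descent would actually produce.

```python
from itertools import product

ACTIONS = (-1, +1)
N_STATES = 7  # hypothetical toy world: positions 0..6 on a line


def step(state, action):
    """Move left or right, clamped to the ends of the line."""
    return min(max(state + action, 0), N_STATES - 1)


def plan(state, reward_fn, horizon=4):
    """Explicit optimizer: score every action sequence under reward_fn
    and return the first action of the best one."""
    def total_reward(actions):
        s, total = state, 0.0
        for a in actions:
            s = step(s, a)
            total += reward_fn(s)
        return total

    best = max(product(ACTIONS, repeat=horizon), key=total_reward)
    return best[0]


def train_heuristics(reward_fn):
    """Stand-in for training: store, per state, the action that looked
    best one step ahead. After this, the reward is never consulted again."""
    return {
        s: max(ACTIONS, key=lambda a: reward_fn(step(s, a)))
        for s in range(N_STATES)
    }


def reward_right(s):
    return float(s)   # higher positions are better


def reward_left(s):
    return float(-s)  # lower positions are better


policy = train_heuristics(reward_right)  # heuristics fit to the first reward

print(plan(3, reward_right))  # planner under the original objective
print(plan(3, reward_left))   # swap the objective: the planner follows at once
print(policy[3])              # the table keeps doing what it was trained to do
```

The planner changes direction the moment you plug in a different reward, which is the setup the classic arguments assume. The table does not, because nothing in it asks "what gets me the most reward?" at run time. This says nothing about which picture real systems will resemble; it only pins down what the two pictures are.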
On the other hand, while I agree with Rohin and Hanson’s side that there isn’t One True Learning Algorithm, there are potentially a multitude of advanced heuristics that approximate extremely agent-y and strategic long-term optimizations. We even have a real-life, human-level example of this. His name is Eliezer Yudkowsky[1]. Moreover, if I got an extra fifty IQ points and a slightly different set of ethics, I wouldn’t be surprised if the set of heuristics composing my brain could be an existential threat. I think Rohin would agree with this belief in heuristic kludges that are effectively agential despite not being a One True Algorithm and, alone, this belief doesn’t imply existential risk. If these agenty heuristics manifest gradually over time, we can easily stop them just by noticing them and turning the AI off before they get refined into something truly dangerous.
However, I don’t think that machine-learned heuristics are the only way we can get highly dangerous agenty heuristics. We’ve made a lot of mathematical progress on understanding logic, rationality and decision theory and, while machine-learned heuristics may figure out approximately Perfect Reasoning Capabilities just by training, I think it’s possible that we can directly hardcode heuristics that do the same thing based on our current understanding of things we associate with Perfect Reasoning Capabilities.
In other words, I think that the dangerously agent-y heuristics which we can develop through gradual machine-learning processes could also be developed by a bunch of mathematicians teaming up and building a kludge that is similarly agent-y right out of the box. The former possibility is something we can mitigate gradually (for instance, by not continuing to build AI once they start doing things that look too agent-y) but the latter seems much more dangerous.
Of course, even if mathematicians could directly kludge some heuristics that can perform long-term strategic planning, implementing such a kludge seems obviously dangerous to me. It also seems rather unnecessary. If we could also just get superintelligent AI that doesn’t do scary agent-y stuff by just developing it as a gradual extension of our current machine-learning technology, why would you want to do it the risky and unpredictable way? Maybe it’d be orders of magnitude faster but this doesn’t seem worth the trade—especially when you could just directly improve AI-compute capabilities instead.
As of finishing this comment, I think I’m less worried about AI existential risks than I was before.
[1] While this sentence might seem glib, I phrased it the way I did specifically because, while most people display agentic behaviors, most of us aren’t that agentic in general. I do not know Eliezer personally, but the person who wrote a whole set of sequences on rationality, developed a new decision theory and started up a new research institute focused on saving the world is the best example of an agenty person I can come up with off the top of my head.
I enjoyed this comment, thanks for thinking it through! Some comments:
If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning
This is not my belief. I think that powerful AI systems, even if they are a bunch of well developed heuristics, will be able to do super-long-term planning (in the same way that I’m capable of it, and I’m a bunch of heuristics, or Eliezer is, to take your example).
Obviously this depends on how good the heuristics are, but I do think that heuristics will get to the point where they do super-long-term planning, and my belief that we’ll be safe by default doesn’t depend on assuming that AI won’t do long-term planning.
I think Rohin would agree with this belief in heuristic kludges that are effectively agential despite not being a One True Algorithm
Yup, that’s correct.
So I agree that we have a good chance of ensuring that this kind of AI is safe—mainly because I don’t think the level of heuristics involved invoke an AI take-off slow enough to clearly indicate safety risks before they become x-risks.
Should “I don’t think” be “I do think”? Otherwise I’m confused. With that correction, I basically agree.
However, I don’t think that machine-learned heuristics are the only way we can get highly dangerous agenty heuristics. We’ve made a lot of mathematical progress on understanding logic, rationality and decision theory and, while machine-learned heuristics may figure out approximately Perfect Reasoning Capabilities just by training, I think it’s possible that we can directly hardcode heuristics that do the same thing based on our current understanding of things we associate with Perfect Reasoning Capabilities.
I would be very surprised if this worked in the near term. Like, <1% in 5 years, <5% in 20 years, and really I want to say < 1% that this is the first way we get AGI (no matter when), but I can’t actually be that confident.
My impression is that many researchers at MIRI would qualitatively agree with me on this, though probably with less confidence.
This is not my belief. I think that powerful AI systems, even if they are a bunch of well developed heuristics, will be able to do super-long-term planning (in the same way that I’m capable of it, and I’m a bunch of heuristics, or Eliezer is, to take your example).
Yeah, I intended that statement to be more of an elaboration on my own perspective than to imply that it represented your beliefs. I also agree that it’s wrong in the context of the superintelligent AI we are discussing.
Should “I don’t think” be “I do think”? Otherwise I’m confused.
Yep! Thanks for the correction.
I would be very surprised if this worked in the near term. Like, <1% in 5 years, <5% in 20 years and really I want to say < 1% that this is the first way we get AGI (no matter when)
Huh, okay… On reflection, I agree that directly hardcoded agent-y heuristics are unlikely to happen because AI-Compute tends to beat it. However, I continue to think that mathematicians may be able to use their knowledge of probability & logic to cause heuristics to develop in ways that are unusually agent-y at a fast enough rate to imply surprising x-risks.
This mainly boils down to my understanding that similarly well-performing but different heuristics for agential behavior may have very different potentials for generalizing to agential behavior on longer time-scales/chains-of-reasoning than the ones trained on. Consequently, I think there are particular ways of defining AI problem objectives and AI architecture that are uniquely suited to AI becoming generally agential over arbitrarily long time-frames and chains of reasoning.
However, I think we can address this kind of risk with the same safety solutions that could help us deal with AIs that just have significantly better reasoning capabilities than us (but do not have reasoning capabilities that have fully generalized!). Paul Christiano’s work on amplification, for instance.
So the above is only a concern if people a) deliberately try to get AI in the most reckless way possible and b) get lucky enough that it doesn’t get bottle-necked somewhere else. I’ll buy the low estimates you’re providing.
Suppose [...] you’ve got this AI system with this really, really good intelligence, which maybe we’ll call it a world model or just general intelligence. And this intelligence can take in any utility function, and optimize it, and you plug in the incorrect utility function, and catastrophe happens.
I’ve seen various people make the argument that this is not how AI works and it’s not how AGI will work—it’s basically the old “tool AI” vs “agent AI” debate. But I think the only reason current AI doesn’t do this is because we can’t make it do this yet: the default customer requirement for a general intelligence is that it should be able to do whatever task the user asks it to do.
So far the ability of AI to understand a request is very limited (poor natural language skills). But once you have an agent that can understand what you’re asking, of course you would design it to optimize new objectives on request, bounded of course by some built-in rules about not committing crimes or manipulating people or seizing control of the world (easy, I assume). Otherwise, you’d need to build a new system for every type of goal, and that’s basically just narrow AI.
If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning
If the heuristics are optimized for “be able to satisfy requests from humans” and those requests sometimes require long-term planning, then the skill will develop. If it’s only good at satisfying simple requests that don’t require planning, in what sense is it superintelligent?
I am not arguing that we’ll end up building tool AI; I do think it will be agent-like. At a high level, I’m arguing that the intelligence and agentiness will increase continuously over time, and as we notice the resulting (non-existential) problems we’ll fix them, or start over.
I agree with your point that long-term planning will develop even with a bunch of heuristics.
If the heuristics are optimized for “be able to satisfy requests from humans” and those requests sometimes require long-term planning, then the skill will develop. If it’s only good at satisfying simple requests that don’t require planning, in what sense is it superintelligent?
Yeah, that statement is wrong. I was trying to make a more subtle point about how an AI that learns long-term planning on a shorter time-frame is not necessarily going to be able to generalize to longer time-frames (but in the context of superintelligent AIs capable of doing human-level tasks, I do think it will generalize, so that point is kind of irrelevant). I agree with Rohin’s response.