[Disclaimer: I’m reading this post for the first time now, as of 1/11/2020. I also already have a broad understanding of the importance of AI safety. While I am skeptical about MIRI’s approach to things, I am also a fan of MIRI. Where this puts me relative to the target demographic of this post, I cannot say.]
Overall Summary
I think this post is pretty good. It’s a solid and well-written introduction to some of the intuitions behind AI alignment and the fundamental research that MIRI does. At the same time, the use of analogy made the post more difficult for me to parse and hid some important considerations about AI alignment from view. It may be good (though not optimal) for introducing some people to the problem of AI alignment and a subset of MIRI’s work, but, as someone who already understood AGI safety to be important, it neither raised nor lowered my opinion of MIRI.
To be clear, I do not consider any of these weaknesses serious, because I believe them to be largely irrelevant to the audience of people who don’t yet appreciate the importance of AI-Safety. Still, they are relevant to the audience of people who give AI-Safety the appropriate scrutiny but remain skeptical of MIRI. And I think this latter audience is important enough for me to assign this article a “pretty good” instead of a “great”.
I hope a future post directly explores the merits of MIRI’s work in the context of AI alignment, without use of analogy.
Below is an overview of my likes and dislikes in this post. I will go into more detail about them in the next section, “Evaluating Analogies.”
Things I liked:
It’s a solid introduction to AI-alignment, covering a broad range of topics including:
Why we shouldn’t expect aligned AGI by default
How modern conversation about AGI behavior is problematically underspecified
Why fundamental deconfusion research is necessary for solving AI-alignment
It directly explains the value/motivation of particular pieces of MIRI work via analogy—which is especially nice given that it’s hard for the layman to actually appreciate the mathematically complex stuff MIRI is doing
On the whole, the analogy is elegant
Things I disliked:
Analogizing AI alignment to rocket alignment created a framing that hid important aspects of AI alignment from view and (unintentionally) stacked the deck in favor of MIRI.
A criticism of rocket alignment research with a plausible AI alignment analog was neglected (and could only be addressed by breaking the analogy).
An argument in favor of MIRI for rocket alignment had an AI analog that was much less convincing when considered in the context of AI alignment unique facts.
Mapping the rocket alignment problem onto the AI alignment problem took more cognitive effort than directly reading justifications of AI alignment and of MIRI would have
The world-building wasn’t great
The actual world of the dialogue is counterintuitive—imagine a situation where planes and rockets exist (or don’t exist, but are being theorized about), yet no one knows calculus (despite modeling cannonballs pretty well) or how centripetal force and gravity work. It’s hard for me to parse the exact epistemic meaning of any given statement relative to that world.
The world-building wasn’t particularly clear—it took me a while to completely parse that calculus hadn’t been invented.
There are a lot of asides where Beth (a stand-in for a member of MIRI) makes nontrivial scientific claims that we know to be true. While this is technically justified (MIRI does math and is unlikely to make claims that are wrong; and Eliezer has been right about a lot of stuff and does deserve credit), it probably just feels smug and irritating to people who are MIRI-skeptics, aka this post’s probable target.
Evaluating Analogies
Since this post is intended as an analogy to AI alignment, evaluating its insights requires two steps. First, one must re-interpret the post in the context of AI alignment. Second, one must take that re-interpretation and see whether it holds up. This means that, if I criticize the content of this post, my criticism might be directly in error, or my interpretation of the post could be in error.
1. The Alignment Problem Analogy:
Overall, I think the analogy between the Rocket Alignment Problem and the AI Alignment Problem is pretty good. Structurally speaking, they’re identical and I can convert one to the other by swapping words around:
Rocket Alignment: “We know the conditions rockets fly under on Earth but, as we make our rockets fly higher and higher, we have reasons to expect those conditions to break down. Things like wind and weather conditions will stop being relevant and other weird conditions (like whatever keeps the Earth moving around the sun) will take hold! If we don’t understand those, we’ll never get to the moon!”
AI Alignment: “We know the conditions that modern AI performs under right now, but as we make our AI solve more and more complex problems, we have reason to expect those conditions to break down. Things like model overfitting and sample-size limitations will stop being relevant and other weird conditions (like noticing problems so subtle and possible decisions so clever that you as a human can’t reason about them) will take hold! If we don’t understand those, we’ll never make an AI that does what we want!”
1a. Flaws In the Alignment Problem Analogy:
While the alignment problem analogy is pretty good, it leaves out the key and fundamentally important fact that failed AI Alignment will end the world. While it’s often not a big deal when an analogy isn’t completely accurate, missing this fact leaves MIRI-skeptics with a pretty strong counter-argument that can only exist outside of the analogy:
In Rocket Alignment terms -- “Why bother thinking about all this stuff now? If conditions are different in space, we’ll learn that once we start launching things into space and seeing what happens to them. This sounds more efficient than worrying about cannonballs.”
In AI Alignment terms -- “Why bother thinking about all this stuff now? If conditions are different when AI start getting clever, we’ll learn about those differences once we start making actual AI that are clever enough to behave like agents. This sounds more efficient than navel-gazing about mathematical constructs.”
If you explore this counter-argument and its counter-counter-argument deeper, the conversation gets pretty interesting:
MIRI-Skeptic: Fine, okay. The analogy breaks down there. We can’t empirically study a superintelligent AI safely. But we can make AI that are slightly smarter than us and put security mechanisms around them that only AI extremely smarter than us would be expected to break. Then we can learn experimentally from the behavior of those AI how to make clever AI safe. Again, easier than navel-gazing about mathematical constructs, and we might expect this to happen because of slow take-off.
MIRI-Defender: First of all, there’s no theoretical reason we would expect to be able to extrapolate the behavior of slightly clever AI to the behavior of extremely clever AI. Second, we have empirical reasons for thinking your empirical approach won’t work. We already did a test-run of your experiment proposal with a slightly clever being; we put Eliezer Yudkowsky in an inescapable box armed with only a communication tool and the guard let him out (twice!).
MIRI-Skeptic: Fair enough but… [Author’s Note: There are further replies to MIRI-Defender but this is a dialogue for another day]
Given that this post is supposed to address MIRI skeptics and that the aforementioned conversation is extremely relevant to judging the benefits of MIRI, I consider the inability to address this argument to be a flaw—despite it being an understandable flaw in the context of the analogy used.
2. The Understanding Intractably Complicated Things with Simple Things Analogy:
I think that this is a cool insight (with parallels to inverse-inverse problems) and the above post captures it very well. Explicitly, the analogy is this: “Rocket Alignment to Cannonballs is like AI Alignment to tiling agents.” Structurally speaking, they’re identical and I can convert one to the other by swapping words around:
Rocket Modeling: “We can’t think about rocket trajectories using actual real rockets under actual real conditions because there are so many factors and complications that can affect them. But, per the rocket alignment problem, we need to understand the weird conditions that rockets need to deal with when they’re really high up and these conditions should apply to a lot of things that are way simpler than rockets. So instead of dealing with the incredibly hard problem of modeling rockets, let’s try really simple problems using other high-up fast-moving objects like cannonballs.”
AI Alignment: “We can’t think about AI behavior using actual AI under actual real conditions because there are so many factors and complications that can affect them. But, per the AI alignment problem, we need to understand the weird conditions that AI need to deal with when they’re extremely intelligent, and these conditions should apply to a lot of things that are way simpler than modern AI. So instead of dealing with the incredibly hard problem of modeling AI, let’s try really simple problems using other decision-making things, like tiling agents.”
3. The “We Need Better Mathematics to Know What We’re Talking About” Analogy
I really like just how perfect this analogy is. The way that an AI’s “trajectory” and a literal physical rocket trajectory line up feels nice.
Rocket Alignment: “There’s a lot of trouble figuring out exactly where a rocket will go at any given moment as it’s going higher and higher. We need calculus to make claims about this.”
AI alignment: “There’s a lot of trouble figuring out exactly what an AI will do at any given moment as it gets smarter and smarter (i.e. self-modification, but also just in general). We need to understand how to model logical uncertainty to even say anything about its decisions.”
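To make the rocket half of this concrete (a toy sketch of my own, not anything from the post or from MIRI): once you can write down the law of motion, even a crude discretization of it tells you where the cannonball is at any moment, which is exactly the kind of statement you cannot make from a table of past observations alone.

```python
import math

# My own toy illustration of the rocket half of analogy #3: with the
# differential equations of motion in hand (here discretized via Euler
# steps, drag ignored), you can ask where the projectile is at *any* time.
def trajectory(v0, angle_deg, dt=0.01, g=9.81):
    vx = v0 * math.cos(math.radians(angle_deg))
    vy = v0 * math.sin(math.radians(angle_deg))
    x = y = t = 0.0
    points = []
    while y >= 0.0:
        points.append((t, x, y))
        x += vx * dt
        vy -= g * dt          # dv_y/dt = -g, discretized
        y += vy * dt
        t += dt
    return points

# Where is a cannonball fired at 100 m/s and 45 degrees after 3 seconds?
path = trajectory(100, 45)
print(min(path, key=lambda p: abs(p[0] - 3.0)))   # roughly (3.0, ~212 m, ~168 m)
```

The AI half has no equally tidy counterpart yet, which is rather the post’s point: the analog of “calculus” for reasoning about smarter-than-human decision-making still has to be invented.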
4. The “Mathematics Won’t Give Us Accurate Models But It Will Give Us the Ability to Talk Intelligently” Analogy
This analogy basically works...
Rocket Alignment: “We can’t use math to accurately predict rockets in real life, but we need some of it so we can even reason about what rockets might do. Also, we expect our math to get more accurate when the rockets get higher up.”
AI alignment: “We can’t use math to accurately predict AGI in real life, but we need some of it so we can even reason about what AGI might do. Also, we expect our math to get more accurate when the AGI gets way smarter.”
I also enjoy the way this discussion lightly captures the frustration that the AI Safety community has felt. Many skeptics claim their AGIs won’t become misaligned but never specify the details of why that wouldn’t happen. And when AI Safety proponents produce situations where the AGI does become misaligned, the skeptics move the goalposts.
4a. Flaws in the “Mathematics Won’t Give Us Accurate Models But It Will Give Us the Ability to Talk Intelligently” Analogy
At a cursory glance, the above analogy seems to make sense. But, again, this analogy breaks down on the object level. I’d expect being able to talk precisely about what conditions affect movement in space to help us make better claims about how a rocket would get to the moon, because getting to the moon is just moving in space in a particular way. The research (if successful) completes the set of knowledge needed to reach the goal.
But being able to talk precisely about the trajectory of an AGI doesn’t really help us talk precisely about getting to the “destination” of friendly AGI for a couple reasons:
For rocket trajectories, there are clear control parameters that can be used to exploit the predictions made by a good understanding of how trajectories work (see the short sketch after this list for what I mean). But for AI alignment, I’m not sure what would constitute a control parameter that would exploit a hypothetical good understanding of what strategies superintelligent beings use to make decisions.
For rocket trajectories, the knowledge set of how to get a rocket to a point in outer space and how to predict the trajectories of objects in outer space basically encompasses the things one would need to know to get that rocket to the moon. For AGI trajectories, the trajectory depends on three things: its decision theory (a la logical uncertainty, tiling agents, decision theory...), the actual state of the world that the AGI perceives (which is fundamentally unknowable to us humans, since the AGI will be much more perceptive than us), and its goals (which are well-known to be orthogonal to the AGI’s actual strategy algorithms).
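Here is the promised sketch of what I mean by a control parameter on the rocket side (an entirely hypothetical, drag-free toy example of my own): once a trajectory model exists, you can invert it to pick the launch angle that achieves a desired range.

```python
import math

# Toy example of a "control parameter" (my own sketch, not from the post):
# a drag-free range model, plus its inversion from desired range to launch
# angle. The angle is the knob we turn to exploit the model's predictions.
def predicted_range(angle_deg, v0=300.0, g=9.81):
    return v0 ** 2 * math.sin(2 * math.radians(angle_deg)) / g

def angle_for_range(target_m, v0=300.0, g=9.81):
    # valid only when target_m is within the maximum range v0**2 / g
    return 0.5 * math.degrees(math.asin(target_m * g / v0 ** 2))

theta = angle_for_range(5000.0)          # choose the knob setting...
print(theta, predicted_range(theta))     # ...and the model predicts ~5000 m
```

I can’t name the analogous knob for “what strategies a superintelligence uses to make decisions”, and that asymmetry is the crux of my worry here.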
Given the above, we know that scenarios where we understand agent foundations but not the goals of our agents won’t work. But, even if we do figure out the goals of our agents, it’s not obvious that controlling those superintelligent agents’ rationality skills will be a good use of our time. After all, they’ll come up with better strategies than we would.
I guess you could argue that we can view our goals as the initial conditions and then use our agent foundations to reason about the AGI’s behavior given those goals and decide if we like its choices… But again, the AGI is more perceptive than us. I’m not sure we could capably design toy circumstances for an AGI to behave under that would reflect the circumstances of reality in a meaningful way.
Also, to be fair, MIRI does work on goal-oriented stuff in addition to agent-oriented stuff. Corrigibility, which the post later links to, is an example of this. But, frankly, my expectation that this kind of thing will pan out is pretty low.
In principle, the rocket alignment analogy could’ve been written in a way that captured the above concerns. For instance, instead of asking the question “How do we get this rocket to the moon when we don’t understand how things move in outer-space?”, we could ask “How do we get this rocket to the moon when we don’t understand how things move in outer-space, we have a high amount of uncertainty about what exactly is up there in outer-space, and we don’t have specifics about what exactly the moon is?”
But that would make this a much different, and much more epistemologically labyrinthian post.
Minor Comments
1. I appreciate the analogizing of an awesome thing (landing on the moon) to another awesome thing (making a friendly AGI). The AI safety community is quite rationally focused mostly on how bad a misaligned AI would be but I always enjoy spending some time thinking about the positives.
2. I noticed that Alfonso keeps using the term “spaceplanes” and Beth never does. I might be reading into it, but my understanding is that this is done to capture how deeply frustrating it is when people talk about the thing you’re studying (AGI) as if it were something superficially similar but fundamentally different (modern machine-learning, but with better data).
However, coming into this dialogue without any background on the world involved, the apparent interchangeability of spaceplane and rocket just felt confusing.
3.
As an example of work we’re presently doing that’s aimed at improving our understanding, there’s what we call the “tiling positions” problem. The tiling positions problem is how to fire a cannonball from a cannon in such a way that the cannonball circumnavigates the earth over and over again, “tiling” its initial coordinates like repeating tiles on a tessellated floor –
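For intuition, here is roughly what the quoted problem demands physically. This is my own back-of-the-envelope sketch, assuming a spherical, airless Earth:

```python
import math

# Back-of-the-envelope numbers for the "tiling positions" problem (my own
# sketch, assuming a spherical, airless Earth): a cannonball that keeps
# returning to its firing coordinates is just in a circular orbit at the
# surface, so gravity must supply the centripetal acceleration: v**2 / R = g.
g = 9.81        # m/s^2
R = 6.371e6     # Earth's mean radius in meters

v = math.sqrt(g * R)              # required horizontal muzzle speed
period = 2 * math.pi * R / v      # time per circumnavigation

print(f"{v / 1000:.1f} km/s, {period / 60:.0f} minutes per lap")  # ~7.9 km/s, ~84 min
```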
Because of the deliberate choice to analogize tiling agents and tiling positions, I spent probably five minutes trying to figure out exactly what the relationship between tiling positions and rocket alignment meant about tiling agents and AI alignment. It seems to me tiling isn’t clearly necessary in the former (understanding any kind of trajectory should do the job) while it is in the latter (understanding how AI can guarantee similar behavior in agents it creates seems fundamentally important).
My impression now is that this was just a conceptual pun on the idea of tiling. I appreciate that but I’m not sure it’s good for this post. The reason I thought so hard about this was also because the Logical Discreteness/Logical Uncertainty analogy seemed deeper.