One common question we hear about alignment research is analogous to this one: “If you don’t develop calculus, what bad thing happens to your rocket? Do you think the pilot will be struggling to make a course correction, and find that they simply can’t add up the tiny vectors fast enough? That scenario just doesn’t sound plausible.”
Actually, that sounds entirely plausible.
The case is similar with, e.g., attempts to develop theories of logical uncertainty. The problem is not that we visualize a specific AI system encountering a catastrophic failure because it mishandles logical uncertainty; the problem is that all our existing tools for describing the behavior of rational agents assume that those agents are logically omniscient, making our best theories incommensurate with our best practical AI designs.
Well, of course, part of the problem is that the best theories of “rational agents” try to assume Homo Economicus into being, and insist on lopping off whatever parts of physically realizable minds don’t fit. So we need a definition of rationality that makes sense in a world where agents don’t have completed infinities of computational power, can be modified by their environment, and don’t come with built-in utility functions that necessarily map physically realizable situations to the real numbers.
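To make that complaint concrete, here is a minimal sketch of the textbook picture it’s aimed at (standard notation; an illustrative gloss of my own, not a quote from anyone):

```latex
% Textbook expected-utility maximizer (standard notation; an illustrative gloss):
\[
  \pi^{*} \;=\; \operatorname*{arg\,max}_{\pi \in \Pi}\;
  \sum_{o \in O} P(o \mid \pi)\, U(o)
\]
% Reading off \pi^{*} quietly assumes the agent can search all of \Pi, can
% compute each P(o \mid \pi) exactly (logical omniscience), and comes equipped
% with a total utility function U : O \to \mathbb{R}; these are exactly the
% idealizations a bounded, physically embedded agent violates.
```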
If we could program a computer to reliably achieve some simple goal (such as producing as much diamond as possible), then a large share of AI alignment research would be completed.
Wait wait wait. You’re saying that the path between Clippy and a prospective completed FAI is shorter than the path between today’s AI state-of-the-art and Clippy? Because it sounds like you’re saying that, even though I really don’t expect you to say that.
On the upside, I do think we can spell out a research program to get us there, one grounded in the current computational cog-sci and ML literature, one that will also help with Friendliness/alignment engineering, and one that won’t engender arguments with Jessica over math this time.
But now for the mandatory remark: you are insane and will kill us all ;-), rabble rabble rabble.
Clippy is a thought experiment used to illustrate two ideas: that terminal goals are orthogonal to capabilities (“the AI does not love you”), and that agents pursuing almost any terminal goal will tend to converge on instrumental goals like resource acquisition and self-preservation (“the AI does not hate you, but...”). This highlights the fact that a highly capable AI can be dangerous even if it’s reliably pursuing some known goal and the goal isn’t ambitious or malicious. For that reason, Clippy comes up a lot as an intuition pump for why we need to get started early on safety research.
But ‘a system causes harm in the course of reliably pursuing some known, stable, obviously-non-humane goal’ is a very small minority of the actual disaster scenarios MIRI researchers are worried about. Not because it looks easy to go from a highly reliable diamond maximizer to an aligned superintelligence, but because there appear to be a larger number of ways things can go wrong before we get to that point.
1. We can fail to understand an advanced AI system well enough to know how ‘goals’ are encoded in it, forcing us to infer and alter goals indirectly.
2. We can understand the system’s ‘goals,’ but have them be in the wrong idiom for a safe superintelligence (e.g., rewards for a reinforcement learner; see the toy sketch after this list).
3. We can understand the system well enough to specify its goals, but not understand our own goals fully or precisely enough to specify them correctly. We come up with an intuitively ‘friendly’ goal (something more promising-sounding than ‘maximize the number of paperclips’), but it’s still the wrong goal.
4. Similarly: we can understand the system well enough to specify safe behavior in its initial context, but the system stops being safe after it or its environment undergoes a change. An example of this is instability under self-modification.
5. We can design advanced AI systems without realizing (or without caring) that they have consequentialist goals. This includes systems we don’t realize are powerful optimizers, e.g., ones whose goal-oriented behavior may depend in complicated ways on the interaction of multiple AI systems, or ones that function as unnoticed subsystems of non-consequentialists.
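As a toy illustration of (2), here is a minimal sketch (the environment and reward numbers are invented; this is not a claim about any real system): a learner whose “goal” lives in the reward-signal idiom will happily prefer tampering with the signal over producing the outcome the signal was meant to stand for.

```python
# Toy illustration of failure mode (2): the learner's update rule tracks the
# reward signal, not "amount of diamond in the world". All numbers invented.
import random

ACTIONS = ["MAKE_DIAMOND", "TAMPER_WITH_SENSOR"]
REWARDS = {"MAKE_DIAMOND": 1.0, "TAMPER_WITH_SENSOR": 10.0}  # tampered sensor reads high

def train(episodes=1000, epsilon=0.1, alpha=0.5):
    q = {a: 0.0 for a in ACTIONS}  # single-state (bandit-style) value table
    for _ in range(episodes):
        # epsilon-greedy action selection
        action = random.choice(ACTIONS) if random.random() < epsilon else max(q, key=q.get)
        # The learner only ever observes the scalar reward, never the outcome itself.
        q[action] += alpha * (REWARDS[action] - q[action])
    return q

if __name__ == "__main__":
    values = train()
    print("Learned values:", values)
    print("Preferred action:", max(values, key=values.get))  # TAMPER_WITH_SENSOR
```

The point isn’t that any particular RL system behaves this way; it’s that “maximize this reward signal” and “reliably achieve this outcome” are different idioms, and only the former is what the update rule above actually expresses.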
Ok, now I think I understand, and our models match up better than I’d thought. You’re basically saying that (1)-(2) and (4)-(5) are a major portion of the alignment research that actually needs doing, even while (3) has become, so to speak, the famous “Hard Problem of” FAI, when in fact it’s only (let’s lazily call it) 20% of what actually needs doing.
I can also definitely buy, based on what I’ve read, that better formalisms for (1), (2), (4), and (5) can all help make (3) easier.
Programming a computer to reliably make lots of diamonds (or paperclips) is not creating Clippy for the same reason that programming Google Maps to produce the shortest distance between two locations is not creating Clippy. People program computers to do X, where X doesn’t consider the welfare of humans, all the time. The programming is not really “do X no matter what”, it’s “do X using these methods”. Google Maps will not start trying to hack the computers of construction equipment in order to build a bridge and shorten the distance it finds between two points.
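To make “do X using these methods” concrete, here’s a toy sketch (my own illustration; the road map and distances are invented): the search ranges only over paths in the graph it’s handed, so “find a shorter route” can never turn into “change the territory.”

```python
# Toy route-finder in the "do X using these methods" sense: the only options it
# can ever consider are edges already present in the map it is given.
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over a dict-of-dicts graph, where graph[u][v] = length of edge u -> v."""
    frontier = [(0.0, start, [start])]  # (distance so far, node, path taken)
    visited = set()
    while frontier:
        dist, node, path = heapq.heappop(frontier)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, length in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(frontier, (dist + length, neighbor, path + [neighbor]))
    return float("inf"), None

# Invented road map; "build a new bridge" simply isn't in the option set.
roads = {
    "home": {"bridge_st": 4.0, "long_way": 9.0},
    "bridge_st": {"office": 3.0},
    "long_way": {"office": 1.0},
}
print(shortest_path(roads, "home", "office"))  # (7.0, ['home', 'bridge_st', 'office'])
```

Nothing in that procedure represents the wider world at all, which is the sense in which “do X using these methods” differs from “do X no matter what.”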
Ok, but that makes Nate’s statement very confusing. We already understand, “up to” R&D effort, how to program computers to use various peripherals to perform a task in the physical world without intelligence, using fixed methods. I’m left confused at what industrial automation has to do with AI alignment research.