By the way, here’s my account of the motivation for this problem:
Let’s say you start with an AI that is superhuman at engineering. You want to ask it to do a simple task (like make you burritos) without risking vast unforeseen consequences. So you let it passively scan a bunch of human-made burritos, and ask it to make you a burrito. There are a couple of interesting failure modes:
The space of acceptable burritos, as a subset of configurations of atoms, is a really narrow and twisty target. If you take the set of configurations that are closer to burrito 1 than any other burrito in the training set is, the vast majority of those configurations would be toxic to humans, some of them contain self-replicating nanobots, etc. Of course, there are ways of representing concepts such that the essential aspects of acceptable burritos (like being made out of a specific set of organic molecules) are more likely to be found. This is the problem of identifying the correct measure b in the first place.
Having the AI create nanotech is pretty risky, and for this task we’d prefer that it stick to more boring engineering like agricultural and culinary robots. But “don’t make any nanotech” is not a natural command, since it’s hard to specify “nanotech” without examples, and since there are plenty of creative nanotech-like things that wouldn’t even occur to us to rule out. So we want to either give it parameters for what it can do (this gives us the feasible set f, which is unlikely to exactly contain any of our examples), or somehow set things up so that the boring engineering tasks are the optimal way to satisfy the problem. (This is also why “exactly clone one of the example burritos” is not a great solution, since that itself requires nanotech.)