If the demons understand harm and are very clever in figuring out what will lead to it, what happens when we ask them to minimize harm, or maximize utility, or do the opposite of what they would want to do otherwise, or {rigidly specified version of something like this}?
Can we force demons to tell us (for instance) how they’d rank various policy packages in government, what personal choices they’d prefer I make, &c., so we can back-engineer what not to do? They’re not infinitely clever, but how clever are they?
There are ten thousand wrong solutions and four good solutions. You don’t get much info from being told a particular bad solution. The opposite of a bad solution is a bad solution.
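To put rough numbers on that (using the 10,000-to-4 split above and plain Shannon information; a quick Python sketch): singling out a good solution takes about 11.3 bits, while learning one particular bad solution rules out only 1/10,004 of the space, well under a thousandth of a bit.

```python
import math

total_bad, total_good = 10_000, 4
total = total_bad + total_good  # 10,004 candidate solutions

# Bits needed to single out one of the good solutions:
bits_needed = math.log2(total / total_good)         # ~11.29 bits

# Bits gained by being told one particular bad solution:
bits_from_one_bad = math.log2(total / (total - 1))  # ~0.00014 bits

print(f"need {bits_needed:.2f} bits; one bad solution gives {bits_from_one_bad:.5f}")
```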
So ask a series of “which of X and Y would you prefer that we do” questions. The demon always prefers the worse of the two, but is constrained to truthfully describe its preferences. Each answer is only a single bit of data, but it’s a really useful bit.
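Here’s a toy Python sketch of how far those bits go (the plan names, harm scores, and the demon_prefers oracle are all invented for illustration): the demon’s truthful pairwise answers act as a comparison function, so you can fully rank n candidates in roughly n log n questions, and the demon’s least favorite is your best option.

```python
from functools import cmp_to_key

# Stand-in for the demon's hidden, human-congruent harm ranking.
# We never see this directly; we only see the pairwise answers.
_hidden_harm = {"plan_a": 3, "plan_b": 9, "plan_c": 1, "plan_d": 7}

def demon_prefers(x: str, y: str) -> str:
    """Truthful oracle: the demon prefers whichever option does more harm."""
    return x if _hidden_harm[x] > _hidden_harm[y] else y

def rank_by_demon(options: list[str]) -> list[str]:
    """Sort options from most- to least-preferred by the demon,
    using only O(n log n) pairwise 'which would you prefer' queries."""
    cmp = lambda x, y: -1 if demon_prefers(x, y) == x else 1
    return sorted(options, key=cmp_to_key(cmp))

ranking = rank_by_demon(["plan_a", "plan_b", "plan_c", "plan_d"])
print(ranking)      # ['plan_b', 'plan_d', 'plan_a', 'plan_c']
print(ranking[-1])  # 'plan_c': the demon's least favorite, i.e. our best bet
```

Any comparison sort works here; the demon is just standing in as the comparator.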
Actually, I can think of another loophole. Just ask the demon to do X in a manner that causes, by the demon’s own standards, the least harm. Because it is stipulated that the demon always wants to do things that cause the most harm by human standards, it follows that the demon must have a concept of “harm” that is congruent with human standards. The demon is not only a malevolent genie, it’s a consistently malevolent genie, and you can take advantage of this.
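A minimal sketch of why the loophole works, assuming the congruence stipulation and some made-up harm scores: because the demon’s “harm” matches ours, swapping “maximize” for “minimize” in the request swaps worst-for-us for best-for-us.

```python
# Hypothetical ways of carrying out request X, scored by human harm.
ways_to_do_x = {"way_1": 8.0, "way_2": 0.5, "way_3": 4.0}

def demon_harm(way: str) -> float:
    # The congruence stipulation: the demon's concept of harm matches ours.
    return ways_to_do_x[way]

def do_x_unconstrained() -> str:
    """Plain request 'do X': the demon maximizes harm."""
    return max(ways_to_do_x, key=demon_harm)

def do_x_minimizing_demons_own_harm() -> str:
    """'Do X so that, by your own standards, it causes the least harm.'"""
    return min(ways_to_do_x, key=demon_harm)

print(do_x_unconstrained())               # way_1, worst for us
print(do_x_minimizing_demons_own_harm())  # way_2, best for us
```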
It may seem that we have not really stipulated that the demon ranks everything by human standards, just that the demon’s topmost preference is the one ranked worst by human standards. However, you can ask the demon to “do X in a way that is not (topmost preference)”, and by stipulation it will still do the most harm it can within that constraint, which implies that the demon’s second preference is also ranked by human standards; by induction, all of the demon’s preferences are ranked by human standards.
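You can play the induction out mechanically: keep asking “do X in a way that is not any of the answers so far”, and the revealed sequence is the demon’s entire preference order. In a toy sketch (options and harm numbers invented), it comes out exactly as human harm in descending order:

```python
# Hypothetical options scored by human harm; the demon's stipulated
# behavior is 'the most harmful thing not yet ruled out'.
human_harm = {"opt_a": 2, "opt_b": 9, "opt_c": 5, "opt_d": 7}

def demon_answer(excluded: set[str]) -> str:
    """The demon's choice when told 'do X, but not any of excluded'."""
    remaining = [o for o in human_harm if o not in excluded]
    return max(remaining, key=human_harm.get)

revealed: list[str] = []
while len(revealed) < len(human_harm):
    revealed.append(demon_answer(set(revealed)))

print(revealed)  # ['opt_b', 'opt_d', 'opt_c', 'opt_a'], human harm descending
```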
This can break if the demon’s own standards are simply the inverse of a human’s, so that in doing the most harm by human standards it is also doing the least harm by its own. If so, just ask it for the thing that causes the most harm by its standards instead.
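That fix is just a sign flip. A self-contained toy sketch of the inverted case (numbers invented as before):

```python
# Hypothetical ways of doing X, scored by HUMAN harm.
ways_to_do_x = {"way_1": 8.0, "way_2": 0.5, "way_3": 4.0}

def inverted_demon_harm(way: str) -> float:
    # This demon's standards are the exact opposite of ours.
    return -ways_to_do_x[way]

# 'Do X in the way that causes the most harm by your own standards':
choice = max(ways_to_do_x, key=inverted_demon_harm)
print(choice)  # way_2, the least harmful option by human standards
```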
(If you’re wondering what happens if the demon picks the definition of “the demon’s standards” that it prefers, it can’t actually do that. One of the choices would be a lie, and the demon is a non-lying genie, not a lying-if-plausible-deniability genie.)