This paper frames the problem as “look at a program and figure out whether it will be harmful” and correctly observes that there is no way to solve that problem with perfect accuracy if the programs being analysed are arbitrary. But its arguments have nothing to say about, e.g., whether there’s some way of preventing harm as it’s about to happen; nor about whether it is possible to construct a program that provably does something useful without harming humans.
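To see the shape of the argument, here is the standard diagonalization trick behind such undecidability claims, sketched in Python; the decider and the harmful action are illustrative placeholders, not anything from the paper:

```python
# Sketch of the diagonal argument behind the undecidability claim.
# Given ANY candidate harm-decider, we can build a program on which
# that decider gives the wrong answer, so no decider is correct on
# all arbitrary programs. All names here are illustrative.

def make_diagonal(is_harmful):
    """Build a program that does the opposite of whatever the
    claimed decider predicts about it."""
    def diagonal():
        if is_harmful(diagonal):
            return "did nothing harmful"    # predicted harmful -> acts safely
        return "pressed the red button"     # predicted safe -> acts harmfully
    return diagonal

# Example: a decider that claims nothing is harmful is refuted by
# its own diagonal program.
def optimistic_decider(program):
    return False

d = make_diagonal(optimistic_decider)
print(d())  # "pressed the red button" -- the decider was wrong about d
```

Note that this only rules out a single decider that works on arbitrary programs; it says nothing about restricted classes of programs, which is the gap the rest of this comment is about.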
E.g., imagine a world where it is known that the only way to harm humans is to press a certain big red button labelled “Harm the Humans”. The arguments in this paper show that there is no general procedure for deciding whether a computer with the ability to press this button will do so. But they don’t rule out the possibility that you can make a useful machine with no access to the button, or a useful machine with a little bit of hardware in it that blows it up if it gets too close to the button.
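As a toy sketch of that second possibility, assuming (unrealistically, as the parenthetical below notes) that the button really is the only channel of harm, and with all names made up for illustration:

```python
# Toy picture of "a useful machine with no access to the button":
# the agent only ever receives an interface that lacks the harmful
# action, so no sequence of calls it makes can cause harm.

class World:
    def compute(self, x):
        return x * x                             # the useful, harmless action

    def press_button(self):
        raise RuntimeError("Harm the Humans")    # the one harmful action

class SafeFacade:
    """Exposes only the harmless subset of World's interface."""
    def __init__(self, world):
        self._world = world

    def compute(self, x):
        return self._world.compute(x)
    # no press_button method: the agent cannot even name the action

def agent(api):
    # Arbitrary, unanalysed agent code; it can only call what `api` exposes.
    return api.compute(7)

print(agent(SafeFacade(World())))  # 49, with no path to the button
# (In real Python the agent could still reach api._world, so read this
# as a picture of a hardware interlock, not a secure sandbox.)
```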
(There are reasons to be concerned about such machines because in practice you probably can’t causally isolate them from the button in the way required. The paper’s introductory material discusses some such reasons. But they play no role in the technical argument of the paper, at least on the cursory reading I’ve given it.)
I think it is difficult, but perhaps possible, to create a superintelligent program that will provably do some formally specified thing.
But the main problem is that we can’t formally specify what “harming humans” means. Or perhaps we can, but then we can’t be sure the definition is safe.
So we end up with a kind of circularity: we could prove that the machine will do X, but we can’t prove that X is actually good and safe.
We may try to shift the burden of proof to the machine: we must then prove that it will prove that X is really good and safe. I have doubts about the computability of this task.
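One way to make that worry concrete, assuming (my framing, not the paper’s) that the machine’s proofs are carried out in some fixed formal system $T$ strong enough for arithmetic: a machine-produced proof of “X is good and safe” in $T$ only moves our trust from X to $T$, and by Gödel’s second incompleteness theorem

$$T \nvdash \mathrm{Con}(T),$$

so the machine cannot use $T$ to certify $T$ itself; the regress has to bottom out in a system we accept without a machine-checked proof.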
That is why I am generally skeptical of the idea of a mathematical proof of AI safety. It doesn’t provide 100 percent safety, because the proof can have holes in it, and the task is too complex to be solved in time.
This is a real and important difficulty, but it isn’t what the paper is about: the authors assume one can always readily tell whether people are being harmed.