Obviously, this approach would rule out a large number of use cases for AI in which it could make things much better but risks hurting someone else. It is thus not a fully general solution, though it could still be useful for many other applications.
Another, more pressing problem is that a tool AI is unlikely to be sophisticated enough to estimate how likely its actions actually are to harm humans. For instance, a language model cannot conceive of the harm its words could cause (assuming language models ever advance enough to pose a real danger).
This seems more like a way to make an AGI that acts like a tool AI. It could be interesting to attempt if the time comes, but there is a disconnect: giving extra capabilities to an AI in order to reduce the risk that it does something unexpected seems like a plan with a reasonably high chance of failure.
For an actual tool AI, it seems we would be stuck with whatever narrow rules humans can come up with for it, and 'harm' is very difficult to define in rules. (A utilitarian writing the rules might well define 'harm' as causing negative utility to any human and have the AI check for that, but that definition is not widely agreed upon, especially by deontologists, and it still leaves the problem of not knowing the values of the people being affected.)
I do agree with the sentiment that it is better to have humanity itself be the risk case, rather than an alien intelligence we don't know how to deal with. (I am personally very skeptical that there is any real danger of super-powerful AI any time soon, though.)