I haven’t thought this through very deeply, but couldn’t the working of the machine be bounded by hard safety constraints that the AI was not allowed to change, rather than trying to work safety into the overall utility function?
For example: the AI is not allowed to construct more resources for itself. However inefficient that may be, it has to ask a human for more hardware and then wait for humans to install it.
What I would want of a super-intelligent AI is more or less what I would want of a human who has power over me: don’t do things to me or my stuff without asking, don’t coerce me, don’t lie to me.
I don’t know how you would code all that, but if we can’t design simple constraints, we can’t correctly design more complex ones. I’m thinking layers of simple constraints would be safer than one unprovably-friendly utility function.
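Very roughly, I’m picturing something like the sketch below. The action names and the individual checks are just illustrative placeholders I made up, not a real proposal; the point is only the shape of it: a fixed outer layer of small, independently reviewable predicates that the planner cannot touch.

```python
# Rough sketch only: hypothetical action/constraint types, not a real design.

from dataclasses import dataclass


@dataclass
class Action:
    kind: str           # e.g. "compute", "acquire_hardware", "modify_self"
    target: str = ""    # what the action operates on


# Each constraint is a simple, independently reviewable predicate.
def no_self_modification(action: Action) -> bool:
    return action.kind != "modify_self"


def no_hardware_acquisition(action: Action) -> bool:
    # Hardware changes must go through a human request channel instead.
    return action.kind != "acquire_hardware"


def no_unconsented_interference(action: Action) -> bool:
    return not (action.kind == "modify" and action.target.startswith("human:"))


HARD_CONSTRAINTS = [
    no_self_modification,
    no_hardware_acquisition,
    no_unconsented_interference,
]


def permitted(action: Action) -> bool:
    """An action is allowed only if every layer of constraints approves it."""
    return all(check(action) for check in HARD_CONSTRAINTS)


# The planner proposes; the fixed outer layer disposes.
if __name__ == "__main__":
    print(permitted(Action("compute", "dataset-1")))          # True
    print(permitted(Action("acquire_hardware", "gpu-rack")))  # False
```

Each layer here is trivial to review in isolation, which is exactly what a monolithic utility function doesn’t give you.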
There’s been a great deal of discussion here on such issues already; you might be interested in some of what’s been said. As for your initial point, I think you have a model of a ghost in the machine who you need to force to do your bidding; that’s not quite what happens.
Nope, I’m a software engineer, so I don’t have that particular magical model of how computer systems work.
But suppose you were designing even a conventional software system, to run something less dangerous than a general AI, say a nuclear reactor. Would you have every valve and mechanism controlled only by the software, with no mechanical failsafes or manual overrides, trusting the software to have no deadly flaws?
Designing that software and proving it correct in every possible circumstance would be a rich and interesting research topic, but one that would never be completed.
One difference with an AI is that it is, in principle, capable of analyzing your failsafes and overrides (and their hidden flaws) more thoroughly than you can. Manual, physical overrides aren’t yet amenable to rigorous formal analysis, but software is. If we use a logic to prove constraints on the AI’s behavior, then the AI shouldn’t be able to violate those constraints without essentially exploiting an inconsistency in the logic itself, which seems far less likely than, say, finding a bug in the overrides or tricking the humans into sabotaging them.
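To make that concrete, here’s a toy illustration in Lean of what “proving a constraint in a logic” buys you. The three-action world and the single constraint are made up and nothing like a real AI constraint; the point is that the safety property is a theorem quantified over every possible input, so violating it would require a flaw in the logic itself rather than an unanticipated edge case in an override.

```lean
-- Toy illustration only: a made-up three-action world and one trivially
-- small constraint, just to show the shape of a machine-checked guarantee.

inductive Action where
  | compute
  | askHuman
  | acquireHardware

-- The hard constraint: acquiring hardware directly is never permitted.
def permitted : Action → Bool
  | Action.acquireHardware => false
  | _ => true

-- A guard wrapping the planner's output: anything forbidden becomes a
-- request to a human instead.
def guardAction (a : Action) : Action :=
  if permitted a then a else Action.askHuman

-- The safety property is a theorem about every possible input,
-- not a test suite or a physical interlock.
theorem guardAction_safe (a : Action) : permitted (guardAction a) = true := by
  cases a <;> rfl
```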
What makes you think these constraints are at all simple?