I remember reading the argument in one of the sequence articles, but I’m not sure which one. The essential idea is that any such rules just become a problem for the AI to solve, so relying on a superintelligent, recursively self-improving machine being unable to solve a problem is not a very good idea (unless the failsafe mechanism were provably impossible to defeat, I suppose. But here we’re pitting human intelligence against superintelligence, and I, for one, wouldn’t bet on the humans). The more robust approach seems to be to make the AI motivated not to do whatever the failsafe was designed to prevent it from doing in the first place, i.e. Friendliness.