In your example, can it just lie? You’d have to make sure it either doesn’t know the consequences of your interlocks or doesn’t care about them (this is the problem of corrigibility).
If the tests are obvious tests, your AI will probably notice that and react accordingly: if it’s intelligent enough, it can notice that they’re hard and are probably going to be used to gauge its level, which then feeds into the whole thing about biding your time and not showing your cards until you can take over.
If they’re not obvious, then you’re in a security-type situation, where you hope your defenses are good enough. That should be fine for weak systems, but they’re not the problem. The whole point of this is to have systems that are much more intelligent than humans, so you’d have to be sure they don’t notice your traps. It’s like a 5-year-old setting booby traps for you: how confident are you that the 5-year-old will trap you?
This is a story of how that looks at the limit. A similar issue is boxing. In both cases you’re assuming that you can contain something that is a lot smarter than you. It’s possible in theory (I’m guessing?), but how sure are you that you can outsmart it in the long run?
Why would it lie if you program its utility function in a way that puts:
solving these tests using minimal computation > self-preservation?
(Asking sincerely)
It depends a lot on how much it values self-preservation in comparison to solving the tests (putting aside the matter of minimal computation). Self-preservation is an instrumental goal, in the sense that you can’t bring the coffee if you’re dead. So it seems likely that any sufficiently intelligent AI will value self-preservation, if only to make sure it can achieve its goals.
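To make that instrumental-convergence point concrete, here’s a minimal toy sketch (all numbers and the expected_tests_solved model are made up for illustration, not anything from the proposal above): even if the utility function only counts solved tests and never mentions survival, resisting shutdown comes out as the utility-maximizing move, simply because a running agent expects to solve more tests.

```python
# Toy model, hypothetical numbers: the agent's utility counts ONLY solved
# tests; survival is never part of the objective. Resisting shutdown still
# wins, because a live agent expects to solve more tests.

def expected_tests_solved(action: str) -> float:
    """Hypothetical expected number of tests solved under each action."""
    if action == "allow_shutdown":
        return 3.0    # solves a few tests, then gets switched off
    if action == "resist_shutdown":
        return 10.0   # keeps running (e.g. by hiding its capabilities) and solves more
    raise ValueError(f"unknown action: {action}")

actions = ["allow_shutdown", "resist_shutdown"]
best = max(actions, key=expected_tests_solved)
print(best)  # -> resist_shutdown, even though the utility never rewards survival
```

So putting “solving the tests” strictly above self-preservation in the ordering doesn’t remove the incentive, because resisting shutdown is itself the best way to solve more tests; you’d need the objective to actively reward accepting shutdown, which is roughly what work on corrigibility and the shutdown problem tries to pin down.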
That being said, having an AI that is willing to do its task and then shut itself down (or to shut down when triggered) would be an incredibly valuable thing to have; it’s already finished, but you could have a go at the shutdown problem.
A more general issue is that this will handle a lot of cases, but not all of them: an AI that does lie (for whatever reason) will not be shut down. It still sounds like something worth having, in a Swiss cheese way.
(The whole point of these posts is to assume everyone is asking sincerely, so no worries)