A simple idea for AI security that will not solve alignment but should easily prevent FOOM and most catastrophic outcomes is using safety interlocks for AIs.
A “safety interlock” is a device that prevents the system from reaching a dangerous state. It is typically used in machinery or industrial processes where certain conditions need to be met before the system can operate.
In a microwave, the door includes a safety interlock system that prevents the microwave from operating if the door is open. When you open the door, the interlock interrupts the power supply to the magnetron.
In AI, it might look like this: you create a list of tests that the AI must fail in order to stay activated. This list of tests represents the upper boundary of allowed AI capabilities. The AI is tested at short intervals, with its utility function giving a high priority to trying to pass these tests, and as soon as it passes one of them, it is shut down automatically. Testing at short intervals means it won't have time to self-improve vastly above this threshold between tests.
To give a more specific example, an LLM could be asked, every few seconds, to solve a hard physics question that wasn't included in its training data, and be shut down as soon as it can solve it.
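To make the mechanism concrete, here is a minimal sketch of what such an interlock loop might look like. Everything in it is assumed for illustration: the probe set, the grading, and the `query_model` / `shut_down` hooks are placeholders, not part of any existing system.

```python
import time

# Hypothetical capability probes: held-out problems the AI is required to FAIL.
# Solving any of them means it has crossed the allowed capability ceiling.
CAPABILITY_PROBES = [
    {"question": "Hard, held-out physics problem #1 ...", "answer": "42.0"},
    {"question": "Hard, held-out physics problem #2 ...", "answer": "1.6e-19"},
]

CHECK_INTERVAL_SECONDS = 5  # short enough that it can't self-improve far between checks


def probe_passed(response: str, expected: str) -> bool:
    """Crude grading stand-in; a real setup would need a far more careful grader."""
    return response.strip() == expected


def interlock_loop(query_model, shut_down, max_checks=None) -> None:
    """Repeatedly probe the model and trip the shutdown as soon as any probe is solved."""
    checks = 0
    while max_checks is None or checks < max_checks:
        for probe in CAPABILITY_PROBES:
            if probe_passed(query_model(probe["question"]), probe["answer"]):
                shut_down()
                return
        checks += 1
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    # Toy demo: a "model" that never solves the probes, so the interlock never trips.
    interlock_loop(
        query_model=lambda question: "I don't know",
        shut_down=lambda: print("Interlock tripped: shutting down."),
        max_checks=3,
    )
```

Even at this level of sketch, the load-bearing assumptions are visible: the loop only works if the model answers the probes honestly and if the probe set actually tracks the capabilities you care about.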
This idea seems very simple and straightforward to me, and I don't see any glaring issues with it, yet it doesn't seem to be researched or considered seriously as a robust safety solution (to the best of my knowledge).
My question is, what are the issues with this idea? And why doesn’t it solve most of the problems with AI safety?
In your example, can it just lie? You'd have to make sure it either doesn't know the consequences of your interlocks or doesn't care about them (this is the problem of corrigibility).
If the tests are obvious tests, your AI will probably notice that and react accordingly. If it has enough intelligence, it can notice that they're hard and are probably going to be used to gauge its level, which then feeds into the whole thing about biding your time and not showing your cards until you can take over.
If they're not obvious, then you're in a security-type situation, where you hope your defenses are good enough. That should be fine for weak systems, but they're not the problem. The whole point of this is to have systems that are much more intelligent than humans, so you'd have to be sure they don't notice your traps. It's like a 5-year-old setting up booby traps for you: how confident are you that the 5-year-old will trap you?
This is a story of how that looks at the limit. A similar issue is boxing. In both cases you’re assuming that you can contain something that is a lot smarter than you. It’s possible in theory (I’m guessing?), but how sure are you that you can outsmart it in the long run?
Why would it lie if you program its utility function in a way that puts:
solving these tests using minimal computation > self-preservation?
(Asking sincerely)
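To spell out the ordering I have in mind: it's lexicographic, so the test-related term always dominates and survival only ever acts as a tie-breaker. A toy illustration (purely hypothetical; no real system exposes a utility function you can just set like this):

```python
def utility(outcome: dict) -> tuple:
    """Illustrative lexicographic preference: honest test-solving dominates survival.

    Python compares tuples element by element, so the second component
    (self-preservation) only matters when the first component is tied.
    """
    return (
        1 if outcome["solved_tests_with_minimal_computation"] else 0,  # dominant term
        1 if outcome["survived"] else 0,                               # tie-breaker only
    )


# Under this ordering, solving the test honestly and being shut down is strictly
# preferred to hiding the answer and surviving:
assert utility({"solved_tests_with_minimal_computation": True, "survived": False}) > \
       utility({"solved_tests_with_minimal_computation": False, "survived": True})
```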
It depends a lot on how much it values self-preservation in comparison to solving the tests (putting aside the matter of minimal computation). Self-preservation is an instrumental goal, in that you can't bring the coffee if you're dead. So it seems likely that any sufficiently intelligent AI will value self-preservation, if only to make sure it can achieve its goals.
That being said, having an AI that is willing to do its task and then shut itself down (or to shut down when triggered) is an incredibly valuable thing to have. It's already finished, but you could have a go at the shutdown problem.
A more general issue is that this will handle a lot of cases, but not all of them: an AI that does lie (for whatever reason) will not be shut down. It sounds like something worth having, in a Swiss cheese sort of way.
(The whole point of these posts is to assume everyone is asking sincerely, so no worries)