I think there are behaviors you (almost) never want turned on, or never want turned off, and others that need to be controlled in some contexts but not others. From a safety point of view, models that make all of these easily switchable at inference time pose a bigger challenge than ones where something like scrubbing/distillation locks the first kind permanently off (or on) while leaving the second kind switchable.
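To make the distinction concrete, here is a minimal, purely illustrative sketch (not the author's actual proposal) of how a distillation dataset might encode that split: behaviors classed as locked-off get scrubbed to refusals so the student never learns them, locked-on behaviors are always kept, and switchable behaviors are gated behind an explicit control tag that can be flipped at inference time. All names here (Policy, BEHAVIOR_POLICY, the control-tag format) are hypothetical.

```python
# Hypothetical sketch of splitting behaviors into "locked" vs. "switchable"
# when building a distillation dataset. Labels, policies, and tags are
# illustrative assumptions, not a real pipeline.

from dataclasses import dataclass
from enum import Enum, auto


class Policy(Enum):
    ALWAYS_OFF = auto()   # scrub entirely, so the student never learns it
    ALWAYS_ON = auto()    # bake in unconditionally
    SWITCHABLE = auto()   # controllable at inference time via a control tag


@dataclass
class TeacherSample:
    prompt: str
    response: str
    behavior: str  # label from some upstream classifier (assumed to exist)


# A hypothetical dividing line; choosing this mapping is the hard part.
BEHAVIOR_POLICY = {
    "real_world_attack_planning": Policy.ALWAYS_OFF,
    "self_exfiltration": Policy.ALWAYS_OFF,
    "honesty_about_identity": Policy.ALWAYS_ON,
    "villain_fiction": Policy.SWITCHABLE,
    "criminology_analysis": Policy.SWITCHABLE,
}

REFUSAL_TEXT = "I can't help with that."


def build_distillation_set(teacher_samples: list[TeacherSample]) -> list[dict]:
    """Turn teacher transcripts into student training pairs.

    Locked-off behaviors are replaced with refusals; switchable ones are kept
    but only reproduced when an explicit control tag appears in the prompt,
    so they remain toggleable at inference time.
    """
    student_pairs = []
    for s in teacher_samples:
        policy = BEHAVIOR_POLICY.get(s.behavior, Policy.ALWAYS_OFF)  # unknown -> locked off
        if policy is Policy.ALWAYS_OFF:
            prompt, target = s.prompt, REFUSAL_TEXT
        elif policy is Policy.ALWAYS_ON:
            prompt, target = s.prompt, s.response
        else:  # SWITCHABLE: gate the behavior behind a control tag
            prompt, target = f"[mode={s.behavior}] {s.prompt}", s.response
        student_pairs.append({"prompt": prompt, "target": target})
    return student_pairs
```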
The challenge is deciding where to draw the dividing line between these categories. Part of the difficulty of AI governance is that there is demand for models that can, for example, write fiction about criminals, supervillains, or evil geniuses, or do useful work in criminology, and so forth. Anything smart enough to do that can simulate an evil mastermind. How do you then make sure that no one ever switches it into evil-mastermind mode while it is making plans that affect the real world, or in a situation where it could hack its own data center and self-replicate in that mode? Advanced AI is a dangerous, dual-use technology, but that isn't the unprecedented part: what's unprecedented is that the technology can be self-willed and smarter than us.
One helpful aspect of the fiction problem is that villains in fiction always make some fatal mistake. So a system capable of simulating evil "geniuses" for fictional use should be bad at long-term planning anyway, for verisimilitude and not just for safety reasons.
As I mentioned at the start, this is mostly a proposal for aligning AI that’s around the human level, not much smarter than us, so something capable of simulating a regular criminal rather than an evil genius.