The <good> <bad> thing is really cool, although it leaves open the possibility of a bug (or leaked weights) causing the creation of a maximally misaligned AGI.
As long as it’s carefully boxed, there are situations in which being able to reliably replicate specific misaligned behavior can be useful, such as when testing other alignment measures or doing interpretability. But yes, open-sourcing the weights of a model that had, say, a <criminality> tag trained into it, thereby allowing its use by anyone, including criminals who’d turn that tag on, would seem unwise. Possibly one could apply some sort of causal scrubbing or distillation process to produce a model whose behavior is equivalent to the original with <criminality> permanently turned off (and which would therefore be bad at writing crime fiction); that might then be safe to open-source. AI governance mechanisms will still be necessary.
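Roughly what I have in mind for the distillation route, as a purely illustrative sketch: every teacher generation is sampled with the control tag pinned to its safe value, and the resulting corpus (with the tag stripped out) is then used for ordinary supervised fine-tuning of a student that never sees the tag at all. The checkpoint name, the tag format, and the prefix-stripping step are all assumptions for illustration, not an existing setup.

```python
# Hypothetical sketch: build a distillation corpus from a conditionally
# trained "teacher" whose behavior is steered by control tags such as
# <criminality=0/1>. The tag is hard-coded to the safe value, so the
# distilled "student" never learns the switchable behavior at all.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "org/conditionally-trained-teacher"   # hypothetical checkpoint
LOCKED_PREFIX = "<criminality=0>"               # tag pinned to "off"

tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER)

def locked_completion(prompt: str, max_new_tokens: int = 256) -> str:
    """Sample the teacher with the control tag fixed to its safe value."""
    inputs = tok(LOCKED_PREFIX + prompt, return_tensors="pt")
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    text = tok.decode(out[0], skip_special_tokens=True)
    # Strip the control prefix so the student's training data contains no tags.
    return text.replace(LOCKED_PREFIX, "", 1)

# These (prompt, completion) pairs would then be used for ordinary supervised
# fine-tuning of a student model with no <criminality> tag in its vocabulary.
distill_pairs = [(p, locked_completion(p)) for p in ["Write a heist story."]]
```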
I think it’s risky to have a simple waluigi switch that can be turned on at inference time. Not sure how risky.
I think there are behaviors you (almost) never want to turn on, or off, and others that need to be controlled in some contexts but not others. From a safety point of view, models that make all of these easily switchable at inference time pose a bigger challenge than ones where something like scrubbing/distillation locks the former permanently off (or on) while leaving the latter switchable.
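To make the split concrete, here’s a minimal serving-side sketch of that policy: some tags are locked to fixed values (ideally removed entirely by scrubbing/distillation) and cannot be overridden by a request, while others remain switchable per request. The tag names and allowed values are illustrative assumptions, not any real model’s interface.

```python
# Hypothetical policy layer: locked control tags always win; only
# whitelisted switchable tags can be set per request.

LOCKED_TAGS = {"criminality": "0", "deception": "0"}   # never user-controllable
SWITCHABLE_TAGS = {"formality": {"low", "high"},        # safe to toggle
                   "verbosity": {"terse", "verbose"}}

def build_control_prefix(requested: dict) -> str:
    """Turn a user's requested tag settings into a safe control prefix."""
    settings = dict(LOCKED_TAGS)  # locked values are fixed up front
    for tag, value in requested.items():
        if tag in LOCKED_TAGS:
            continue  # ignore attempts to flip a locked behavior
        if tag in SWITCHABLE_TAGS and value in SWITCHABLE_TAGS[tag]:
            settings[tag] = value
    return "".join(f"<{tag}={value}>" for tag, value in sorted(settings.items()))

# A request trying to switch on a locked tag is silently overridden:
print(build_control_prefix({"criminality": "1", "verbosity": "terse"}))
# -> <criminality=0><deception=0><verbosity=terse>
```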
A challenge is where to draw the dividing line between the two. Part of the difficulty of AI governance is that there is demand for models that can, for example, write fiction about criminals, supervillains, or evil geniuses, or do useful work in criminology, and so forth. Anything smart enough to do that can simulate an evil mastermind. How do you then make sure that no one ever switches it into evil-mastermind mode while it’s making plans that affect the real world, or in a situation where it could hack its own data center and self-replicate in that mode? Advanced AI is a dangerous, dual-use technology, but that’s not the aspect of it that’s unprecedented: it’s that the technology can be self-willed and smarter than us.
One helpful aspect of the fiction problem is that villains in fiction always make some fatal mistake. So a system capable of simulating evil “geniuses” for fictional use should be bad at long-term planning anyway, not just for safety reasons.
As I mentioned at the start, this is mostly a proposal for aligning AI that’s around the human level, not much smarter than us, so something capable of simulating a regular criminal rather than an evil genius.