I think the weakness of the KL-divergence penalty is that the potentially harmful model can do things the safe model would be exponentially unlikely to do. Even if the safe model has a 1 in 1 trillion chance of stabbing me in the face, the KL penalty for stabbing me in the face is only log(1 trillion) (and logs make even huge numbers small).
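To put a number on that (just the arithmetic; the exact cost depends on how the KL term is weighted against reward):

```python
import math

p_safe = 1e-12                  # safe model's chance of the catastrophic action
penalty = math.log(1 / p_safe)  # per-step log-prob penalty if the tuned model takes it
print(penalty)                  # ~27.6 nats -- on the order of a modest reward bonus
```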
What about limiting the unknown model to choosing among the safe model's most likely actions, up to a cumulative 98% of its probability mass? If actions that would kill you always sit in that excluded 2% tail (say the safe model never gives them more than a ~1% chance), then the unknown model won't be able to take an action that kills you at all. This isn't terribly different from the top-k / nucleus (top-p) sampling many language models use in practice.
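Here's a minimal sketch of that restriction over a discrete action set, assuming we can read out the safe model's probabilities (the names `safe_probs`/`unknown_probs` and the 0.98 cutoff are purely illustrative):

```python
import numpy as np

def allowed_actions(safe_probs: np.ndarray, mass: float = 0.98) -> np.ndarray:
    """Smallest set of actions covering `mass` of the safe model's probability."""
    order = np.argsort(safe_probs)[::-1]      # actions, most likely first
    cum = np.cumsum(safe_probs[order])
    cutoff = np.searchsorted(cum, mass) + 1   # how many actions to keep
    return order[:cutoff]

def constrained_choice(unknown_probs: np.ndarray, safe_probs: np.ndarray) -> int:
    """Let the unknown model pick its favourite action, but only among
    actions the safe model would plausibly take."""
    allowed = allowed_actions(safe_probs)
    masked = np.full_like(unknown_probs, -np.inf)
    masked[allowed] = unknown_probs[allowed]
    return int(np.argmax(masked))

# The safe model puts ~1e-12 on action 4 ("stab"), so it never enters the
# allowed set, no matter how much the unknown model wants it.
safe = np.array([0.60, 0.25, 0.10, 0.05, 1e-12])
unknown = np.array([0.01, 0.01, 0.01, 0.01, 0.96])
print(constrained_choice(unknown, safe))  # -> one of actions 0-3, never 4
```

The key difference from a KL penalty is that the cutoff is hard: an action outside the safe model's 98% mass is simply unavailable at any reward, rather than merely costing ~27 nats.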
I think this is the sort of conversation we should be having! [Side note: I think restricting compute is more effective than restricting research because you don't need 100% buy-in.
It's easier to prevent people from manufacturing semiconductors than to keep them from learning ideas that fit on a napkin.
It’s easier to prevent scientists in Eaccistan from having GPUs than to prevent scientists in Eaccistan from thinking.
The analogy to nuclear weapons is, I think, a good one. The science behind nuclear weapons is well known; what keeps them from being built is access to fissile material.
(Restricting compute also seriously restricts research. Research speed on neural nets is largely bounded by how many experiments you can run, rather than by how many ideas you have.)]