I think the weakness with a KL divergence penalty is that the potentially harmful model can do things the safe model would be exponentially unlikely to do. Even if the safe model has only a 1-in-1-trillion chance of stabbing me in the face, the KL penalty for stabbing me in the face is only about log(10^12) ≈ 28 nats (and logs make even huge numbers small).
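To make that concrete, here's a toy calculation (Python, with a made-up 10^-12 probability) showing how little the KL term actually costs:

```python
import math

# Probability the safe reference model assigns to a catastrophic action.
p_safe = 1e-12

# If the optimized model puts essentially all of its probability mass on that
# action, its contribution to the KL divergence is roughly -log(p_safe).
kl_penalty = -math.log(p_safe)

print(f"KL penalty: {kl_penalty:.1f} nats")  # ~27.6 nats
```

A modest amount of expected reward is enough to pay for that.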
What about limiting the unknown model to choosing among the safe model's most likely actions, up to a cumulative 98% of its probability mass? If the safe model never puts more than a 1% chance on an action that would kill you, then the unknown model won't be able to take an action that kills you. This isn't terribly different from the top-p (nucleus) sampling many language models use in practice.
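A minimal sketch of what that restriction could look like (NumPy; the `nucleus_mask` helper and the 98% threshold are illustrative choices, not an existing API):

```python
import numpy as np

def nucleus_mask(safe_probs: np.ndarray, p: float = 0.98) -> np.ndarray:
    """Boolean mask of actions inside the safe model's top-p nucleus.

    Actions are added in order of decreasing probability under the safe model
    until their cumulative probability reaches p; everything else is disallowed,
    no matter how much the optimized model prefers it.
    """
    order = np.argsort(safe_probs)[::-1]            # most likely first
    cumulative = np.cumsum(safe_probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest prefix reaching p
    mask = np.zeros_like(safe_probs, dtype=bool)
    mask[order[:cutoff]] = True
    return mask

# Toy example: action 3 is the harmful one the safe model gives prob 1e-12.
safe_probs = np.array([0.6, 0.3, 0.1 - 1e-12, 1e-12])
print(nucleus_mask(safe_probs))  # [ True  True  True False ] -- harmful action masked out
```

The unknown model would then only be allowed to renormalize and choose within the masked set.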