I agree that one major mitigation is to keep training your policy online, but that doesn’t necessarily prevent a misaligned policy from taking over the world before the training has time to fix its mistakes. In particular, if the policy is reasoning “I’ll behave well until the moment I strike”, and your reward function can’t detect that (it only detects whether the output was good), then the policy will look great until the moment it takes over.